Airflow Takes Over Gallery-DL Download Tasks

Yesterday, while using Gallery-DL to batch-download images, I felt it still wasn't convenient enough. I remembered using Airflow at work to back up data for my own systems; it could send the backup results via Lark, which made the whole job worry-free.

Actually, repetitive tasks like batch image downloads can also be run in Airflow. It's flexible and customizable. The process is documented below.

1. Installing Airflow

Airflow is now an ASF (Apache Software Foundation) project; the world changes so fast!

The following operations are performed on Arch Linux:

Creating a Virtual Environment

Create and activate a virtual environment in the ~/airflow directory.

➜ airflow pwd
/home/mephisto/airflow
➜ airflow uv venv
➜ airflow source .venv/bin/activate

Install Airflow according to the official documentation

Here, we're installing the latest version, 3.1.5. Released just three weeks ago, it's fresh off the press.

AIRFLOW_VERSION=3.1.5
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
uv pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
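After the install finishes, a quick sanity check should print the pinned version:

➜ airflow airflow version
3.1.5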

Running in standalone mode

airflow standalone

Open localhost:8080 in your browser and note the username and password printed in the command output; you'll need them to log in.

You can also change the user password yourself by directly modifying the simple_auth_manager_passwords.json.generated file.

➜ airflow cat simple_auth_manager_passwords.json.generated
{"admin": "admin"}

This means the username is admin and the password is admin, which is fine for a test environment.
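For example, to pick your own password, edit the JSON and restart airflow standalone so the change takes effect (the value below is just a placeholder):

➜ airflow cat simple_auth_manager_passwords.json.generated
{"admin": "my-new-password"}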

2. Writing the Airflow DAG

Let’s look directly at the example:

➜ airflow cat dags/download_twitter_media.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.standard.operators.bash import BashOperator
from airflow.models import Variable
import json

# Fetch the user list from an Airflow Variable (a JSON-formatted string)
USERS_JSON = Variable.get("twitter_users", default_var='["nasa"]')
try:
    USERS = json.loads(USERS_JSON)
except json.JSONDecodeError:
    raise ValueError("Variable 'twitter_users' must be a valid JSON array of strings.")

DOWNLOAD_BASE_PATH = Variable.get("twitter_download_path", default_var="~/Pictures/twitter")

default_args = {
    "owner": "data_team",
    "depends_on_past": False,
    "email_on_failure": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="download_twitter_multiple_users",
    default_args=default_args,
    description="Download media from multiple Twitter/X users using gallery-dl",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["social_media", "twitter", "multi_user"],
) as dag:

    prev_task = None  # run the tasks serially to avoid getting the IP banned

    # Dynamically generate one task per user
    for user in USERS:
        user = user.strip()
        if not user:
            print("no twitter user, set one at 'Airflow/admin/variables'")
            continue  # skip empty entries instead of building a broken task

        #mkdir_cmd = f"mkdir -p {DOWNLOAD_BASE_PATH}/{user}"

        # gallery-dl command (incremental download)
        gallery_dl_cmd = (
            f"gallery-dl "
            f"--cookies-from-browser firefox "
            f"--download-archive {DOWNLOAD_BASE_PATH}/{user}/archive.txt "
            f"https://x.com/{user}/media"
        )

        #create_dir = BashOperator(
        #    task_id=f"create_dir_{user}",
        #    bash_command=mkdir_cmd,
        #)

        download_task = BashOperator(
            task_id=f"download_{user}",
            bash_command=gallery_dl_cmd,
        )

        # Serial chaining: the current task runs only after the previous one finishes
        if prev_task:
            prev_task >> download_task
        prev_task = download_task

        # Dependency: create the directory first, then download
        #create_dir >> download_task

Today I added a download archive for incremental downloads. It's actually an SQLite database file; the filename can be specified.

➜ file archive.txt
archive.txt: SQLite 3.x database, last written using SQLite version 3051001, file counter 1, database pages 2, cookie 0x1, schema 4, UTF-8, version-valid-for 1
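Since it's plain SQLite, you can peek inside with the sqlite3 CLI; gallery-dl keeps one row per downloaded file in a table named archive, so a query like this lists a few entries:

➜ sqlite3 archive.txt "SELECT entry FROM archive LIMIT 5;"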

The twitter_users and twitter_download_path above are read from Airflow Variables; their types are a JSON array and a string, respectively.

Next time, I can modify the user list and target path directly in the Airflow web interface, which is very convenient and practical.
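If you prefer the terminal, the same Variables can also be set with the Airflow CLI (the values here are just examples):

➜ airflow airflow variables set twitter_users '["nasa", "SpaceX"]'
➜ airflow airflow variables set twitter_download_path ~/Pictures/twitter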

If you encounter a DAG import error, a message will appear on the Airflow homepage. Click to view the error and fix it.
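The same errors can also be listed from the terminal, which is handy for a quick check:

➜ airflow airflow dags list-import-errors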

3. Screenshot Example

A picture is worth a thousand words:

DAG

airflow_dag

Variable Settings

These are set entirely in the web UI.

airflow_variables_1

airflow_variables_2

Task Details

The screenshot is self-explanatory. Note the log on the right: it shows the same output you would get by running the command manually in a terminal.

airflow_task

This way, everything runs automatically every day. If you have a NAS, you can deploy a similar setup on it for automatic downloading and searching, and even add result notifications. The logic can be reshaped however you like: one DAG per user, parallel execution, and so on. I deliberately made it serial to avoid IP blocking.
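As a sketch of the notification idea: Airflow lets you attach callbacks to tasks, so a small function could post a summary to a Lark custom-bot webhook. The Variable name lark_webhook_url is my own assumption, and you should double-check the payload format against your bot's documentation:

import requests
from airflow.models import Variable

def notify_lark(context):
    # Assumed setup: an Airflow Variable "lark_webhook_url" holding the bot's webhook URL
    ti = context["task_instance"]
    text = f"{ti.dag_id}.{ti.task_id} finished with state: {ti.state}"
    # Payload shape used by Lark/Feishu custom bots for plain-text messages
    requests.post(
        Variable.get("lark_webhook_url"),
        json={"msg_type": "text", "content": {"text": text}},
        timeout=10,
    )

Attach it per task, e.g. BashOperator(..., on_success_callback=notify_lark), or put on_success_callback in default_args to cover every task in the DAG.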

Furthermore, gallery-dl supports many platforms and can download from almost all common content websites. There are plenty of similar tools, too; if you don't like it, find an alternative, preferably one with a strong community and frequent updates.
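To see exactly which sites your installed version supports, gallery-dl can list its extractors:

➜ gallery-dl --list-extractors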
