Airflow Takes Over Gallery-DL Download Tasks
Yesterday, while batch downloading images with gallery-dl, I still felt the process wasn't convenient enough. At work I used Airflow to back up data for the systems I was responsible for, and it sent me the backup results via Lark, so I no longer had to worry about them.
Actually, repetitive tasks like batch image downloads can also be run in Airflow. It's flexible and customizable. The process is documented below.
1. Installing Airflow
Airflow is now an ASF (Apache Software Foundation) project; the world changes so fast!
The following operations are performed on Arch Linux:
Creating a Virtual Environment
Create and activate a virtual environment in the ~/airflow directory.
➜ airflow pwd
/home/mephisto/airflow
➜ airflow uv venv
➜ airflow source .venv/bin/activate
Install Airflow according to the official documentation
Here, we're installing the latest version, 3.1.5. Released just three weeks ago, it's fresh off the press.
AIRFLOW_VERSION=3.1.5
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
uv pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
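The constraint URL follows a fixed pattern, so it can also be assembled in Python, for example to print and check it before installing. This is only a minimal sketch; the function name `constraint_url` is my own and is not part of Airflow.

```python
import sys
from typing import Optional


def constraint_url(airflow_version: str, python_version: Optional[str] = None) -> str:
    """Build the constraints-file URL used in the install command above."""
    if python_version is None:
        # Same value the shell one-liner computes, e.g. "3.12"
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


if __name__ == "__main__":
    print(constraint_url("3.1.5"))
```

Passing the Python version explicitly is handy when the interpreter running the script is not the one inside the virtual environment.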
Running in standalone mode
airflow standalone
Access localhost:8080 in your browser and note the username/password in the command output (used for login).
You can also change the user password yourself by directly modifying the simple_auth_manager_passwords.json.generated file.
➜ airflow cat simple_auth_manager_passwords.json.generated
{"admin": "admin"}
This means the user is admin, and the password is admin. This is for a test environment.
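Since the file is just a JSON object mapping user names to passwords, the edit can also be scripted. The sketch below assumes only the format shown above; the helper names `with_password` and `set_password` are hypothetical, and whether Airflow needs a restart to pick up the change is something I have not verified.

```python
import json
from pathlib import Path


def with_password(passwords_json: str, user: str, password: str) -> str:
    """Return the JSON text with `user` set to `password`."""
    passwords = json.loads(passwords_json)
    if not isinstance(passwords, dict):
        raise ValueError("expected a JSON object mapping user names to passwords")
    passwords[user] = password
    return json.dumps(passwords)


def set_password(passwords_file: Path, user: str, password: str) -> None:
    """Rewrite the generated passwords file in place."""
    text = passwords_file.read_text(encoding="utf-8")
    passwords_file.write_text(with_password(text, user, password), encoding="utf-8")
```

For example, `set_password(Path("simple_auth_manager_passwords.json.generated"), "admin", "something-longer")` would replace the default test password.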
2. Writing the Airflow DAG
Let’s look directly at the example:
➜ airflow cat dags/download_twitter_media.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.standard.operators.bash import BashOperator
from airflow.models import Variable
import json

# Read the user list from an Airflow Variable (a JSON-formatted string)
USERS_JSON = Variable.get("twitter_users", default_var='["nasa"]')
try:
    USERS = json.loads(USERS_JSON)
except json.JSONDecodeError:
    raise ValueError("Variable 'twitter_users' must be a valid JSON array of strings.")

DOWNLOAD_BASE_PATH = Variable.get("twitter_download_path", default_var="~/Pictures/twitter")

default_args = {
    "owner": "data_team",
    "depends_on_past": False,
    "email_on_failure": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="download_twitter_multiple_users",
    default_args=default_args,
    description="Download media from multiple Twitter/X users using gallery-dl",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["social_media", "twitter", "multi_user"],
) as dag:

    prev_task = None  # Run the tasks serially to avoid getting the IP banned

    # Dynamically generate one task per user
    for user in USERS:
        user = user.strip()
        if not user:
            print("no twitter user , set at 'Airflow/admin/variables'")
            continue  # Skip empty entries so no task is created with an empty name

        #mkdir_cmd = f"mkdir -p {DOWNLOAD_BASE_PATH}/{user}"

        # gallery-dl command (incremental download)
        gallery_dl_cmd = (
            f"gallery-dl "
            f"--cookies-from-browser firefox "
            f"--download-archive {DOWNLOAD_BASE_PATH}/{user}/archive.txt "
            f"https://x.com/{user}/media"
        )

        #create_dir = BashOperator(
        #    task_id=f"create_dir_{user}",
        #    bash_command=mkdir_cmd,
        #)

        download_task = BashOperator(
            task_id=f"download_{user}",
            bash_command=gallery_dl_cmd,
        )

        # Serial: the current task runs only after the previous one has finished
        if prev_task:
            prev_task >> download_task
        prev_task = download_task

        # Set the dependency: create the directory first, then download
        #create_dir >> download_task
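The command string in the DAG is built by plain string interpolation, so a user name containing spaces or shell metacharacters would end up unquoted in the bash command. One possible hardening, shown here only as a sketch, is to quote each argument with `shlex.quote`. The helper name `build_gallery_dl_cmd` is my own; the gallery-dl flags are the same two used in the DAG. Because quoting stops bash from expanding `~`, the sketch expands it in Python first, which assumes the DAG is parsed on the same machine that runs the task (true for standalone mode).

```python
import os
import shlex


def build_gallery_dl_cmd(user: str, base_path: str, browser: str = "firefox") -> str:
    """Assemble the same gallery-dl command as the DAG, with shell quoting."""
    # Expand "~" here, since a quoted "~" would no longer be expanded by bash
    archive = os.path.expanduser(f"{base_path}/{user}/archive.txt")
    args = [
        "gallery-dl",
        "--cookies-from-browser", browser,
        "--download-archive", archive,
        f"https://x.com/{user}/media",
    ]
    # Quote every argument so unusual characters cannot alter the command
    return " ".join(shlex.quote(arg) for arg in args)
```

In the DAG, `bash_command=build_gallery_dl_cmd(user, DOWNLOAD_BASE_PATH)` would then take the place of the hand-built string.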
Today I added a download archive for incremental downloads. It's actually an SQLite database file; the filename can be specified.
➜ file archive.txt
archive.txt: SQLite 3.x database, last written using SQLite version 3051001, file counter 1, database pages 2, cookie 0x1, schema 4, UTF-8, version-valid-for 1
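Since the archive is an ordinary SQLite file, Python's built-in `sqlite3` module can open it. The sketch below makes no assumption about gallery-dl's table layout: it simply lists whatever tables exist and counts their rows. The function name `summarize_archive` is hypothetical.

```python
import sqlite3


def summarize_archive(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    # Table names come from the database itself, so quote them as identifiers
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in tables
    }
```

To try it, open the file with `conn = sqlite3.connect("archive.txt")`, print `summarize_archive(conn)`, and close the connection afterwards. A growing row count between runs is a quick sign that incremental downloads are being recorded.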
The twitter_download_path and twitter_users values above are read from Airflow Variables. The first is a plain string path, and the second is a JSON array of user names stored as a string.
Next time, I can directly modify the user and target path in the Airflow web interface, which is very convenient and practical.
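Because the user list is typed by hand in the web interface, it can help to validate it a bit more strictly than `json.loads` alone does. A minimal sketch, assuming the Variable holds a JSON array of strings as described above; `parse_users` is a name I made up for illustration.

```python
import json


def parse_users(raw: str) -> list:
    """Parse the twitter_users Variable into a clean list of user names."""
    try:
        users = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("twitter_users must be a valid JSON array of strings") from exc
    if not isinstance(users, list) or not all(isinstance(u, str) for u in users):
        raise ValueError("twitter_users must be a JSON array of strings")
    # Trim whitespace and drop empty entries
    return [u.strip() for u in users if u.strip()]
```

With this, a value such as `"nasa"` (a bare string instead of an array) fails loudly at import time, where Airflow reports it as a DAG import error, instead of being iterated character by character.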
If you encounter a DAG import error, a message will appear on the Airflow homepage. Click to view the error and fix it.
3. Screenshot Example
A picture is worth a thousand words:
DAG
Variable Settings
Completed on the web.
Task Details
The screenshot is very clear; no further explanation is needed. See the log on the right; it has the same effect as manually executing it in the terminal.
This way, the downloads run automatically every day. If you have a NAS, you can deploy a similar setup on it to download and search automatically, and even add a step that sends a notification with the results. The logic can be changed however you like, for example one DAG per user, or parallel processing. I deliberately made the tasks run serially to avoid getting my IP blocked.
Furthermore, gallery-dl supports many platforms and can download from almost all common content websites. There are many similar tools available; if gallery-dl doesn't suit you, pick an alternative with a strong community and frequent updates.
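The difference between serial and parallel runs comes down to which dependency edges exist between the download tasks. The sketch below is plain Python with no Airflow involved; `dependency_edges` is a hypothetical helper that only shows which (upstream, downstream) pairs the `prev_task` pattern produces.

```python
def dependency_edges(task_ids: list, serial: bool = True) -> list:
    """Return (upstream, downstream) pairs between consecutive tasks.

    With serial=False there are no edges, so the scheduler is free to
    run every download task at the same time.
    """
    if not serial:
        return []
    # Pair each task with the one that follows it: a -> b, b -> c, ...
    return list(zip(task_ids, task_ids[1:]))
```

In the DAG, each pair corresponds to one `upstream >> downstream` statement; leaving those statements out is all it takes to switch to parallel downloads.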
Copyright statement:
- All content without a stated source is original. Please do not reprint it without authorization (reprints often break the typesetting, and the content can no longer be controlled or kept up to date);
- For non-profit use, when quoting any content of this blog, please give the relevant page address of this site as the 'source of original text' or a 'reference link' (for the convenience of readers).