Airflow Docker in Docker with Docker Operator
What's Airflow?
Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows [1]. You define workflows as code, called DAGs (Directed Acyclic Graphs).
A DAG is a collection of tasks, organized with dependencies and relationships that describe how they should run [2].
What are Providers Packages?
Apache Airflow was built to be modular: the core provides the basic tasks and operators, and it can be extended by installing third-party packages, called providers.
Providers can contain operators, hooks, and more for communicating with external systems. An operator is a template for a predefined task and can be defined declaratively within a DAG; examples of core operators are the BashOperator and the PythonOperator.
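As a quick sketch of what that looks like (the DAG id and the callable below are made up for illustration), core operators are declared as tasks inside a DAG and wired together with dependencies:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    # Trivial callable, just so the PythonOperator has something to run.
    print("Hello from a PythonOperator task")


with DAG(
    dag_id="core_operators_example",  # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    print_date = BashOperator(task_id="print_date", bash_command="date")
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)

    # ">>" declares the dependency: print_date runs before say_hello.
    print_date >> say_hello
```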
In this example we will install the Docker provider and use the DockerOperator, with Airflow itself running in Docker.
Why use Docker?
There are several reasons why you would want to use Docker for containerizing your workloads.
- Portability: Containers enable developers to build and test applications on their own machines, and then easily deploy the same application to different environments, such as staging or production, without worrying about differences in the underlying infrastructure.
- Isolation: Containers provide isolation between the application and the underlying operating system and infrastructure, which can help to reduce conflicts and dependencies between different applications and environments.
- Ease of use: Docker provides a simple and consistent way to package, distribute, and deploy applications, making it easier for developers to build and manage applications at scale.
- Scalability: Docker allows organizations to easily scale applications up or down by adding or removing containers as needed, which can be useful for applications that experience variable workloads or spikes in traffic.
- Resource efficiency: Containers are lightweight and use resources more efficiently than traditional virtual machines, which can help to reduce costs and improve performance.
You can therefore package all the dependencies inside the container, and Airflow simply schedules the container runs.
Note that it's also possible to use the BashOperator to run Docker images (with `docker run`) in this setup, but I thought it would be nice to try out the DockerOperator in this case.
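For comparison, a minimal sketch of that BashOperator alternative could look like the following (the DAG id and task id are placeholders; it assumes the docker CLI and the host's Docker socket are available to the Airflow worker, which is exactly what we set up below):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="docker_via_bash",  # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    # Shell out to the docker CLI on the worker instead of using the DockerOperator.
    run_busybox = BashOperator(
        task_id="run_busybox",
        bash_command="docker run --rm busybox:latest /bin/sleep 10",
    )
```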
Airflow Docker Quickstart
When I started using Airflow (from v1), it was a little tricky for newcomers to get started, but since v2 the documentation has improved to such a degree that it became really easy to get set up in minutes. All the information on how to do a full docker-compose installation is available in Airflow's how-to section. Using docker-compose is a great way to learn and test, but it is not recommended for production unless you have specialized expertise in Docker & Docker Compose. In a separate post I will describe how to use the official Helm chart to deploy Airflow on a Kubernetes cluster, but sometimes (especially for internal company projects in Germany) you won't be given a Kubernetes cluster 😄 in that case, setting up Airflow with docker-compose is a possible alternative.
I will just repeat the useful commands here, focusing on a Debian-based Linux distro with Docker already installed (the most common case after all).
We create a folder and download the official `docker-compose.yaml` file from Airflow:
```bash
mkdir airflow-docker
cd airflow-docker
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.4.0/docker-compose.yaml'
```
Extending Airflow's Docker Image
By extending the official Docker image, you can install extra dependencies. I'm extending the image by also installing Docker on it, which makes it possible to use the Unix Docker socket to control the host's Docker daemon. Create a `Dockerfile` within the directory with the following:
```dockerfile
# Dockerfile
FROM apache/airflow:2.4.0-python3.10

USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        ca-certificates \
        curl \
        gnupg \
        lsb-release

# Docker
RUN mkdir -p /etc/apt/keyrings
RUN curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
RUN echo \
    "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian \
    $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
RUN apt-get update
RUN apt-get install -y --no-install-recommends docker-ce docker-ce-cli containerd.io docker-compose-plugin \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Usermod
RUN usermod -aG docker airflow

USER airflow
COPY requirements.txt /tmp/requirements.txt
# Installing requirements.
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```
And also create a `requirements.txt` with the following:
```txt
# requirements.txt
apache-airflow[sendgrid]
apache-airflow-providers-docker
docker
```
The Sendgrid provider is just a little extra in case you want to send an email to someone (or yourself) when a DAG has completed or when an error occurred.
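As a sketch of that idea (the address and DAG id below are placeholders, and it assumes the email backend is configured as shown later in this post): besides sending mail explicitly with an EmailOperator, Airflow can also notify you automatically when a task fails, via `default_args`:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "email": ["yourmail@yourcompany.com"],  # placeholder address
    "email_on_failure": True,               # send a mail when a task fails
    "email_on_retry": False,
}

with DAG(
    dag_id="notify_on_failure_example",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    # This task fails on purpose, which should trigger the failure email.
    fail_on_purpose = BashOperator(task_id="fail_on_purpose", bash_command="exit 1")
```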
Now we create the folders Airflow expects and a `.env` file containing the current user ID and the name you would like to give your Docker image:
```bash
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env
echo -e "AIRFLOW_IMAGE_NAME=yourname/airflow:2.4.0" >> .env
```
To source the environment variables in bash:

```bash
. .env
```

or

```bash
source .env
```
Implementing the Docker Provider
Now we just need to edit the docker-compose.yaml file a little bit. The changes are marked with comments in the snippet below: uncomment the build line so that you are able to build your extended image and, most importantly, add the extra volume line to be able to control the host's Docker instance. You can also decide whether you want the example DAGs loaded or not.
```yml
# docker-compose.yaml
version: '3'
x-airflow-common: &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.4.0}
  build: . # uncomment to build
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    # For backward compatibility, with Airflow <2.3
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false' # we don't need to load examples now
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - '/var/run/docker.sock:/var/run/docker.sock' # to control host docker
  user: '${AIRFLOW_UID:-50000}:0'
  depends_on: &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy
```
Keep in mind that, in the above, any top-level key starting with `x-` will be ignored by Compose; `x-airflow-common` defines an anchor (indicated by `&`), which is then reused within the `services` section.
Building and Initializing Airflow
To build the image
```bash
docker-compose build
```
After the build is completed, you can run the init command, which will initialize the database:
```bash
docker-compose up airflow-init
```
Defining our DAG
Now we can define our first DAG. I just took an example DAG from the Docker provider's example DAGs and changed it to use the simple busybox image.
```python
# docker.py
from airflow import models
from airflow.operators.bash import BashOperator
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.utils.dates import days_ago

DAG_ID = 'docker_test'

with models.DAG(
    DAG_ID,
    schedule_interval="@once",
    start_date=days_ago(0),
    catchup=False,
    tags=["example", "docker"],
) as dag:
    t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
    t2 = BashOperator(task_id='sleep', bash_command='sleep 5', retries=3, dag=dag)
    # [START howto_operator_docker]
    t3 = DockerOperator(
        docker_url='unix://var/run/docker.sock',  # Set your docker URL
        command='/bin/sleep 10',
        image='busybox:latest',
        network_mode='bridge',
        task_id='docker_op_tester',
        dag=dag,
    )
    # [END howto_operator_docker]
    t4 = BashOperator(task_id='print_hello', bash_command='echo "hello world!!!"', dag=dag)
    (
        # TEST BODY
        t1
        >> [t2, t3]
        >> t4
    )
```
Running Airflow
Running Airflow is as simple as
```bash
docker-compose up
```
or daemonized
```bash
docker-compose up -d
```
If you navigate with your browser to localhost:8080, you will get the sign-in screen.
To sign in, use `airflow` as the username and `airflow` as the password. Once you are signed in, you will see the dashboard with a list of all the DAGs (in our case only one).
Activate it as shown, and you can also trigger it manually to test it out. If you click on the DAG, you can view the status of each task in different views. The Graph view, for example, will show you the dependencies.
And the Grid view will show you a summarized result.
To stop Airflow, press `CTRL+C` on the command line, or, if it was daemonized, run the following in the project folder:

```bash
docker-compose down
```
For our little email example, we can expand the environment variable file with our Sendgrid details. You should first sign up with Sendgrid, get an API key, and use that API key in the `.env` file.
```env
# .env
AIRFLOW_IMAGE_NAME=yourname/airflow:2.4.0
AIRFLOW_UID=1000
SENDGRID_MAIL_FROM=yourmail@yourcompany.com
SENDGRID_API_KEY=yoursendgridapi
```
Source the environment variables:
```bash
. .env
```
To change the email backend, we adjust the environment variables passed to the Docker image, shown in the commented lines below.
```yml
# docker-compose.yaml
version: '3'
x-airflow-common: &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.4.0}
  build: .
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    # For backward compatibility, with Airflow <2.3
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth'
    AIRFLOW__EMAIL__EMAIL_BACKEND: 'airflow.providers.sendgrid.utils.emailer.send_email' # changing email backend provider
    SENDGRID_MAIL_FROM: ${SENDGRID_MAIL_FROM} # sendgrid from mail
    SENDGRID_API_KEY: ${SENDGRID_API_KEY} # sendgrid api key
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - '/var/run/docker.sock:/var/run/docker.sock'
  user: '${AIRFLOW_UID:-50000}:0'
  depends_on: &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy
```
`AIRFLOW__EMAIL__EMAIL_BACKEND` corresponds to the `email_backend` entry in Airflow's config file; you can override any config value with an environment variable following the same `AIRFLOW__{SECTION}__{KEY}` convention. To make another DAG, we can just copy the previous DAG, rename the ID, and add the Sendgrid task.
```python
# docker_with_sendgrid.py
from airflow import models
from airflow.operators.bash import BashOperator
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.utils.dates import days_ago
from airflow.operators.email_operator import EmailOperator

DAG_ID = 'docker_test_with_sendgrid'

with models.DAG(
    DAG_ID,
    schedule_interval="@once",
    start_date=days_ago(0),
    catchup=False,
    tags=["example", "docker", "sendgrid"],
) as dag:
    t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
    t2 = BashOperator(task_id='sleep', bash_command='sleep 5', retries=3, dag=dag)
    # [START howto_operator_docker]
    t3 = DockerOperator(
        docker_url='unix://var/run/docker.sock',  # Set your docker URL
        command='/bin/sleep 10',
        image='busybox:latest',
        network_mode='bridge',
        task_id='docker_op_tester',
        dag=dag,
    )
    # [END howto_operator_docker]
    t4 = BashOperator(task_id='print_hello', bash_command='echo "hello world!!!"', dag=dag)
    t5 = EmailOperator(
        task_id="send_mail",
        to='yourmail@yourcompany.com',
        subject='Test mail',
        html_content='<p> You have got mail! <p>',
        dag=dag,
    )
    (
        # TEST BODY
        t1
        >> [t2, t3]
        >> t4
        >> t5
    )
```
Now when we open Airflow, we will see the new DAG in our dashboard.
If you trigger it, you can also see that the Graph view changed with the added dependency.
Finally, if Sendgrid was set up correctly, you can look at the logs to see that the email has been sent, and your email inbox should have the new email.
If you need to do a deep dive, you can shell into a running container by getting the helper shell script from Airflow:
```bash
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.4.0/airflow.sh'
chmod +x airflow.sh
```
and then while Airflow is running
```bash
./airflow.sh bash
```
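Inside that shell you could, for example, verify that the email backend override from the docker-compose.yaml actually took effect. A minimal sketch, assuming you run it in a Python shell inside one of the Airflow containers:

```python
from airflow.configuration import conf

# Environment variables of the form AIRFLOW__{SECTION}__{KEY} override airflow.cfg,
# so this should print the Sendgrid backend set in docker-compose.yaml.
print(conf.get("email", "email_backend"))
```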
If you made any mistakes and/or want to remove everything
```bash
docker-compose down --volumes --remove-orphans
```
And that's it. The full source code can be found on my GitHub.