docker-airflow
Bump airflow version to 1.10.10
Fix #535 and close #536, where SQLAlchemy==1.3.16 was causing issues, based on https://github.com/apache/airflow/issues/8211
This has been working fine on our internal Airflow deployment.
Hopefully we can see this merged. I opened an issue a couple of days ago: https://github.com/puckel/docker-airflow/issues/536
But it seems like this repo is not maintained anymore. Simple fast-forward PRs are not being merged, and there are no new commits...
Bump
With the Airflow 1.10.10 release they also support a production Docker image. You can read about it in the official blog.
I've been using it instead of this one. So far so good: apache/airflow:1.10.10. The main change between the images was switching from the POSTGRES_XXX variables to AIRFLOW__CORE__SQL_ALCHEMY_CONN and AIRFLOW__CELERY__RESULT_BACKEND (I think those were the names).
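For reference, this is roughly what that change looks like in a compose service (a sketch only; the connection string is the same default one used further down this thread):
    environment:
      # puckel/docker-airflow style (its entrypoint builds the DSN from these):
      # - POSTGRES_HOST=postgres
      # - POSTGRES_USER=airflow
      # - POSTGRES_PASSWORD=airflow
      # official apache/airflow image: pass the full Airflow settings instead
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
      # only needed with the CeleryExecutor
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres:5432/airflow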
It would be nice if puckel kept the docker-compose.yml files updated (_CMD env vars, secrets, add networking...) since the image itself is no longer necessary.
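For example, an untested sketch of what that could look like with Docker secrets and Airflow's _CMD variables (secret names and file paths are made up):
version: '3.1'   # file-based secrets need compose file format 3.1+
services:
  webserver:
    image: apache/airflow:1.10.10
    command: webserver
    secrets:
      - airflow_sql_alchemy_conn
    environment:
      # Airflow also reads <SECTION>__<KEY>_CMD variables and runs the command
      # to obtain the real value, so the DSN never shows up in `docker inspect`
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=cat /run/secrets/airflow_sql_alchemy_conn
secrets:
  airflow_sql_alchemy_conn:
    file: ./secrets/sql_alchemy_conn.txt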
A production-ready docker-compose with secret handling etc. would be nice.
@wittfabian, github.com/apache/airflow releases an official production-ready image. This repo is no longer maintained (it seems).
docker pull apache/airflow:1.10.10
Yeah, we use that too. What I meant is a description of how best to handle environment variables etc. in production environments.
Or how best to build on the base image to get your own configuration: users, SSL, connections, variables.
Using the image is usually only half the battle; the difficult part is not described anywhere.
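For instance, it isn't obvious whether connections and variables should simply be injected as environment variables. Something like this reportedly works on 1.10.10 (a sketch only, names and URIs are made up):
    environment:
      # connections can be defined as AIRFLOW_CONN_<CONN_ID> URIs
      - AIRFLOW_CONN_MY_POSTGRES=postgres://user:pass@somehost:5432/somedb
      # and, since 1.10.10, variables as AIRFLOW_VAR_<KEY>
      - AIRFLOW_VAR_ENVIRONMENT=staging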
I know what you mean.
I deployed my stack on a single machine with an external DB as a service to make the deployment as idempotent as possible (it holds no data or state, only functionality).
Depending on your needs you might want one kind of deployment or another. For example, if you cannot afford to spend a couple of months learning Kubernetes, or if everyone on a big team should be knowledgeable about the deployment process, I suggest an easier technology (docker-compose or docker stack).
If you need routing between several instances of the webserver and SSL, you can use Traefik. But if it's not open to the public, or it's running inside a VPN, then SSL is only makeup.
If you need to scale BIG then I suggest scaling horizontally, adding templated machines with docker-compose and a bunch of workers. But if you are in the early stages you can get away with scaling vertically, simply upping the machine's resources.
For syncing your DAGs there are options too, depending on how you want your deployments done. Some people:
- build a new image with the DAGs embedded and roll it out
- use git-sync with a shared Docker volume (see the sketch after this list)
- share a volume with the host and schedule a cron job with rsync, git pull, or whatever you use as a VCS.
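For the git-sync option, an untested sketch of what a sidecar could look like in docker-compose (image tag, repo URL and paths are placeholders; check the git-sync docs):
services:
  git-sync:
    image: k8s.gcr.io/git-sync/git-sync:v3.1.6   # placeholder tag
    environment:
      - GIT_SYNC_REPO=https://example.com/your/dags-repo.git
      - GIT_SYNC_BRANCH=master
      - GIT_SYNC_ROOT=/git
      - GIT_SYNC_DEST=dags
      - GIT_SYNC_WAIT=60        # seconds between pulls
    volumes:
      - dags:/git
  scheduler:
    image: apache/airflow:1.10.10
    command: scheduler
    environment:
      # plus the usual Airflow DB/env settings
      - AIRFLOW__CORE__DAGS_FOLDER=/git/dags   # point Airflow at the synced checkout
    volumes:
      - dags:/git:ro
volumes:
  dags: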
In terms of local deployment, it depends on the Airflow version you are using. If you choose the stable 1.10.x you can get away with the pip version to run the whole thing. But if you run 2.0 then it's best to run a small docker-compose stack, because the web resources are not built and dependencies are not installed.
So you see, there are lots of options (even more than I listed). It depends on what you want.
Since migrating to the official image is being discussed here, I want to add some things I figured out today.
First of all, here's a discussion on the official docker image with docker-compose examples: https://github.com/puckel/docker-airflow/issues/536
And here's what I had to do to migrate from puckel's image...
Migration to official image
And in order to change from the puckel version to the official one, I had to...
- Change all ENV vars to the full Airflow form (example: EXECUTOR -> AIRFLOW__CORE__EXECUTOR)
- I also had to use AIRFLOW__CORE__AIRFLOW_HOME instead of AIRFLOW_HOME, even though it gives deprecation warnings
- Instead of using POSTGRES_PASSWORD, I had to change to the full connection string: AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
- Change /usr/local/airflow to /opt/airflow in AIRFLOW__CORE__AIRFLOW_HOME and in the volumes
- Run upgradedb manually
I was also upgrading from 1.10 and had exceptions when accessing the web interface. It turned out a NULL in the dag.description column caused it. This SQL fixed it:
UPDATE dag
SET description = ''
WHERE description IS NULL;
And here's my docker-compose config using LocalExecutor...
docker-compose.airflow.yml:
version: '2.1'
services:
  airflow:
    # image: apache/airflow:1.10.10
    build:
      context: .
      args:
        - DOCKER_UID=${DOCKER_UID-1000}
      dockerfile: Dockerfile
    restart: always
    environment:
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:${POSTGRES_PW-airflow}@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=${AF_FERNET_KEY-GUYoGcG5xdn5K3ysGG3LQzOt3cc0UBOEibEPxugDwas=}
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__AIRFLOW_HOME=/opt/airflow/
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
      - AIRFLOW__CORE__LOGGING_LEVEL=${AF_LOGGING_LEVEL-info}
    volumes:
      - ../airflow/dags:/opt/airflow/dags:z
      - ../airflow/plugins:/opt/airflow/plugins:z
      - ./volumes/airflow_data_dump:/opt/airflow/data_dump:z
      - ./volumes/airflow_logs:/opt/airflow/logs:z
    healthcheck:
      test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
docker-compose.yml:
version: '2.1'
services:
  postgres:
    image: postgres:9.6
    container_name: af_postgres
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=${POSTGRES_PW-airflow}
      - POSTGRES_DB=airflow
      - PGDATA=/var/lib/postgresql/data/pgdata
    volumes:
      - ./volumes/postgres_data:/var/lib/postgresql/data/pgdata:Z
    ports:
      - 127.0.0.1:5432:5432

  webserver:
    extends:
      file: docker-compose.airflow.yml
      service: airflow
    container_name: af_webserver
    command: webserver
    depends_on:
      - postgres
    ports:
      - ${DOCKER_PORTS-8080}
    networks:
      - proxy
      - default
    environment:
      # Web Server Config
      - AIRFLOW__WEBSERVER__DAG_DEFAULT_VIEW=graph
      - AIRFLOW__WEBSERVER__HIDE_PAUSED_DAGS_BY_DEFAULT=true
      - AIRFLOW__WEBSERVER__RBAC=true
      # Web Server Performance tweaks
      # 2 * NUM_CPU_CORES + 1
      - AIRFLOW__WEBSERVER__WORKERS=${AF_WORKERS-2}
      # Restart workers every 30min instead of 30seconds
      - AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.airflow.rule=Host(`af.example.com`)"
      - "traefik.http.routers.airflow.middlewares=admin-auth@file"

  scheduler:
    extends:
      file: docker-compose.airflow.yml
      service: airflow
    container_name: af_scheduler
    command: scheduler
    depends_on:
      - postgres
    environment:
      # Performance Tweaks
      # Reduce how often DAGs are reloaded to dramatically reduce CPU use
      - AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=${AF_MIN_FILE_PROCESS_INTERVAL-60}
      - AIRFLOW__SCHEDULER__MAX_THREADS=${AF_THREADS-1}

networks:
  proxy:
    external: true
Dockerfile:
# Custom Dockerfile
FROM apache/airflow:1.10.10
# Install mssql support & dag dependencies
USER root
RUN apt-get update -yqq \
&& apt-get install -y gcc freetds-dev \
&& apt-get install -y git procps \
&& apt-get install -y vim
RUN pip install apache-airflow[mssql,ssh,s3,slack]
RUN pip install azure-storage-blob sshtunnel google-api-python-client oauth2client \
&& pip install git+https://github.com/infusionsoft/Official-API-Python-Library.git \
&& pip install rocketchat_API
# This fixes permission issues on linux.
# The airflow user should have the same UID as the user running docker on the host system.
# make build adjusts this value automatically
ARG DOCKER_UID
RUN \
: "${DOCKER_UID:?Build argument DOCKER_UID needs to be set and non-empty. Use 'make build' to set it automatically.}" \
&& usermod -u ${DOCKER_UID} airflow \
&& find / -path /proc -prune -o -user 50000 -exec chown -h airflow {} \; \
&& echo "Set airflow's uid to ${DOCKER_UID}"
USER airflow
Makefile
And here's my Makefile to control the containers, with commands like make run:
SERVICE = "scheduler"
TITLE = "airflow containers"
ACCESS = "http://af.example.com"
.PHONY: run
build:
	docker-compose build
run:
	@echo "Starting $(TITLE)"
	docker-compose up -d
	@echo "$(TITLE) running on $(ACCESS)"
runf:
	@echo "Starting $(TITLE)"
	docker-compose up
stop:
	@echo "Stopping $(TITLE)"
	docker-compose down
restart: stop print-newline run
tty:
	docker-compose run --rm --entrypoint='' $(SERVICE) bash
ttyr:
	docker-compose run --rm --entrypoint='' -u root $(SERVICE) bash
attach:
	docker-compose exec $(SERVICE) bash
attachr:
	docker-compose exec -u root $(SERVICE) bash
logs:
	docker-compose logs --tail 50 --follow $(SERVICE)
conf:
	docker-compose config
initdb:
	docker-compose run --rm $(SERVICE) initdb
upgradedb:
	docker-compose run --rm $(SERVICE) upgradedb
print-newline:
	@echo ""
	@echo ""
Also, in the official repo they are working on a docker-compose config file. Feel free to contribute.
Hello. How were you able to run the official Airflow image? I pulled it and then ran docker run apache/airflow:1.10.10 webserver, and I get an error about tables. So I tried to run initdb first with docker run apache/airflow:1.10.10 initdb; webserver, and it doesn't recognize the second argument. Any suggestions? Thank you very much.
You can use an "init task". See: https://github.com/apache/airflow/issues/8605#issuecomment-623182960
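Roughly, the idea is a one-shot compose service that runs the DB initialisation and exits; an untested sketch (connection string and service names are just examples):
services:
  initdb:
    image: apache/airflow:1.10.10
    environment:
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
    command: initdb           # runs the schema migrations once, then exits
    depends_on:
      - postgres
  webserver:
    image: apache/airflow:1.10.10
    environment:
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
    command: webserver
    restart: always           # keep restarting until initdb has finished
    depends_on:
      - initdb                # note: compose only waits for start, not completion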
@KimchaC Works for me!
make run attach seems to attach to af_scheduler. Is it possible to target it to af_webserver?
@athenawisdoms yes, you can adjust the first line in the Makefile, SERVICE = "scheduler", to point at the service you want (e.g. "webserver").
Just for reference, we've also been using Airflow split across two repositories: one for the DAGs and the other for the docker-compose infra. The DAGs repository is separate and is updated by a webhook, which triggers an update from our CI and VCS to run a Python Git command to pull. I can probably clean it up and open-source it later on.
@gnomeria Is there any resource on how to do that?
I'm sorry, I'm pretty swamped at the moment and couldn't get it cleaned up. But the general idea is that we create a webhook with something like this:
import git
import subprocess
import time
from flask import Flask, request, abort
from flask import jsonify

app = Flask(__name__)

def rebuild_docker_compose():
    dc_build = subprocess.Popen("docker-compose build", shell=True)
    dc_build_status = dc_build.wait()
    dc_restart = subprocess.Popen("docker-compose restart", shell=True)
    dc_restart_status = dc_restart.wait()
    return {'build_status': dc_build_status, 'restart_status': dc_restart_status}

# GET: rebuild and restart the docker-compose stack
@app.route('/trigger-update', methods=['GET'])
def webhook():
    return_status = rebuild_docker_compose()
    print("Return code: {}".format(return_status))
    res = {
        'status': 200,
        'extra': return_status
    }
    return jsonify(res), 200

# POST: pull the latest DAGs from the dags repository
@app.route('/trigger-update', methods=['POST'])
def webhook_post():
    repo = git.Repo('../dags')
    repo.remotes.origin.pull()
    res = {
        'status': 200
    }
    return jsonify(res), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8081)
and run this inside a screen session with:
#!/bin/bash
until python3 webhook.py; do
echo "'webhook.py' crashed with exit code $?. Restarting..." >&2
sleep 5
done
Some of the steps are a bit manual too, and it lacks some security measures. For the webhook, I think you should use something like https://github.com/adnanh/webhook to run the git pull with a webhook token.
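For example, a rough sketch of a hooks file for adnanh/webhook (the id, script path and token are placeholders; check that project's docs for the exact rule syntax):
- id: update-dags
  # hypothetical helper script that runs `git pull` in the dags checkout
  execute-command: /opt/airflow-infra/update_dags.sh
  command-working-directory: /opt/airflow-infra
  trigger-rule:
    match:
      type: value
      value: my-secret-token        # shared secret sent by the CI job
      parameter:
        source: header
        name: X-Webhook-Token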
Though it's been running fine for almost a year now with that 😃
For the airflow-dag repo, it's a bit more straightforward, and it contains only the DAG code. It's also structured with multiple folders and has some shared commons/modules with unit testing via pytest.
Hi,
https://github.com/puckel/docker-airflow/pull/576 uses the official Airflow image in puckel's docker-compose files.
Hey. This is really nice. Thanks a lot.
The easy way is to mount a requirements.txt:
volumes:
  - ./requirements.txt:/requirements.txt
requirements.txt:
apache-airflow[gcp]==1.10.12