docker-airflow
Bump airflow version to 1.10.10
Fix #535 and close #536, where SQLAlchemy==1.3.16 was causing issues, based on https://github.com/apache/airflow/issues/8211
This has been working fine on our internal Airflow deployment.
Hopefully we can see this merged. I opened an issue a couple of days ago: https://github.com/puckel/docker-airflow/issues/536
But it seems like this repo is not maintained anymore. Simple fast-forward PRs are not being merged, and there are no new commits...
Bump
With the Airflow 1.10.10 release they also support a production Docker image. You can read about it in the official blog.
I've been using it instead of this one. So far so good: apache/airflow:1.10.10. The main change between the images was switching from the POSTGRES_XXX variables to AIRFLOW__CORE__SQL_ALCHEMY_CONN and AIRFLOW__CELERY__RESULT_BACKEND (I think those were the names).
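For reference, this is roughly what that change looks like in a compose service (a sketch only; the connection string is the same default one used further down this thread):
    environment:
      # puckel/docker-airflow style (its entrypoint builds the DSN from these):
      # - POSTGRES_HOST=postgres
      # - POSTGRES_USER=airflow
      # - POSTGRES_PASSWORD=airflow
      # official apache/airflow image: pass the full Airflow settings instead
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
      # only needed with the CeleryExecutor
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres:5432/airflow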
It would be nice if puckel kept the docker-compose.yml files updated (_CMD env vars, secrets, add networking...) since the image itself is no longer necessary.
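For example, an untested sketch of what that could look like with Docker secrets and Airflow's _CMD variables (secret names and file paths are made up):
version: '3.1'   # file-based secrets need compose file format 3.1+
services:
  webserver:
    image: apache/airflow:1.10.10
    command: webserver
    secrets:
      - airflow_sql_alchemy_conn
    environment:
      # Airflow also reads <SECTION>__<KEY>_CMD variables and runs the command
      # to obtain the real value, so the DSN never shows up in `docker inspect`
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=cat /run/secrets/airflow_sql_alchemy_conn
secrets:
  airflow_sql_alchemy_conn:
    file: ./secrets/sql_alchemy_conn.txt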
A production-ready docker-compose with secret handling etc. would be nice.
@wittfabian, github.com/apache/airflow releases an official production-ready image. This repo is no longer maintained (it seems).
docker pull apache/airflow:1.10.10
Yeah, we use that too. What I meant is a description of how best to handle environment variables etc. in production environments.
Or how best to build on the base image to get your own configuration: users, SSL, connections, variables.
Using the image is usually only half the battle; the difficult part is not described anywhere.
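For instance, it isn't obvious whether connections and variables should simply be injected as environment variables. Something like this reportedly works on 1.10.10 (a sketch only, names and URIs are made up):
    environment:
      # connections can be defined as AIRFLOW_CONN_<CONN_ID> URIs
      - AIRFLOW_CONN_MY_POSTGRES=postgres://user:pass@somehost:5432/somedb
      # and, since 1.10.10, variables as AIRFLOW_VAR_<KEY>
      - AIRFLOW_VAR_ENVIRONMENT=staging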
I know what you mean.
I deployed my stack on a single machine with an external DB as a service to make the deployment as idempotent as possible (it holds no data or state, only functionality).
Depending on your needs you might want one kind of deployment or another. For example, if you cannot afford to spend a couple of months learning Kubernetes, or if everyone on a big team should be knowledgeable about the deployment process, I suggest an easier technology (docker-compose or docker stack).
If you need routing between several instances of the webserver and SSL, you can use Traefik. But if it's not open to the public, or it's running inside a VPN, then SSL is only makeup.
If you need to scale BIG then I suggest scaling horizontally, adding templated machines with docker-compose and a bunch of workers. But if you are in the early stages you can get away with scaling vertically, simply upping the machine's resources.
For syncing your DAGs there are options too, depending on how you want your deployments done. Some people:
- build a new image with the DAGs embedded and roll it out
- use git-sync with a shared Docker volume (see the sketch after this list)
- share a volume with the host and schedule a cron job with rsync, git pull, or whatever you use as a VCS.
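For the git-sync option, an untested sketch of what a sidecar could look like in docker-compose (image tag, repo URL and paths are placeholders; check the git-sync docs):
services:
  git-sync:
    image: k8s.gcr.io/git-sync/git-sync:v3.1.6   # placeholder tag
    environment:
      - GIT_SYNC_REPO=https://example.com/your/dags-repo.git
      - GIT_SYNC_BRANCH=master
      - GIT_SYNC_ROOT=/git
      - GIT_SYNC_DEST=dags
      - GIT_SYNC_WAIT=60        # seconds between pulls
    volumes:
      - dags:/git
  scheduler:
    image: apache/airflow:1.10.10
    command: scheduler
    environment:
      # plus the usual Airflow DB/env settings
      - AIRFLOW__CORE__DAGS_FOLDER=/git/dags   # point Airflow at the synced checkout
    volumes:
      - dags:/git:ro
volumes:
  dags: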
In terms of local deployment, it depends on the Airflow version you are using. If you choose the stable 1.10.x you can get away with the pip version to run the whole thing. But if you run 2.0 then it's best to run a small docker-compose stack, because the web resources are not built and dependencies are not installed.
So you see, there are lots of options (even more than I listed). It depends on what you want.
Since migrating to the official image is being discussed here, I want to add some things I figured out today.
First of all, here's a discussion on the official docker image with docker-compose examples: https://github.com/puckel/docker-airflow/issues/536
And here's what I had to do to migrate from puckel's image...
Migration to official image
And in order to change from the puckel version to the official one, I had to...
- Change all ENV vars to the full Airflow form (example: EXECUTOR -> AIRFLOW__CORE__EXECUTOR)
- I also had to use AIRFLOW__CORE__AIRFLOW_HOME instead of AIRFLOW_HOME, even though it gives deprecation warnings
- Instead of using POSTGRES_PASSWORD, I had to change to the full connection string: AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
- Change /usr/local/airflow to /opt/airflow in AIRFLOW__CORE__AIRFLOW_HOME and in the volumes
- Run upgradedb manually
I was also upgrading from 1.10 and had exceptions when accessing the web interface. It turned out a NULL in the dag.description column caused it. This SQL fixed it:
UPDATE dag
SET description = ''
WHERE description IS NULL;
And here's my docker-compose config using LocalExecutor...
docker-compose.airflow.yml:
version: '2.1'
services:
  airflow:
    # image: apache/airflow:1.10.10
    build:
      context: .
      args:
        - DOCKER_UID=${DOCKER_UID-1000}
      dockerfile: Dockerfile
    restart: always
    environment:
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:${POSTGRES_PW-airflow}@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=${AF_FERNET_KEY-GUYoGcG5xdn5K3ysGG3LQzOt3cc0UBOEibEPxugDwas=}
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__AIRFLOW_HOME=/opt/airflow/
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
      - AIRFLOW__CORE__LOGGING_LEVEL=${AF_LOGGING_LEVEL-info}
    volumes:
      - ../airflow/dags:/opt/airflow/dags:z
      - ../airflow/plugins:/opt/airflow/plugins:z
      - ./volumes/airflow_data_dump:/opt/airflow/data_dump:z
      - ./volumes/airflow_logs:/opt/airflow/logs:z
    healthcheck:
      test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
docker-compose.yml:
version: '2.1'
services:
  postgres:
    image: postgres:9.6
    container_name: af_postgres
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=${POSTGRES_PW-airflow}
      - POSTGRES_DB=airflow
      - PGDATA=/var/lib/postgresql/data/pgdata
    volumes:
      - ./volumes/postgres_data:/var/lib/postgresql/data/pgdata:Z
    ports:
      - 127.0.0.1:5432:5432

  webserver:
    extends:
      file: docker-compose.airflow.yml
      service: airflow
    container_name: af_webserver
    command: webserver
    depends_on:
      - postgres
    ports:
      - ${DOCKER_PORTS-8080}
    networks:
      - proxy
      - default
    environment:
      # Web Server Config
      - AIRFLOW__WEBSERVER__DAG_DEFAULT_VIEW=graph
      - AIRFLOW__WEBSERVER__HIDE_PAUSED_DAGS_BY_DEFAULT=true
      - AIRFLOW__WEBSERVER__RBAC=true
      # Web Server Performance tweaks
      # 2 * NUM_CPU_CORES + 1
      - AIRFLOW__WEBSERVER__WORKERS=${AF_WORKERS-2}
      # Restart workers every 30min instead of 30seconds
      - AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.airflow.rule=Host(`af.example.com`)"
      - "traefik.http.routers.airflow.middlewares=admin-auth@file"

  scheduler:
    extends:
      file: docker-compose.airflow.yml
      service: airflow
    container_name: af_scheduler
    command: scheduler
    depends_on:
      - postgres
    environment:
      # Performance Tweaks
      # Reduce how often DAGs are reloaded to dramatically reduce CPU use
      - AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=${AF_MIN_FILE_PROCESS_INTERVAL-60}
      - AIRFLOW__SCHEDULER__MAX_THREADS=${AF_THREADS-1}

networks:
  proxy:
    external: true
Dockerfile:
# Custom Dockerfile
FROM apache/airflow:1.10.10
# Install mssql support & dag dependencies
USER root
RUN apt-get update -yqq \
&& apt-get install -y gcc freetds-dev \
&& apt-get install -y git procps \
&& apt-get install -y vim
RUN pip install apache-airflow[mssql,ssh,s3,slack]
RUN pip install azure-storage-blob sshtunnel google-api-python-client oauth2client \
&& pip install git+https://github.com/infusionsoft/Official-API-Python-Library.git \
&& pip install rocketchat_API
# This fixes permission issues on linux.
# The airflow user should have the same UID as the user running docker on the host system.
# make build adjusts this value automatically
ARG DOCKER_UID
RUN \
: "${DOCKER_UID:?Build argument DOCKER_UID needs to be set and non-empty. Use 'make build' to set it automatically.}" \
&& usermod -u ${DOCKER_UID} airflow \
&& find / -path /proc -prune -o -user 50000 -exec chown -h airflow {} \; \
&& echo "Set airflow's uid to ${DOCKER_UID}"
USER airflow
Makefile
And here's my Makefile to control the containers, with commands like make run:
SERVICE = "scheduler"
TITLE = "airflow containers"
ACCESS = "http://af.example.com"
.PHONY: run
build:
	docker-compose build
run:
	@echo "Starting $(TITLE)"
	docker-compose up -d
	@echo "$(TITLE) running on $(ACCESS)"
runf:
	@echo "Starting $(TITLE)"
	docker-compose up
stop:
	@echo "Stopping $(TITLE)"
	docker-compose down
restart: stop print-newline run
tty:
	docker-compose run --rm --entrypoint='' $(SERVICE) bash
ttyr:
	docker-compose run --rm --entrypoint='' -u root $(SERVICE) bash
attach:
	docker-compose exec $(SERVICE) bash
attachr:
	docker-compose exec -u root $(SERVICE) bash
logs:
	docker-compose logs --tail 50 --follow $(SERVICE)
conf:
	docker-compose config
initdb:
	docker-compose run --rm $(SERVICE) initdb
upgradedb:
	docker-compose run --rm $(SERVICE) upgradedb
print-newline:
	@echo ""
	@echo ""
Also, in the official repo they are working on a docker-compose config file. Feel free to contribute.
Hello. How were you able to run the official Airflow image? I pulled it and then ran docker run apache/airflow:1.10.10 webserver, and I get an error about tables. So I tried to run initdb first with docker run apache/airflow:1.10.10 initdb; webserver, and it doesn't recognize the second argument. Any suggestions? Thank you very much.
You can use an "init task". See: https://github.com/apache/airflow/issues/8605#issuecomment-623182960
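Roughly, the idea is a one-shot compose service that runs the DB initialisation and exits; an untested sketch (connection string and service names are just examples):
services:
  initdb:
    image: apache/airflow:1.10.10
    environment:
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
    command: initdb           # runs the schema migrations once, then exits
    depends_on:
      - postgres
  webserver:
    image: apache/airflow:1.10.10
    environment:
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow
    command: webserver
    restart: always           # keep restarting until initdb has finished
    depends_on:
      - initdb                # note: compose only waits for start, not completion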
@KimchaC Works for me!
make run attach seems to attach to af_scheduler. Is it possible to target it to af_webserver?
@athenawisdoms yes, you can adjust the first line in the Makefile, SERVICE = "scheduler", to point at the service you want (e.g. "webserver").
Just for reference, we've also been using Airflow split across two repositories: one for the DAGs and the other for the docker-compose infra. The DAGs repository is separate and is updated by a webhook, which triggers an update from our CI and VCS to run a Python Git command to pull. I can probably clean it up and open-source it later on.
@gnomeria Is there any resource on how to do that?
I'm sorry, I'm pretty swamped at the moment and couldn't get it cleaned up. But the general idea is that we create a webhook with something like this:
import git
import subprocess
import time
from flask import Flask, request, abort
from flask import jsonify

app = Flask(__name__)

def rebuild_docker_compose():
    dc_build = subprocess.Popen("docker-compose build", shell=True)
    dc_build_status = dc_build.wait()
    dc_restart = subprocess.Popen("docker-compose restart", shell=True)
    dc_restart_status = dc_restart.wait()
    return {'build_status': dc_build_status, 'restart_status': dc_restart_status}

# GET: rebuild and restart the docker-compose stack
@app.route('/trigger-update', methods=['GET'])
def webhook():
    return_status = rebuild_docker_compose()
    print("Return code: {}".format(return_status))
    res = {
        'status': 200,
        'extra': return_status
    }
    return jsonify(res), 200

# POST: pull the latest DAGs from the dags repository
@app.route('/trigger-update', methods=['POST'])
def webhook_post():
    repo = git.Repo('../dags')
    repo.remotes.origin.pull()
    res = {
        'status': 200
    }
    return jsonify(res), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8081)
and run this inside a screen session with:
#!/bin/bash
until python3 webhook.py; do
echo "'webhook.py' crashed with exit code $?. Restarting..." >&2
sleep 5
done
Some of the steps are a bit manual too, and it lacks some security measures. For the webhook, I think you should use something like https://github.com/adnanh/webhook to run the git pull with a webhook token.
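For example, a rough sketch of a hooks file for adnanh/webhook (the id, script path and token are placeholders; check that project's docs for the exact rule syntax):
- id: update-dags
  # hypothetical helper script that runs `git pull` in the dags checkout
  execute-command: /opt/airflow-infra/update_dags.sh
  command-working-directory: /opt/airflow-infra
  trigger-rule:
    match:
      type: value
      value: my-secret-token        # shared secret sent by the CI job
      parameter:
        source: header
        name: X-Webhook-Token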
Though it's been running fine for almost a year now with that 😃
For the airflow-dag repo, it's a bit more straightforward, and it contains only the DAG code. It's also structured with multiple folders and has some shared commons/modules with unit testing via pytest.
Hi,
https://github.com/puckel/docker-airflow/pull/576 uses the official Airflow image in puckel's docker-compose files.
Hey. This is really nice. Thanks a lot.
The easy way is to mount a requirements.txt:
volumes:
  - ./requirements.txt:/requirements.txt
requirements.txt:
apache-airflow[gcp]==1.10.12