
support loading DAG definitions from S3 buckets

Open thesuperzapper opened this issue 4 years ago • 7 comments

Currently we support git-sync with the dags.gitSync.* values, but we can probably do something similar for S3 buckets. That is, let people store their dags in a folder on an S3 bucket.

Possibly we should generalise this to include GCS and ABS, but these probably have different libraries needed to do the sync (so might need to be separate features/containers). However, clearly S3 is the best place to start, as it's the most popular.
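
For illustration only, a hypothetical values layout mirroring the existing dags.gitSync.* block might look something like this (none of these dags.s3Sync keys exist in the chart today; every name below is just a sketch of the proposal):

# HYPOTHETICAL sketch -- the chart does not implement dags.s3Sync;
# all keys below are assumptions modelled on the existing dags.gitSync block
dags:
  s3Sync:
    enabled: true
    bucket: my-airflow-dags   # bucket holding the dag files (placeholder)
    prefix: dags/             # optional key prefix inside the bucket (placeholder)
    region: us-east-1
    syncWait: 60              # seconds between syncs, by analogy with gitSync (placeholder)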

thesuperzapper · Jun 30 '21 09:06

hey guys,

until we have this as a native solution, I created a sidecar container for syncing dags from AWS S3, take a look :)

https://github.com/yossisht9876/airflow-s3-dag-sync
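
If you just want the general idea, a minimal sidecar of this kind can be a container that loops aws s3 sync into the shared dags volume, along the lines of the sketch below (the bucket name, sync interval, and volume name are placeholders, and this is not necessarily how the linked repo implements it):

# generic s3-sync sidecar sketch; bucket, interval and volume names are placeholders
- name: s3-dag-sync
  image: amazon/aws-cli
  command: ["/bin/sh", "-c"]
  args:
    - |
      while true; do
        aws s3 sync s3://my-dag-bucket /opt/airflow/dags/ --no-progress --delete
        sleep 60
      done
  volumeMounts:
    - name: dags-data
      mountPath: /opt/airflow/dags/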

yossisht9876 · Apr 26 '22 07:04

Hi @thesuperzapper ,

I started working on this, implementing it similarly to the git-sync approach you mentioned. My idea is to run rclone sync as a Kubernetes Job that fetches the dags from the S3 bucket and stores them in a mounted volume, which is also mounted into the Airflow scheduler pod. Should I continue implementing that?
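
For context, a bare-bones version of that Job could look like the sketch below (the rclone remote, bucket, and PVC names are placeholders, and the rclone config still has to be provided, e.g. via a mounted config file or RCLONE_CONFIG_* environment variables):

# rclone-based sync Job sketch; remote/bucket/PVC names are placeholders
apiVersion: batch/v1
kind: Job
metadata:
  name: rclone-dag-sync
  namespace: airflow
spec:
  template:
    spec:
      containers:
        - name: rclone
          image: rclone/rclone
          args: ["sync", "s3remote:my-dag-bucket", "/opt/airflow/dags/"]
          volumeMounts:
            - name: dags-data
              mountPath: /opt/airflow/dags/
      volumes:
        - name: dags-data
          persistentVolumeClaim:
            claimName: airflow-dags
      restartPolicy: OnFailure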

Best Regards,

tarekabouzeid · May 12 '22 15:05

I have a better solution, but you have to configure a PVC for the dag bag folder /opt/airflow/dags.

Once the PVC is ready, you just need to create a CronJob that runs every X minutes and mirrors the S3 bucket into the dags folder:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-sync
  namespace: airflow
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: aws-cli
              image: amazon/aws-cli
              # AWS credentials must be available to the pod (e.g. via IRSA or env vars)
              env:
                - name: AWS_REGION
                  value: us-east-1
              # the image entrypoint is `aws`, so this runs:
              # aws s3 sync s3://bucket-name /opt/airflow/dags/ --no-progress --delete
              args:
                - s3
                - sync
                - s3://bucket-name
                - /opt/airflow/dags/
                - --no-progress
                - --delete
              volumeMounts:
                - name: dags-data
                  mountPath: /opt/airflow/dags/
          volumes:
            - name: dags-data
              persistentVolumeClaim:
                claimName: airflow-dags
          restartPolicy: OnFailure
      ttlSecondsAfterFinished: 172800

yossisht9876 · Jul 27 '22 07:07

Not a bad idea. I'd also add that if you want a GitOps approach, you can disable the schedule via suspend: true, then have your CI/CD create an ad-hoc s3-sync Job from the CronJob as a template via kubectl create job <name> --from=cronjob/s3-sync (see https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-job-em-).
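
For example (the job name is arbitrary):

# in the CronJob spec above, disable the schedule:
spec:
  suspend: true

# then trigger an ad-hoc sync from CI/CD:
kubectl create job s3-sync-manual --from=cronjob/s3-sync -n airflow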

darren-recentive · Oct 21 '23 22:10

I just want to say that while baked-in support for s3-sync did NOT make it into version 8.9.0 of the chart, you can use the extraInitContainers and extraContainers values that were added in https://github.com/airflow-helm/charts/pull/856.

Now you can effectively do what was proposed in https://github.com/airflow-helm/charts/pull/828, by using the following values:

  • For Scheduler/Webserver/Workers (but not KubernetesExecutor):
    • airflow.extraContainers (looping sidecar to sync into dags folder)
    • airflow.extraInitContainers (initial clone of S3 bucket into dags folder)
    • airflow.extraVolumeMounts (mount the emptyDir)
    • airflow.extraVolumes (define an emptyDir volume)
  • For KubernetesExecutor Pod template:
    • ~airflow.kubernetesPodTemplate.extraContainers~ (you don't need the sidecar for transient Pods)
    • airflow.kubernetesPodTemplate.extraInitContainers
    • airflow.kubernetesPodTemplate.extraVolumeMounts
    • airflow.kubernetesPodTemplate.extraVolumes

If someone wants to share their values and report how well it works, I am sure that would help others.
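
As a starting point, here is a rough sketch of what those values could look like (the bucket name and sync interval are placeholders, AWS credentials handling such as IRSA or env vars is left out, and the airflow.kubernetesPodTemplate.* values would follow the same pattern minus the sidecar):

# rough sketch only; bucket name, interval, and credentials handling are placeholders
airflow:
  extraVolumes:
    - name: dags-s3
      emptyDir: {}
  extraVolumeMounts:
    - name: dags-s3
      mountPath: /opt/airflow/dags
  extraInitContainers:
    # one-off sync so the dags folder is populated before Airflow starts
    - name: s3-sync-init
      image: amazon/aws-cli
      args: ["s3", "sync", "s3://my-dag-bucket", "/opt/airflow/dags/", "--no-progress", "--delete"]
      volumeMounts:
        - name: dags-s3
          mountPath: /opt/airflow/dags
  extraContainers:
    # looping sidecar that keeps the dags folder in sync
    - name: s3-sync
      image: amazon/aws-cli
      command: ["/bin/sh", "-c"]
      args:
        - |
          while true; do
            aws s3 sync s3://my-dag-bucket /opt/airflow/dags/ --no-progress --delete
            sleep 60
          done
      volumeMounts:
        - name: dags-s3
          mountPath: /opt/airflow/dags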

PS: You can still use a PVC-based approach, where you have a Deployment (or CronJob) that syncs your S3 bucket into that PVC as described in https://github.com/airflow-helm/charts/issues/249#issuecomment-1196360490

thesuperzapper · May 01 '24 02:05