data.gov
POC Airflow DAG pipeline with Kubernetes Executor
User Story
In order to meet our production SLAs, our new Harvesting platform will need to perform well at scale. The recommended best practice for doing so is to employ the Kubernetes Executor, as outlined here.
This ticket will create a POC sample pipeline that we can learn from and tune.
Acceptance Criteria
- [ ] GIVEN we create a pipeline with Airflow AND we are using the Kubernetes Executor WHEN the pipeline runs THEN we can begin to understand the issues with running these technologies at scale
Background
Operating Airflow at scale presents unique issues for each deployment. We can only begin to understand the nuances of our platform by iterating on a production ETL pipeline using the same conventions that we know we need to operate at scale.
Security Considerations (required)
- Security controls are not yet handled properly in the current iteration of SSB with EKS
Sketch
- [ ] Spin up Airflow instance using local KinD cluster
- [ ] Utilize Kubernetes Executor to run tasks in a K8s pod that's part of SSB boundary
- [ ] Configure an ETL pipeline to process DCAT records
- [ ] Determine an appropriate harvest source to use for our baseline
- [ ] Use the Astro SDK to grab data from S3 or the CKAN API
- [ ] Create a single transformation step in the pipeline using the Kubernetes executor
- [ ] Do some small amount of data processing in the K8s container using Snowflake
- [ ] Return the result to Airflow
- [ ] Determine the best way to monitor a TaskGroup and its accompanying Tasks so that we can use data-driven methods to improve our implementation
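The single transformation step sketched above could start as a plain Python callable that the Kubernetes Executor runs in its own task pod. A minimal sketch, assuming the field names of the DCAT-US (data.json) dataset schema; the sample record, function name, and URLs are illustrative only and not part of this ticket:

```python
def transform_dcat_record(record: dict) -> dict:
    """Flatten one DCAT-US dataset record into a shape a downstream
    load step (e.g. into Snowflake) could consume.

    Field names ("identifier", "title", "modified", "distribution",
    "downloadURL") follow the DCAT-US data.json schema; error handling
    is intentionally minimal for a POC.
    """
    distributions = record.get("distribution", [])
    return {
        "identifier": record["identifier"],
        "title": record.get("title", "").strip(),
        "description": record.get("description", ""),
        "modified": record.get("modified"),
        "download_urls": [
            d["downloadURL"] for d in distributions if "downloadURL" in d
        ],
    }


# Hypothetical record for illustration. In the DAG, this function would
# run inside the executor's task pod (e.g. as a @task-decorated callable)
# and its return value would flow back to Airflow via XCom.
sample = {
    "identifier": "https://data.example.gov/dataset/1",
    "title": "  Sample Harvest Source  ",
    "modified": "2023-01-01",
    "distribution": [{"downloadURL": "https://data.example.gov/file.csv"}],
}
print(transform_dcat_record(sample)["title"])  # -> Sample Harvest Source
```

Keeping the transformation a pure function makes it easy to unit-test outside the cluster before wiring it into the Kubernetes Executor pipeline.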