feat: add s3-sync sidecar
What issues does your PR fix?
- fixes #249
What does your PR do?
Overview
This Pull Request introduces a new feature, s3Sync, designed to enhance our application's ability to synchronize data with AWS S3. This addition aims to provide a more robust and flexible solution for managing cloud storage synchronization tasks.
Details
- New Feature: Implemented the `s3Sync` functionality using the official aws-cli and deliberately simple logic, covering the core requirements of stable operation and automatic detection of changes.
- Ensured that the new `s3Sync` feature is fully compatible with our existing infrastructure and does not introduce any breaking changes or dependencies.
No Changes to Existing gitSync Functionality
- It's crucial to note that while developing the `s3Sync` feature, special care was taken not to modify or affect the existing `gitSync` functionality. Our commitment was to add value without disrupting current operations or workflows.
- Comprehensive testing has been conducted to confirm that `gitSync` remains unaffected and operates as expected.
Testing and Validation
- Conducted basic tests such as ct lint and helm template, and performed operational tests on an actual Kubernetes cluster to confirm that existing features, especially `gitSync`, continue to operate without any issues.
Conclusion
This enhancement is a step forward in our ongoing efforts to provide a seamless and powerful toolset for our users. By introducing s3Sync, we are expanding our capabilities while ensuring the integrity and performance of our existing features remain intact.
I look forward to your feedback and any discussions regarding this PR. Thank you for considering these enhancements.
Checklist
For all Pull Requests
- [x] Commits are signed off
- [x] Commits have semantic messages
- [x] Documentation updated
- [x] Passes ct linting
For releasing ONLY
- [ ] Chart.yaml version bumped
- [ ] CHANGELOG.md updated
I am curious to know how syncing DAGs from S3 works. Do we need to create a Kubernetes secret with our AWS access key and secret key, and how often will it poll for new DAG files/folders?
This would be useful to have in the Helm chart. Git sync is sometimes not the best option.
Just another bump on this PR. This seems like the best option for Airflow deployments in EKS.
And just to clarify about AWS credentials, in general we would be using IAM roles rather than user credentials, so there should be no need for additional k8s secrets, or anything like that.
@chirichidi thanks for the very interesting PR, I would love to get "s3-sync" as a concept into the chart (as it will help users migrate from MWAA).
The main thing we need to finalize is the "reconciliation loop", everything else is secondary and can be updated later.
If I understand your PR correctly, you have done the following:
- You have implemented a "sidecar" pattern similar to our gitsync sidecar
  - An `init-container` which runs the following command (to populate the dags folder as the pod starts): `aws s3 cp --recursive s3://<BUCKET>/<PATH> ./path/to/dags`
  - A sidecar `container` which runs the following command on loop (to keep the dags folder up to date): `aws s3 sync --delete s3://<BUCKET>/<PATH> ./path/to/dags`
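
For discussion, the sidecar loop described above could look roughly like the sketch below. This is not taken from the PR: `SYNC_CMD`, `S3_URI`, `DAGS_DIR`, `SYNC_INTERVAL` and `SYNC_MAX_RUNS` are hypothetical names, and the 60-second default is only a placeholder for whatever polling interval the chart ends up exposing.

```shell
#!/bin/sh
# Sketch only: every variable name here is hypothetical, not a value from the PR.
S3_URI="${S3_URI:-s3://<BUCKET>/<PATH>}"
DAGS_DIR="${DAGS_DIR:-./path/to/dags}"
SYNC_INTERVAL="${SYNC_INTERVAL:-60}"   # seconds between syncs (placeholder default)
SYNC_MAX_RUNS="${SYNC_MAX_RUNS:-0}"    # 0 = loop forever; >0 is handy for dry runs
SYNC_CMD="${SYNC_CMD:-aws}"            # overridable so the loop can be exercised without AWS

# Mirror the bucket prefix into the dags folder, deleting files removed upstream.
sync_once() {
  "$SYNC_CMD" s3 sync --delete "$S3_URI" "$DAGS_DIR"
}

# Run sync_once repeatedly; a failed sync is logged and retried on the next pass
# instead of terminating the sidecar.
sync_loop() {
  runs=0
  while :; do
    sync_once || echo "sync failed; retrying in ${SYNC_INTERVAL}s" >&2
    runs=$((runs + 1))
    if [ "$SYNC_MAX_RUNS" -gt 0 ] && [ "$runs" -ge "$SYNC_MAX_RUNS" ]; then
      break
    fi
    sleep "$SYNC_INTERVAL"
  done
}
```

The sidecar's entrypoint would then just call `sync_loop`. Setting `SYNC_CMD=echo SYNC_MAX_RUNS=2 SYNC_INTERVAL=0` prints the commands that would run without touching S3.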
My main concerns are:
- What happens when a sync is halfway, but airflow starts refreshing the DAGs (so we have some old and some new)?
  - This is avoided in git-sync by using symbolic link switching.
  - It's likely rare for something seriously bad to happen when this occurs, so it might not matter.
- Wouldn't it make more sense to also use `aws s3 sync` for the init-container?
  - Because init-containers can sometimes run again (when the pod restarts), so we can save a bit of time by not re-downloading everything.
- I wonder if we might want to use some or all of the following `aws s3 sync` parameters:
  - `--quiet`
  - `--only-show-errors`
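
To make the symbolic link switching idea concrete, here is a minimal sketch of how a sync could be staged and then published in one step. It is an illustration, not what the PR currently does: `publish_dags`, `SYNC_CMD`, `S3_URI` and `DAGS_ROOT` are hypothetical names, it assumes GNU coreutils (`mv -T`, `cp -a`), and it assumes the `dags` path is a symlink (or absent) rather than a real directory.

```shell
#!/bin/sh
# Sketch only: names are hypothetical; assumes GNU coreutils (mv -T, cp -a).
S3_URI="${S3_URI:-s3://<BUCKET>/<PATH>}"
DAGS_ROOT="${DAGS_ROOT:-./path/to}"
SYNC_CMD="${SYNC_CMD:-aws}"

publish_dags() {
  current="$DAGS_ROOT/dags"   # symlink that Airflow reads from
  mkdir -p "$DAGS_ROOT"
  staging="$(mktemp -d "$DAGS_ROOT/.dags-XXXXXX")"
  chmod 755 "$staging"        # mktemp creates the directory as 0700

  # Seed staging from the live copy so sync only transfers the differences.
  if [ -d "$current" ]; then
    cp -a "$current/." "$staging/"
  fi

  # Switch over only if the sync finished; otherwise keep serving the old copy.
  if "$SYNC_CMD" s3 sync --delete --only-show-errors "$S3_URI" "$staging"; then
    previous="$(readlink "$current" 2>/dev/null || true)"
    ln -s "$(basename "$staging")" "$current.tmp"
    mv -T "$current.tmp" "$current"   # rename replaces the old link in one step
    if [ -n "$previous" ]; then
      rm -rf "$DAGS_ROOT/$previous"   # drop the superseded copy
    fi
  else
    rm -rf "$staging"
    return 1
  fi
}
```

The trade-off is extra disk usage and a local copy on every pass, which may or may not be worth it given how rarely a half-finished sync is likely to cause real problems.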
Are there any other things I have missed?
PS: if/when we merge this, I will update the values/docs in your PR to match the style of the chart.