data.gov
Configure S3 to store Airflow Logs
Airflow can store logs in a variety of ways. If we use the deploy scenario of Airflow on Cloud Foundry, we would want to store the logs externally. Airflow has functionality for writing logs to s3, using an s3 connector to handle the auth.
See additional context https://github.com/GSA/data.gov/issues/4434 and https://github.com/GSA/datagov-harvester/pull/1.
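As a rough illustration of the configuration involved, the sketch below builds the environment variables that Airflow reads for S3 remote logging from a Cloud Foundry `VCAP_SERVICES` payload. Several things here are assumptions rather than facts from this issue: the helper name `airflow_s3_logging_env`, the connection id `s3_logs`, the `airflow/logs` prefix, the `s3` service key, and the credential field names (`access_key_id`, `secret_access_key`, `bucket`, `region`). The `AIRFLOW__LOGGING__*` names apply to Airflow 2.x, and the JSON form of `AIRFLOW_CONN_*` needs Airflow 2.3 or later.

```python
import json


def airflow_s3_logging_env(vcap_services, conn_id="s3_logs", prefix="airflow/logs"):
    """Build Airflow env vars for S3 remote logging (hypothetical helper).

    Assumes the bound bucket is listed under an "s3" key in VCAP_SERVICES and
    that its credentials carry access_key_id, secret_access_key, bucket, region.
    """
    creds = json.loads(vcap_services)["s3"][0]["credentials"]

    # Airflow's AWS connection takes the access key as login and the secret as password.
    connection = {
        "conn_type": "aws",
        "login": creds["access_key_id"],
        "password": creds["secret_access_key"],
        "extra": {"region_name": creds["region"]},
    }

    return {
        "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
        "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER": f"s3://{creds['bucket']}/{prefix}",
        "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID": conn_id,
        # Connections can be supplied as AIRFLOW_CONN_<CONN_ID in upper case>.
        f"AIRFLOW_CONN_{conn_id.upper()}": json.dumps(connection),
    }
```

The returned mapping could be exported in the app's start script or set with `cf set-env` on every Airflow component (webserver, scheduler, workers), since each of them needs the same settings to read and write the logs.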
How to reproduce
- View the logs on a dag run (for example) on the https://test-airflow-webserver.app.cloud.gov/ deployment.
- See the error in the cf app logs:
Expected behavior
Populated logs
Actual behavior
No logs
Sketch
[Notes or a checklist reflecting our understanding of the selected approach]
- Check the s3 service is set up correctly to allow Airflow to connect
- Check the s3 connection in Airflow is set up correctly
- ???
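The Airflow-side check could start with something like the sketch below, which inspects an environment mapping for the Airflow 2.x remote logging settings and reports anything that looks off. The function name `check_remote_logging_env` is hypothetical, and the check is only a first pass: a connection may also be defined in the Airflow metadata database or a secrets backend, in which case the last message is a hint, not an error. It does not contact S3, so it says nothing about bucket permissions.

```python
def check_remote_logging_env(env):
    """Return a list of likely problems with S3 remote logging settings.

    `env` is a mapping such as dict(os.environ). An empty list means the
    settings look consistent; it does not prove that S3 is reachable.
    """
    problems = []

    if env.get("AIRFLOW__LOGGING__REMOTE_LOGGING", "").lower() != "true":
        problems.append("remote logging is not enabled")

    folder = env.get("AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER", "")
    if not folder.startswith("s3://"):
        problems.append("remote base log folder is not an s3:// URL")

    conn_id = env.get("AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID", "")
    if not conn_id:
        problems.append("no remote log connection id is set")
    elif f"AIRFLOW_CONN_{conn_id.upper()}" not in env:
        # The connection might still exist in the metadata DB or a secrets backend.
        problems.append(f"connection {conn_id!r} is not defined via AIRFLOW_CONN_*")

    return problems
```

Running it inside each component (for example through `cf ssh`) with `check_remote_logging_env(dict(os.environ))` would show whether the webserver and workers agree on the same settings.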
Question for the team: if we have logs in New Relic, do we also want them in S3?
I suppose it doesn't make sense to double store the logs. I believe the idea behind the s3 connector is that all the various airflow components can drop logs in one place and then you can use whatever you'd like to aggregate them from there. Since our team is already using New Relic for this, it probably isn't necessary. Should we icebox or close this?
I'm going to close it. If we need to revive it, that should be easy enough since it has the h20 label.
From what I've seen, when the Airflow configuration changes (worker scaling, etc.), the logs are not guaranteed to remain accessible, so I believe this is worth reopening and revisiting in the future.
It seems like there isn't a good consensus on operational and maintenance procedures. If the logs are in NR, then creating a process for linking to those logs might work (i.e. doing an API call to fetch the logs in whatever viewer makes sense, OR just going to NR and being an expert in finding logs haha...). The alternative of having "duplicate" logs only hurts if there's a heavy cost or maintenance burden involved, neither of which sounds like the case. NR only stores 3 months of logs as is, so S3 would give you longer log storage too.