
Configure S3 to store Airflow Logs

robert-bryson opened this issue 1 year ago • 5 comments

Airflow can store logs in a variety of ways. If we go with the deployment scenario of Airflow on Cloud Foundry, we will want to store the logs externally. Airflow has built-in functionality for writing logs to S3 that uses an S3 connection to handle the auth.

See additional context https://github.com/GSA/data.gov/issues/4434 and https://github.com/GSA/datagov-harvester/pull/1.
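For reference, a minimal sketch of what that wiring could look like on cloud.gov, assuming a brokered S3 service bound to the app and `apache-airflow-providers-amazon` installed. The service label, credential keys, connection id, and log folder here are placeholders, not our actual config:

```python
# Sketch: point Airflow remote logging at a cloud.gov brokered S3 bucket by
# reading VCAP_SERVICES and exporting the env vars Airflow understands.
# Assumes the cloud.gov S3 broker's credential keys ("bucket",
# "access_key_id", "secret_access_key", "region"); adjust to the real binding.
import json
import os
from urllib.parse import quote_plus

vcap = json.loads(os.environ["VCAP_SERVICES"])
creds = vcap["s3"][0]["credentials"]  # first bound S3 service instance

# Define an Airflow connection via env var (aws:// connection URI form).
conn_id = "cloudgov_s3_logs"  # hypothetical connection id
os.environ[f"AIRFLOW_CONN_{conn_id.upper()}"] = (
    f"aws://{quote_plus(creds['access_key_id'])}:"
    f"{quote_plus(creds['secret_access_key'])}@/"
    f"?region_name={creds['region']}"
)

# Tell Airflow's logging to ship task logs to the bucket.
os.environ["AIRFLOW__LOGGING__REMOTE_LOGGING"] = "True"
os.environ["AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"] = f"s3://{creds['bucket']}/airflow-logs"
os.environ["AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID"] = conn_id
```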

How to reproduce

  1. View the logs for a DAG run (for example) on the https://test-airflow-webserver.app.cloud.gov/ deployment.
  2. See the error in the cf app logs: [screenshot]

Expected behavior

Populated logs

[screenshot]

Actual behavior

No logs

[screenshot]

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

  • Check the S3 service is set up correctly to allow Airflow to connect (see the connectivity sketch below)
  • Check the S3 connection in Airflow is set up correctly
  • ???
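A rough way to verify both checks at once, assuming the hypothetical connection id and bucket name from the sketch above (run from a one-off task or a `cf ssh` Python shell):

```python
# Sketch: round-trip a small object through S3 using Airflow's own S3Hook,
# so the same connection that remote logging would use gets exercised.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

LOG_BUCKET = "my-airflow-log-bucket"  # placeholder: the brokered bucket name

hook = S3Hook(aws_conn_id="cloudgov_s3_logs")  # placeholder connection id
assert hook.check_for_bucket(LOG_BUCKET), "bucket not reachable with these credentials"

# Confirm write and read permissions.
hook.load_string("ping", key="healthcheck/ping.txt", bucket_name=LOG_BUCKET, replace=True)
print(hook.read_key("healthcheck/ping.txt", bucket_name=LOG_BUCKET))  # -> "ping"
```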

robert-bryson avatar Sep 27 '23 16:09 robert-bryson

Question for the team: if we have logs in New Relic, do we want them in S3 as well?

btylerburton avatar Nov 30 '23 22:11 btylerburton

I suppose it doesn't make sense to double-store the logs. I believe the idea behind the S3 connector is that all of the various Airflow components can drop their logs in one place, and then you can use whatever you'd like to aggregate them from there. Since our team is already using New Relic for that, it probably isn't necessary. Should we icebox or close this?

robert-bryson avatar Dec 04 '23 19:12 robert-bryson

I'm going to close it. If we need to revive it, it should be easy enough since it has the h20 label.

btylerburton avatar Dec 05 '23 16:12 btylerburton

When the Airflow configuration changes (worker scaling, etc.), the logs are not guaranteed to remain accessible, from what I've seen, so I believe this is worth reopening and revisiting in the future.

[screenshot]

btylerburton avatar Dec 06 '23 20:12 btylerburton

It seems like there isn't a good consensus on operational and maintenance procedures. If the logs are in NR, we could create a process for linking to them (i.e. an API call that fetches the logs into whatever viewer makes sense, or just going to NR and becoming experts at finding logs haha...). The alternative of having "duplicate" logs only hurts if there's a heavy cost or maintenance burden involved, and neither sounds like the case here. NR only stores 3 months of logs as is, so S3 would give you longer log retention too.
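A hedged sketch of what that "API call to fetch the logs" could look like against New Relic's NerdGraph endpoint with an NRQL query over the Log event type; the account id, env var name, and query are placeholders:

```python
# Sketch: pull recent log lines from New Relic via NerdGraph + NRQL.
import os
import requests

ACCOUNT_ID = 1234567  # placeholder NR account id
graphql_query = """
{
  actor {
    account(id: %d) {
      nrql(query: "SELECT timestamp, message FROM Log SINCE 1 day ago LIMIT 100") {
        results
      }
    }
  }
}
""" % ACCOUNT_ID

resp = requests.post(
    "https://api.newrelic.com/graphql",
    json={"query": graphql_query},
    headers={"API-Key": os.environ["NEW_RELIC_API_KEY"]},  # NR user API key
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["data"]["actor"]["account"]["nrql"]["results"]:
    print(row["timestamp"], row.get("message"))
```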

nickumia avatar Dec 07 '23 04:12 nickumia