
Jobs run with GKEOperator need `get_logs=False`, otherwise the job is likely to fail unless it is constantly logging to standard out

Open · wlach opened this issue · 2 comments

I noticed this while working on adding the missioncontrol-etl job (#840), but apparently this happened with the probe scraper as well.

tl;dr: a job must print something to standard out / error every 30 seconds or so, or else it will fail with a cryptic `IncompleteRead` error:

https://issues.apache.org/jira/browse/AIRFLOW-3534
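If the log streaming can't be disabled, the failure mode implies one crude mitigation: make the job itself emit output more often than the idle window. A minimal sketch of that (nothing here is from the thread; the interval and function name are made up):

```python
import threading
import time

def start_heartbeat(interval=15):
    """Print a keep-alive line to stdout every `interval` seconds so the
    log-streaming HTTP connection never sits idle long enough to drop."""
    def beat():
        while True:
            print("[heartbeat] still running", flush=True)
            time.sleep(interval)
    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return thread

# Call start_heartbeat() at the top of the containerized job, then do the
# real (possibly quiet) work on the main thread.
```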

I'm not sure there's an easy or good workaround here. The function causing the problem is read_namespaced_pod_log, which (AFAICT) uses a persistently open HTTP connection to the Kubernetes API to read the log under the hood:

https://github.com/apache/airflow/blob/c890d066965aa9dbf3016f41cfae45e9a084478a/airflow/kubernetes/pod_launcher.py#L173
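For reference, a standalone approximation of what pod_launcher does with the kubernetes Python client (pod name and namespace are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# follow=True holds a single HTTP connection open for the pod's lifetime;
# _preload_content=False returns a raw urllib3 response we can iterate
# line by line. If the pod stays silent past the idle timeout, the read
# fails with IncompleteRead and the stream is gone.
resp = v1.read_namespaced_pod_log(
    name="my-pod",
    namespace="default",
    container="base",
    follow=True,
    tail_lines=10,
    _preload_content=False,
)
for line in resp:
    print(line.decode("utf-8"), end="")
```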

I did some spelunking in the kubernetes Python client repository and issue tracker, and to be honest it doesn't seem like this type of use case is really taken into account by the API. There is no way to pick up the logs again in the event of a timeout or similar; see for example this issue comment:

https://github.com/kubernetes-client/python/issues/199#issuecomment-430123395
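To make the limitation concrete: the closest thing to resuming is re-requesting with since_seconds, which only has one-second granularity, so a reconnect can duplicate or drop lines around the failure point. A sketch of that dead end (placeholder names; not proposed as a fix):

```python
import time

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def stream_logs(since=None):
    return v1.read_namespaced_pod_log(
        name="my-pod",
        namespace="default",
        follow=True,
        since_seconds=since,
        _preload_content=False,
    )

resp = stream_logs()
last_read = time.time()
while True:
    try:
        for line in resp:
            print(line.decode("utf-8"), end="")
            last_read = time.time()
        break  # stream closed cleanly; the pod is done
    except Exception:
        # since_seconds is relative and second-granular: anything logged
        # in the same second as last_read may repeat, and anything else in
        # the gap is lost. There is no offset or cursor to resume from.
        resp = stream_logs(since=int(time.time() - last_read) + 1)
```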

The workaround is just to not get the logs and rely on Stackdriver logging. This is pretty non-ideal: it significantly increases the amount of filtering/spelunking you need to do when something goes wrong. Filing this issue for internal visibility, as it's a pretty serious gotcha.
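What the workaround looks like in a DAG, assuming the Airflow 1.10 contrib import path in use at the time (the project, cluster, and image names are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

dag = DAG(
    "missioncontrol_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

missioncontrol_etl = GKEPodOperator(
    task_id="missioncontrol_etl",
    project_id="my-gcp-project",      # hypothetical
    location="us-central1-a",         # hypothetical
    cluster_name="my-gke-cluster",    # hypothetical
    name="missioncontrol-etl",
    namespace="default",
    image="gcr.io/my-gcp-project/missioncontrol-etl:latest",  # hypothetical
    # Don't stream pod logs through Airflow; output still lands in
    # Stackdriver. Avoids the IncompleteRead failure described above.
    get_logs=False,
    dag=dag,
)
```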

wlach · Jan 22 '20 21:01

Thought I had solved it with this, but it doesn't actually fix the issue: https://github.com/wlach/airflow/commit/e7ae01ac608d3f944b691875a9cf90dceb60ebcc

Will make further comments on my investigation in the Airflow issue tracker, starting with: https://issues.apache.org/jira/browse/AIRFLOW-3534?focusedCommentId=17023334&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17023334

wlach · Jan 24 '20 23:01

Can you check in your logs whether your task is marked as a zombie? If it is, increase scheduler_zombie_task_threshold from the default of 5 minutes to something larger (~n minutes). When no logs are printed, it seems the worker doesn't send a heartbeat to the DB, and the scheduler marks the task as failed after scheduler_zombie_task_threshold minutes.

brihati · Apr 10 '20 09:04
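For reference, this threshold lives in the [scheduler] section of airflow.cfg; the default of 300 seconds matches the 5 minutes mentioned above. A sketch of raising it (the new value is arbitrary):

```ini
[scheduler]
# Default is 300 (seconds). A larger value gives a quiet-but-alive task
# more slack before the scheduler reaps it as a zombie.
scheduler_zombie_task_threshold = 1800
```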