charts make timeout for check-db configurable, and ensure logs are written when timeout happens

Checks

[X] I have checked for existing issues.
[X] This report is about the User-Community Airflow Helm Chart.

Motivation

In our company we are currently setting up a cluster for airflow, and while debugging some networking issues between our cluster and rds we seem to constantly hit the timeout settings in here https://github.com/airflow-helm/charts/blob/main/charts/airflow/templates/_helpers/pods.tpl#L72-L75 , which once hit hides all of the important logs that we would have received if the process failed on it's on.

The way we handle it right now is by manually editing the container's command and setting the timeout to something higher.

Implementation

Here are some different ideas:

Include a variable under airflow.check-db that would allow us to set a custom timeout
Remove the timeout completely, since this should be handled internally by airflow with the CONNECTION_CHECK_SLEEP_TIME and CONNECTION_CHECK_MAX_COUNT variables (see https://airflow.apache.org/docs/docker-stack/entrypoint.html#waits-for-airflow-db-connection ) -- I prefer this solution

Are you willing & able to help?

[X] I am able to submit a PR!
[X] I can help test the feature!

Jun 21 '22 15:06 asosnovsky-sumologic

@thesuperzapper I can make a PR for this, but which approach would you prefer I take? (if you don't care I can just remove the timeout, which is my preference)

Jun 23 '22 17:06 asosnovsky-sumologic

We can pick the timeout value from values.yaml by using something like .Values.checkDb.timeout and add if else condition to use default timeout value? I am also facing similar issue, can you tell me how do you manually edit the container's command?

Aug 18 '22 12:08 anuragphadnis

@asosnovsky-sumologic @anuragphadnis

Thanks for raising this, I am not sure why I did not introduce a value to change the timeout length when I first introduced the timeout.

However, I do remember that there are situations when airflow checkdb and airflow db check will hang forever without actually failing (like in your described situation of network issues, https://github.com/airflow-helm/charts/issues/153), so you probably would not see network logs if you remove the timeout anyway.

While you are correct that the airflow Dockerfile has an in-built check-db, this is not suitable for the chart for 2 primary reasons:

We want a separate init-container to perform the check-db (so that the main container does not even try and start if the db is down)
Older versions of the Dockerfile (like those for 1.10) do not have the check-db feature

My thought is that we can do two things:

Introduce a value like airflow.dbCheckTimeout (probably setting this to 0 disables the timeout).
Check if we get more logs when we lower the main signal to SIGINT, and introduce a KILL signal at a slightly later timeout to ensure the process ends:
- The command might look something like: timeout --signal=INT --kill-after=5s 60s airflow checkdb
- NOTE: I am not sure if SIGINT will actually terminate airflow db check, or if it will even give better logging (in network failure situations), we need to verify this, otherwise this change is not neeeded

Other thoughts:

60s is an incredibly long time for a timeout that literally just connects to a DB, if a connection takes more time than that, its very likely your DB is not functional
60s is the timeout for a single connection attempt, Kubernetes will retry the container (with exponential backoff) so its not like its 60s and then a permanent failure

Aug 18 '22 22:08 thesuperzapper

In my case the check-db pod gets restarted after every 60 seconds and connection is not being able to establish in this time. This happens every time after check-db is restarted, and check-db does not log anything. It would be very useful if we can customize the value so that we can see logs or establish the connection.

Aug 19 '22 05:08 anuragphadnis

@anuragphadnis

I agree with this line of thinking. In my case I had a networking issue that restricted the db from being reachable by my cluster. The issue was hard to capture because when you do the math with CONNECTION_CHECK_SLEEP_TIME and CONNECTION_CHECK_MAX_COUNT no errors actually got thrown as it just silently kept on retrying. I actually think that the timeout should possibly be set as a dynamic value based on these two environment params?

Like do something along the lines of

{{- if .Values.airflow.legacyCommands }}
- "exec timeout {{mul .Values.config.CONNECTION_CHECK_SLEEP_TIME .Values.config.CONNECTION_CHECK_MAX_COUNT}}s airflow checkdb"
{{- else }}
- "exec timeout {{mul .Values.config.CONNECTION_CHECK_SLEEP_TIME .Values.config.CONNECTION_CHECK_MAX_COUNT}}s airflow db check"
{{- end }}

Or even provide these values under

# value.yaml
airflow:
 db_connection:
  check_max_count: 10
  check_sleep_time: 30

Sep 12 '22 17:09 asosnovsky

Yes @asosnovsky

Right now I have modified it in a similar way to use value from config and using that code to deploy the chart. I can create a PR if someone else is not working on it.

Sep 13 '22 11:09 anuragphadnis

@anuragphadnis i don't mind taking this to completion. If your change is already merged, I can do the rest of the work and update the checkdb command :)

Sep 13 '22 14:09 asosnovsky

@asosnovsky @anuragphadnis @asosnovsky-sumologic

If anyone wants to pick up the tasks I suggested in https://github.com/airflow-helm/charts/issues/615#issuecomment-1220050680, I am happy to review/merge the PR, the tasks are:

Introduce a value like airflow.dbCheckTimeout (probably setting this to 0 disables the timeout).
Find a way to actually get logs when a timeout is hit (so people know what is happening):
- We might be able to do this by lowering the main signal to SIGINT, and introducing a KILL signal at a slightly later timeout to ensure the process ends:
  - The command might look something like: timeout --signal=INT --kill-after=5s 60s airflow checkdb
  - NOTE: I am not sure if SIGINT will actually terminate airflow db check, we need to verify this
- Alternatively, we can add our own bash trap wrapper that will log a "connection timed out message" if a SIGTERM is passed

Sep 14 '22 01:09 thesuperzapper

charts charts copied to clipboard

make timeout for check-db configurable, and ensure logs are written when timeout happens

Checks

Motivation

Implementation

Are you willing & able to help?

charts
charts copied to clipboard