feat(api): Add a clickhouse connection sanity check to api healthcheck
Overview
This is a follow-up item for inc-894, where an API pod landed on a bad node with no ClickHouse connection. Because it was not configured with a thorough healthcheck, the pod stayed in service and kept serving 5XX responses to all requests.
Adds a connection check to verify that at least one ClickHouse query node is operable. Fails the healthcheck if all query nodes are inoperable, since that likely indicates a connection error with the host, so k8s can rotate the pod. Renames the previous check_clickhouse function to check_all_tables_present to distinguish the two.
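For illustration, here is a minimal sketch of what such a connection check could look like, assuming the clickhouse-driver package; the host list and function name are hypothetical, and the real implementation goes through Snuba's own connection handling:

```python
from clickhouse_driver import Client
from clickhouse_driver.errors import Error as ClickhouseError

# Hypothetical list of query-node hosts; the real values come from config.
QUERY_NODE_HOSTS = ["clickhouse-query-0", "clickhouse-query-1"]

def check_clickhouse_connections(hosts=QUERY_NODE_HOSTS) -> bool:
    """Return True if at least one query node answers a trivial query."""
    for host in hosts:
        try:
            # SELECT 1 is a cheap round trip proving the node is reachable
            # and accepting queries.
            Client(
                host=host, connect_timeout=1, send_receive_timeout=1
            ).execute("SELECT 1")
            return True
        except (ClickhouseError, OSError):
            # This node is unreachable or unhealthy; try the next one.
            continue
    # All query nodes failed, which likely indicates a connection problem
    # on the host itself, so the healthcheck should fail and let k8s
    # rotate the pod.
    return False
```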
LGTM. A couple of questions:
- Is this the liveness probe solution we discussed during the post mortem of inc-894? If so, can we call out the incident number and resolution in the PR description?
- In addition to the query node health, why are we not also checking the storage node health? Or is that already done somewhere else?
Thanks for the questions, @onkar
- Yes, this is the follow-up for inc-894. I didn't reference the inc number in the PR because this is a public open source repo and incidents are internal to the company, but I updated the description.
- This is technically done by the ingestion consumers; they're the only ones directly communicating with those nodes. If they are unable to perform an insert, they will exit with a bad status code, forcing k8s to rotate them (sketched below). Checking them here would not be appropriate because we still want to serve requests for existing data even if ingestion stopped.
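A minimal sketch of that crash-on-failure pattern, for illustration only; the `consumer` and `writer` names here are hypothetical, not Snuba's actual API:

```python
import sys

def consume_and_insert(consumer, writer):
    # `consumer` yields message batches; `writer` inserts into the
    # ClickHouse storage nodes (both names are illustrative).
    for batch in consumer:
        try:
            writer.insert(batch)
        except Exception:
            # Exiting with a non-zero status surfaces the insert failure
            # to k8s, which then rotates the consumer pod.
            sys.exit(1)
```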
Thanks for the response, @john-z-yang.