feat(api): Add a clickhouse connection sanity check to api healthcheck
Overview
This is a follow-up item for inc-894, where an API pod landed on a bad node with no ClickHouse connection. Because it was not configured with a thorough healthcheck, the pod stayed in service and kept serving 5XX responses to all requests.
Adds a connection check to verify that at least one ClickHouse query node is operable. Fails the healthcheck if all query nodes are inoperable, since that likely indicates a connection error with the host, so k8s can rotate the pod. Renames the previous check_clickhouse function to check_all_tables_present to distinguish the two.
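For illustration, here is a minimal sketch of what such a connection check could look like, assuming the clickhouse-driver package; the host list and function name are hypothetical, and the real implementation goes through Snuba's own connection handling:

```python
from clickhouse_driver import Client
from clickhouse_driver.errors import Error as ClickhouseError

# Hypothetical list of query-node hosts; the real values come from config.
QUERY_NODE_HOSTS = ["clickhouse-query-0", "clickhouse-query-1"]

def check_clickhouse_connections(hosts=QUERY_NODE_HOSTS) -> bool:
    """Return True if at least one query node answers a trivial query."""
    for host in hosts:
        try:
            # SELECT 1 is a cheap round trip proving the node is reachable
            # and accepting queries.
            Client(
                host=host, connect_timeout=1, send_receive_timeout=1
            ).execute("SELECT 1")
            return True
        except (ClickhouseError, OSError):
            # This node is unreachable or unhealthy; try the next one.
            continue
    # All query nodes failed, which likely indicates a connection problem
    # on the host itself, so the healthcheck should fail and let k8s
    # rotate the pod.
    return False
```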
LGTM. A couple of questions:
- Is this the liveness probe solution we discussed during the post mortem of inc-894? If so, can we call out the incident number and resolution in the PR description?
- In addition to the query node health, why are we not also checking the storage node health? Or is that already done somewhere else?
Thanks for the questions, @onkar
- Yes, this is the follow-up for inc-894. I didn't reference the inc number in the PR because this is a public open source repo and incidents are internal to the company, but I updated the description.
- This is technically done by the ingestion consumers; they're the only ones directly communicating with those nodes. If they are unable to perform an insert, they will exit with a bad status code, forcing k8s to rotate them (sketched below). Checking them here would not be appropriate because we still want to serve requests for existing data even if ingestion stopped.
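A minimal sketch of that crash-on-failure pattern, for illustration only; the `consumer` and `writer` names here are hypothetical, not Snuba's actual API:

```python
import sys

def consume_and_insert(consumer, writer):
    # `consumer` yields message batches; `writer` inserts into the
    # ClickHouse storage nodes (both names are illustrative).
    for batch in consumer:
        try:
            writer.insert(batch)
        except Exception:
            # Exiting with a non-zero status surfaces the insert failure
            # to k8s, which then rotates the consumer pod.
            sys.exit(1)
```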
Thanks for the response, @john-z-yang.