
Set matching cache status HTTP code

ruial opened this issue 2 years ago · 5 comments

This endpoint always returns status 200, even if one or more caches are not alive. With this change, if one of the caches is dead, it returns status 500, which health-checking tools can then use to restart the service.
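
For illustration, a minimal sketch of the proposed behavior (not the exact PR diff), assuming an aiohttp handler and a hypothetical `caches` registry whose items expose `name` and `is_alive` attributes:

```python
from aiohttp import web

async def cache_status(request):
    # Hypothetical registry of cache clients; metaflow-service's actual
    # wiring may differ.
    caches = request.app["caches"]
    body = [{"name": c.name, "is_alive": c.is_alive} for c in caches]
    # Return 500 when any cache is dead, so load balancers and liveness
    # probes can act on the status code alone.
    status = 200 if all(c.is_alive for c in caches) else 500
    return web.json_response(body, status=status)
```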

I have previously monkey patched https://github.com/Netflix/metaflow-service/pull/292 into the service I maintain but I still found issues where 1+ cache was not restarted correctly. This is hard to reproduce and happens from time to time. With this change, I no longer need to restart the Metaflow service manually.

ruial avatar Aug 30 '22 13:08 ruial

It feels a little janky to return a 500 status code just as a function of the response content here. Could the health check tool use other info in the response instead? E.g.

  • The health check tool inspects the body and does the all(is_alive) check itself.
  • The metadata service sets an all_caches_alive flag either in the response body or in a response header (sketched after this list).
  • A new endpoint made to indicate the need to restart. We have /ping for this purpose today, but we could consider another endpoint with more sophisticated internal criteria.
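
As a sketch of the second option, the endpoint could keep status 200 but expose the aggregate flag in the body and a header; the names here are illustrative, not an existing API:

```python
from aiohttp import web

async def cache_status(request):
    caches = request.app["caches"]  # hypothetical registry of cache clients
    all_alive = all(c.is_alive for c in caches)
    body = {
        "all_caches_alive": all_alive,
        "caches": [{"name": c.name, "is_alive": c.is_alive} for c in caches],
    }
    # A health checker that can parse JSON (or read headers) would key
    # off this flag instead of the HTTP status code.
    return web.json_response(
        body, status=200, headers={"X-All-Caches-Alive": str(all_alive)}
    )
```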

jackie-ob avatar Aug 30 '22 15:08 jackie-ob

Hey @jackie-ob, setting an HTTP status code >= 400 is the standard way to signal that a service is unhealthy (in AWS ELB and k8s):

  • https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
  • https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request

In the case of ELB I don't think the response body can be evaluated to determine the service health.
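
To make that concrete: an ELB- or kubelet-style HTTP check looks only at the response status, roughly like this simplified stdlib poller (illustrative, not how either tool is actually implemented):

```python
import urllib.request
import urllib.error

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    # Treat any status >= 400 (or a connection failure) as unhealthy,
    # mirroring ELB/k8s defaults. urlopen raises HTTPError for >= 400,
    # and HTTPError is a subclass of URLError.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        return False
```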

You guys are also using this technique correctly in other places, see https://github.com/Netflix/metaflow-service/blob/master/services/metadata_service/api/admin.py#L59

Please let me know if this PR makes sense or if I should change something.

ruial avatar Aug 30 '22 15:08 ruial

What issues were you seeing in the cache? I believe they may be addressed in #327, which should keep the caches alive.

romain-intel avatar Aug 30 '22 15:08 romain-intel

@ruial we are planning to ship https://github.com/Netflix/metaflow-service/pull/327 today, as @romain-intel mentioned. Would you like to see if that makes the need for this PR go away altogether?

If it is still a problem, our preference would be to create a new endpoint (similar to what you referenced).

jackie-ob avatar Aug 30 '22 16:08 jackie-ob

Sure, I can try the new version, though I think the fix from https://github.com/Netflix/metaflow-service/pull/292/files#diff-17bee700a8b2eb91cc38dcb3d7e30cd88769966fdb21c724393a5746f664d3cf is not included. A new endpoint dedicated to health checking the cache (and maybe the database) would also work. The current issue I have is that all the cache worker subprocesses get terminated and don't restart, so everything in the UI gets a CacheServerUnreachable exception. I don't know how to reproduce the issue, as it happens rarely.

Everything else with Metaflow works well; the only issue we currently have is with the cache. Thank you :)

ruial avatar Aug 30 '22 16:08 ruial