
Add health-check endpoints

Open bzz opened this issue 7 years ago • 7 comments

Server and Analyzer should expose HTTP endpoints that can be used as Kubernetes liveness probes.

As discussed, at this point the probes could check the accessibility of their dependencies: DB, Bblfsh, etc.

Eventually, it would be really nice for such checks to become more complex: execute some basic scenarios and verify they pass (like simple live unit tests).

For now:

  • [ ] check what kinds of liveness probes k8s supports/recommends
  • [ ] add endpoints to Server and Dummy analyzer
  • [x] update charts with liveness probes

bzz avatar Aug 23 '18 17:08 bzz

@carlosms We started a discussion in 2 different places: PR and slack. I think it's better to move it here.

The discussions touch different topics, I'll try to summarize them here and give my opinion. I'll start from the "end", specific issues, and move to more general topics.

Proper handling of restarts. Especially posting comments on github.

The main argument I noticed: restarting in the middle of posting results would cause problems.

Restarts are a very common problem in software development, whether they are caused intentionally or not. That's why almost all systems already have solutions for them. For databases it's transactions. For queues it's publish confirmations. These concepts make it safe to be restarted in the middle of an operation: the rollback kicks in, and the application can retry the same operation after the restart.

HTTP APIs are usually designed with this issue in mind too. Either a retry causes a specific error that the application can easily handle, or the API provides some kind of transaction/confirmation as well.

In our particular case, posting comments, GitHub provides exactly such an API, which allows retries without any undesired effects for the end user (like duplicated comments). If we create a review without the body attribute, it becomes visible to the user only after the submit call.

It's safe to call submit multiple times.

To make my point clearer that a restart in the middle of posting isn't a problem, here is a list of changes for how we might improve our poster in the future:

  • Call create without body if we don't have the ID of the review in the state already
  • Save returned ID to the state
  • In case of failure, we can just retry later without any issues for an end user
  • Send confirm
  • Update status in DB
  • In case of failure, we can just safely retry later

Should we consider DB as a critical component of the system or not

The current argument:

Maybe a restart is the better option from our point of view, but from management it's uptime... maybe we prefer to keep posting analysis while the DB is down, and deal with the inconsistent state somehow later when the DB comes up again

According to our requirements, we can't post comments without the state being available; it would cause duplicated comments. Our current codebase doesn't allow posting comments, or even watching events, without the DB at all. As long as we must satisfy the no-duplicated-comments business requirement, we have to have some state available. For now, that's the DB. So I propose to treat it as a critical component from the start.

Should we consider Babelfish as a critical component of the system or not

Our current production analyzers require bblfsh. During the retreat, all but 1 project required bblfsh, which makes me think that, in terms of the whole product, bblfsh is the main component.

Another reason to treat it as a critical component is to avoid the domino effect. An unavailable bblfsh would cause errors not only on our side but on the analyzers' side too. We could avoid that by introducing more complexity: remembering which analyzers require bblfsh and sending requests only to the ones without this requirement, adding circuit breakers, retries, and other machinery. But restarts also solve these problems and are much easier to implement at the current stage of the project.

For now I would fail when bblfsh isn't available, and improve it later with better solutions.

The idea of complete graceful handling errors vs let it crash

Nobody argues that errors which can be handled gracefully should be handled that way. But there are lots of errors we as developers can't anticipate during development. And it gets much worse as soon as the software includes more than one component, and even worse when we make it distributed.

Apparently, most of the bugs that happen in production are heisenbugs. In the Why Do Computers Stop and What Can Be Done About It? paper, the author found in his experiments that only 1 bug out of 132 was an easily reproducible one that a restart doesn't fix.

Kubernetes gives us the power of fast automatic restarts and lets us handle critical errors in a way similar to Erlang: a supervisor restarts services without human involvement, which solves a lot of problems. Here is an example of a known bug and a theoretical bug that this solves, but there are tons of other bugs we haven't discovered yet or can't even imagine:

  • we clone repositories and put them on disk, but never clean up. Sooner or later the container would run out of free space. A restart solves it without waking up Rafa or us in the middle of the night.
  • we could forget to close a transaction in some cases, which would lead to running out of available connections. A restart solves it without waking up Rafa or us in the middle of the night.

A supervisor with restarts increases robustness, availability, and reliability. This is the reason why this issue exists in the first place.


I'll also add a few links here that you might be interested in checking:

smacker avatar Sep 14 '18 13:09 smacker

There is already (for workers and watcher):

  • /health/liveness, returns 200 if it was started
  • /health/readiness, returns 200 if all startup instructions succeeded

And for dummy:

  • /health/liveness, returns 200 if it was started
  • /health/readiness, returns 200 if it was started

Is it :point_up: enough for second subtask?

_add endpoints to Server and Dummy analyzer_

dpordomingo avatar Jan 11 '19 13:01 dpordomingo

I still think the DB & Bblfsh are critical components for lookout, and we should report a non-200 status when they are not available.

Plus, the concern about a restart in the middle of posting was already solved by syncing comments with GitHub whenever we realize the previous event wasn't finished correctly.

smacker avatar Jan 11 '19 13:01 smacker

I agree with you @smacker about the criticality of the DB and bblfsh. Would you create two new endpoints for them? Or should we adapt /health/readiness to also check the DB and bblfsh?

dpordomingo avatar Jan 11 '19 15:01 dpordomingo

I don't think you can define more than 1 endpoint for readiness and liveness checks in kubernetes. So it should be the same endpoints.

smacker avatar Jan 11 '19 16:01 smacker

Intuitively, it would make sense to me that the application signals liveness but not readiness if backend services are not working, but I'm not really sure about the implications, especially when deploying to kubernetes. It would be good to have some input from @rporres, just in case he has some insights into what he would expect from lookout in production.

smola avatar Jan 17 '19 10:01 smola

If there's any issue with any of an app's hard requirements, it makes sense that it does not receive requests. The main issue is to determine whether the application should be restarted for that or not, as a readiness probe failure in k8s won't force a pod restart.

rporres avatar Jan 17 '19 18:01 rporres