grist-core icon indicating copy to clipboard operation
grist-core copied to clipboard

Shutdown Doc worker when it is not considered as available in Redis #831

Open fflorent opened this issue 1 year ago • 2 comments

Context

In #831, we asked whether we could make a doc worker available again when they can be reached again.

Instead of that, we suggest to just shutdown the worker and assume that another worker would be created instead.

Proposed solution

This PR proposes that the doc worker shutdowns itself in two cases:

  1. the worker is not marked as available anymore in Redis;
  2. the worker fails to join Redis to check its presence more than 3 times;

The check is at most every 30 seconds (30 seconds + the duration of requesting Redis).

fflorent avatar Feb 15 '24 11:02 fflorent

The test failure is not due to the instabilities. I convert this PR into draft until it is fixed.

fflorent avatar Feb 27 '24 13:02 fflorent

The failing test has been fixed: 997ee86e36b6f53d3aa1c14438d60f18ff50239a Also fixed an issue while running the Smoke tests when the environment is in French: fe3ec42b29c6a26e707731f042df84c3795a03e1

fflorent avatar Feb 27 '24 14:02 fflorent

Overall, the logic here feels similar to a container health check. It seems fairly reasonable to expect that a Grist installation with multiple workers is containerized, and probably has health checks running on each Grist instance? In our own deployments we use the /status endpoint as a health check, specifically /status?db=1&redis=1 which checks also for db and redis connectivity. I wonder if it might make sense to add an extra parameter to this endpoint to allow also for making the basic check you have here: if the Grist instance is a doc worker, and that worker has been registered in redis, and that registration is no longer present, then the health check should fail. That means the operator is free to decide how often to run the check, and whether to trust a failure immediately or do retries (though I don't think retries make sense for this particular check).

Not against this change, just proposing an alternative to see what you think @fflorent.

Skimmed the PR, and it looks generally in good shape except it could be worth clearing any pending timeout in the close() method for any IDocWorkerMap implementations so that FlexServers will shut down cleanly every time.

paulfitz avatar Mar 06 '24 21:03 paulfitz

@paulfitz Thanks for your remark! Sounds wise and much simpler, indeed. After discussing with ops, that seems to answer the need.

I applied your requested change, I just kept the upgrade of the Sinon dependency, as it was a painful move and it would remove a barrier to use features of newer versions in the future. If that's OK for you?

fflorent avatar Mar 12 '24 10:03 fflorent

@fflorent any thoughts on the LanguageSettings test failure?

paulfitz avatar Mar 20 '24 21:03 paulfitz

@paulfitz sould be fixed now.

fflorent avatar Mar 21 '24 07:03 fflorent

@paulfitz Right, fixed, rebased and pushed!

fflorent avatar Apr 03 '24 21:04 fflorent