grist-core
Make a doc worker available again
While working on scaling our instance, we realized that we don't know how to make a removed doc worker available again.
We have this scenario in mind:
- a doc worker has registered itself in Redis;
- the doc worker gets busy for a little while and responds with high latency;
- a home worker attempts to assign a document to this doc worker but fails to reach it, so the doc worker is removed;
- the doc worker becomes responsive again, ready to accept documents, but AFAICT it won't register itself as available again.
If you agree that it would be nice for the doc worker to be registered as available again in such a case, what do you think of making the doc worker regularly (say, every 5 minutes) inspect the Redis database to ensure it is still registered?
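To make the idea concrete, here is a minimal sketch of the periodic self-check we have in mind, using ioredis; the `workers-available` set and the environment variables are hypothetical placeholders, since the actual registration keys in Grist may differ:

```typescript
import Redis from "ioredis";

// All names here are placeholders: REDIS_URL, DOC_WORKER_ID and the
// "workers-available" set stand in for however the worker actually
// registers itself; the real key names in Grist may differ.
const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");
const workerId = process.env.DOC_WORKER_ID || "doc-worker-1";
const CHECK_INTERVAL_MS = 5 * 60 * 1000; // the 5-minute interval proposed above

async function ensureRegistered(): Promise<void> {
  // Is this worker still listed as available?
  const isMember = await redis.sismember("workers-available", workerId);
  if (!isMember) {
    // We were removed (e.g. after a transient failure): register again.
    await redis.sadd("workers-available", workerId);
    console.log(`doc worker ${workerId} re-registered as available`);
  }
}

setInterval(() => {
  ensureRegistered().catch((err) => console.error("registration check failed:", err));
}, CHECK_INTERVAL_MS);
```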
Of course if we go this way, we will be happy to make a PR ourselves.
Hmm, the operations on this need to be done carefully. It is important not to end up with two workers that both consider themselves responsible for a document. We set up our stuff years ago, so I'm rusty on it, but I think the general flow is:
- Worker gets unhealthy.
- Worker is SIGTERMed - if it is at all responsive, this will trigger flushing of any docs not already synced to external storage (see the sketch after this list).
- Then, after a while, the worker is SIGKILLed and disposed of.
- A scaling rule ensures that a new replacement worker will be created.
- Separately, disposal of the worker triggers removal of routing to it from the load balancer.
- URL for checking its status now reaches a generic fallback that confirms its absence unambiguously.
- Removal from the pool and unassignment of its documents begin.
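The flush-on-SIGTERM step above amounts to something like the sketch below; `flushDocsToExternalStorage()` is a hypothetical stand-in for whatever the worker actually does to sync dirty documents out:

```typescript
// Hypothetical stand-in for syncing any not-yet-synced documents out
// to external storage (e.g. S3) before the process goes away.
async function flushDocsToExternalStorage(): Promise<void> {
  // ... flush each open document ...
}

process.on("SIGTERM", async () => {
  try {
    await flushDocsToExternalStorage();
  } catch (err) {
    console.error("flush on shutdown failed:", err);
  } finally {
    // Exit promptly; the orchestrator will SIGKILL after a grace period anyway.
    process.exit(0);
  }
});
```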
All of this is done in a very AWS-specific way (ECS Fargate) so no use to you.
Reusing a worker that recovers is possible but it would need care. For example, how would we ensure it knows that it has lost responsibility for all the documents it was previously managing, and knows that early enough to prevent it from pushing out any data that might conflict with another worker?
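One general pattern for that problem (to be clear, not something Grist implements today, just a sketch of a known technique) is a fencing token: every (re)assignment of a document bumps a counter in Redis, the worker remembers the value it was assigned under, and it only pushes to external storage if its value is still the latest. Key names here are hypothetical:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

// Hypothetical key: doc-epoch-{docId} is a counter bumped on every
// (re)assignment. The assigning home worker returns the new epoch to
// the doc worker, which remembers it as its fencing token.
async function assignDoc(docId: string, workerId: string): Promise<number> {
  const epoch = await redis.incr(`doc-epoch-${docId}`);
  await redis.set(`doc-worker-of-${docId}`, workerId);
  return epoch;
}

// Before pushing a document to external storage, the worker checks that
// its token is still the latest; if not, it has been superseded and
// must drop the document instead of writing.
async function mayPush(docId: string, myEpoch: number): Promise<boolean> {
  const current = Number(await redis.get(`doc-epoch-${docId}`));
  return current === myEpoch;
}
```

Note that the check-then-write in the worker still races, so to be airtight the external storage itself would have to reject writes carrying a stale token; the sketch only covers the Redis bookkeeping.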
Thank you for this explanation. So forcing a new worker to pop up seems to be the safer route, and your recommended approach.
I thought a worker could be "reused" if it only experienced a temporary network failure. Because the list of documents assigned to workers is available in Redis, it seemed to me that a worker registering in Redis again could see which documents it's still responsible for.
> I thought a worker could be "reused" if it only experienced a temporary network failure. Because the list of documents assigned to workers is available in Redis, it seemed to me that a worker registering in Redis again could see which documents it's still responsible for.
This seems possible, but you'd want to be careful that the worker is aware of its failure, and aware that it needs to check its assumptions against Redis. As currently implemented, my memory is that a worker will assume it remains responsible for a document that was previously assigned to it. That's what makes me nervous of two workers stepping on each other in what they put in external storage.
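If reuse were attempted, the recovery step might look roughly like the sketch below: on coming back, the worker re-reads its assignments from Redis and drops (without flushing) any document that has been reassigned in the meantime. The `docs-assigned-to-{workerId}` key is hypothetical:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

// In-memory set of documents this worker believes it is serving.
const openDocs = new Set<string>();

// Hypothetical key: docs-assigned-to-{workerId} is the Redis set of
// documents that home workers currently consider assigned to this worker.
async function reconcileAfterFailure(workerId: string): Promise<void> {
  const assigned = new Set(await redis.smembers(`docs-assigned-to-${workerId}`));
  for (const docId of [...openDocs]) {
    if (!assigned.has(docId)) {
      // Reassigned while we were unreachable: drop it without flushing,
      // so we cannot clobber whatever the new worker has written.
      openDocs.delete(docId);
    }
  }
}
```

The hard part is exactly what's described above: making sure this reconciliation runs before the worker pushes anything out.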
> Thank you for this explanation. So forcing a new worker to pop up seems to be the safer route, and your recommended approach.
That's what we did in order to work "with the grain" of AWS ECS. In a k8s setup you'd have a lot more freedom in how to manage workers, so I'm not ruling out a fixed pool; it could be more convenient in some ways. At Grist Labs we don't have experience doing it, but our ops team is tiny, a fraction of a person currently.