
Increase resilience of worker by breaking out the "fetch job" vs "do job" parts

Open cooperaj opened this issue 6 years ago • 1 comments

When the worker cannot communicate with its job provider (e.g. Redis), it fails with an exception. The only way to get it to resolve a new Redis server (assuming some sort of HA setup) is to restart the worker, because PHP, rather helpfully, caches name lookups.

By separating the fetching of jobs from the running of jobs, you can catch that case and make the worker exit (to be restarted by whatever scheduling tool you're using). In our situation this results in a fresh resolution to the revived/hot-spare Redis instance and a working queue.
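
To illustrate the idea, here is a minimal sketch in plain PHP. It is not the code in this PR. `FetchFailed`, `runOnce()` and `work()` are hypothetical names, and the job source is modelled as a plain callable rather than the real queue connection. The point is only that the fetch step and the run step get separate `try` blocks, so a provider failure can end the process while a failing job does not.

```php
<?php

// Hypothetical exception type: "could not reach the job provider".
final class FetchFailed extends RuntimeException {}

/**
 * Run one worker iteration with fetch and execution guarded separately.
 *
 * $fetch is assumed to return the next job as a callable, or null when
 * the queue is empty. Returns 'idle', 'done' or 'job_failed'.
 * Throws FetchFailed when the fetch step itself fails.
 */
function runOnce(callable $fetch): string
{
    try {
        $job = $fetch();
    } catch (Throwable $e) {
        // Provider unreachable: surface it so the caller can stop the worker.
        throw new FetchFailed('Could not fetch job: ' . $e->getMessage(), 0, $e);
    }

    if ($job === null) {
        return 'idle';
    }

    try {
        $job();
        return 'done';
    } catch (Throwable $e) {
        // A failing job is not a reason to stop the worker.
        return 'job_failed';
    }
}

/**
 * Loop for at most $maxIterations and return a process exit code:
 * 1 as soon as fetching fails (so a supervisor restarts the process),
 * 0 otherwise.
 */
function work(callable $fetch, int $maxIterations): int
{
    for ($i = 0; $i < $maxIterations; $i++) {
        try {
            runOnce($fetch);
        } catch (FetchFailed $e) {
            return 1;
        }
    }

    return 0;
}
```

In a real worker the entry script would presumably pass the result to `exit()`, so that the scheduling tool sees a non-zero status, restarts the process, and the new process resolves the Redis host again.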

cooperaj avatar Oct 18 '18 15:10 cooperaj

Coverage Status

Coverage increased (+11.3%) to 100.0% when pulling cc6db26f8d0211e3a5ec6e2ec147525e074ec81c on UniversityOfNottingham:feature/0.2-make-worker-resiliant into ed2cbf947961a3c6bb9b474865d12b9de8fe2141 on maxbrokman:0.2.

coveralls avatar Oct 19 '18 14:10 coveralls