feat(scheduler): liveness check to detect deadlocks
Motivation
We have observed from customer logs that the scheduler sometimes appears to hit a deadlock and stops responding to gRPC requests. Customers then have to restart the scheduler manually to mitigate this.
This PR introduces a liveness check that verifies there are no deadlocks in any of the critical services. It attempts to acquire each lock and release it. If a lock cannot be acquired, the call blocks, the liveness check times out and is marked as failed, and this eventually causes a restart.
Summary of changes
- increased the liveness period to 20 seconds to avoid causing delays in processing control plane events
- liveness timeout of 10 seconds to give a generous amount of time if the scheduler is busy processing events
- heartbeats on the following, each of which acquires and immediately releases the locks that could cause blocking behaviour:
  - agent gRPC server
  - data-flow-engine gRPC server
  - scheduler server
  - experiment svc
Also introduced a new Makefile target, kind-install-scheduler, because testing changes through Ansible was taking a long time. The target builds the scheduler Docker image, tags it with the tag currently deployed in Kind, and then restarts the scheduler. To speed up the build further, I removed the step that runs the tests before building. IMHO this isn't needed and shouldn't be part of the build process, as this is what the pipeline is for.
So to build and deploy the scheduler, you can now run:
make -C scheduler kind-install-scheduler
We should probably change all the other Docker images so they don't run the tests either, which would speed up builds and deployments and give quicker testing feedback.
Checklist
- [ ] Added/updated unit tests
- [ ] Added/updated documentation
- [ ] Checked for typos in variable names, comments, etc.
- [ ] Added licences for new files