seldon-core

feat(scheduler): liveness check to detect deadlocks

Open domsolutions opened this issue 5 months ago • 0 comments

Motivation

Customer logs show that the scheduler sometimes appears to hit a deadlock and stops responding to gRPC requests. Customers then have to restart the scheduler manually to mitigate this.

This PR introduces a liveness check that verifies there are no deadlocks in any of the critical services. The check attempts to acquire each lock and then release it. If a lock cannot be acquired, the check blocks, the liveness probe times out and is marked as failed, and the scheduler is eventually restarted.
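The acquire-and-release idea can be sketched in Go as below. This is a minimal illustration, not the code in this PR: `store`, `probeLock`, `errProbeTimeout` and `demoTimeout` are hypothetical names, and the short timeout is chosen only so the demo finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// demoTimeout is deliberately short for the demo; it is not the value used by the PR.
const demoTimeout = 100 * time.Millisecond

// errProbeTimeout is returned when the lock could not be acquired in time.
var errProbeTimeout = errors.New("liveness probe timed out waiting for lock")

// store is a hypothetical stand-in for a scheduler component guarded by a mutex.
type store struct {
	mu sync.Mutex
}

// probeLock acquires and immediately releases l in a separate goroutine.
// If the lock is never released by its current holder, the acquire blocks
// and probeLock reports a timeout instead of hanging the caller.
// Note: on timeout the helper goroutine stays blocked; that is acceptable
// here only because a failed liveness check leads to a process restart.
func probeLock(l sync.Locker, timeout time.Duration) error {
	done := make(chan struct{})
	go func() {
		l.Lock()
		l.Unlock()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return errProbeTimeout
	}
}

func main() {
	s := &store{}
	fmt.Println("free lock:", probeLock(&s.mu, demoTimeout))

	s.mu.Lock() // simulate a lock that is never released
	fmt.Println("held lock:", probeLock(&s.mu, demoTimeout))
}
```

Running the probe in its own goroutine keeps the caller responsive, so the failure surfaces as a timeout rather than as a second stuck goroutine in the health endpoint.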

Summary of changes

  • increased the liveness period to 20 seconds to avoid causing delays in processing control plane events
  • liveness timeout of 10 seconds to give a generous amount of time if scheduler is busy processing events
  • heartbeats on the agent gRPC server, the data-flow-engine gRPC server, the scheduler server, and the experiment service; each heartbeat acquires and immediately releases the locks that could cause blocking behaviour
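One way the per-service heartbeats and the timeout could fit together is sketched below. It assumes the liveness check is exposed over HTTP, which this description does not state; `service`, `heartbeat`, `livenessHandler` and `probe` are hypothetical names, and the timeout is shortened from the 10 seconds mentioned above so the demo runs quickly.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// demoTimeout stands in for the 10-second liveness timeout.
const demoTimeout = 100 * time.Millisecond

// heartbeat is a hypothetical per-service check.
type heartbeat func()

// service is a hypothetical component whose state is guarded by a mutex.
type service struct {
	mu sync.Mutex
}

// beat acquires and immediately releases the service's lock.
func (s *service) beat() {
	s.mu.Lock()
	s.mu.Unlock()
}

// livenessHandler runs all heartbeats concurrently. It replies 200 if every
// heartbeat returns before the timeout and 503 otherwise.
func livenessHandler(timeout time.Duration, beats map[string]heartbeat) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()

		done := make(chan struct{})
		go func() {
			var wg sync.WaitGroup
			for _, b := range beats {
				wg.Add(1)
				go func(b heartbeat) {
					defer wg.Done()
					b()
				}(b)
			}
			wg.Wait()
			close(done)
		}()

		select {
		case <-done:
			w.WriteHeader(http.StatusOK)
		case <-ctx.Done():
			http.Error(w, "heartbeat timed out", http.StatusServiceUnavailable)
		}
	})
}

// probe sends one in-memory request to h and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/live", nil))
	return rec.Code
}

func main() {
	agent, experiment := &service{}, &service{}
	h := livenessHandler(demoTimeout, map[string]heartbeat{
		"agent":      agent.beat,
		"experiment": experiment.beat,
	})
	fmt.Println("all locks free:", probe(h))

	experiment.mu.Lock() // simulate a deadlocked service
	fmt.Println("one lock held:", probe(h))
}
```

A single stuck service is enough to fail the whole check, which matches the goal of restarting the scheduler when any critical lock is blocked.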

This PR also introduces a new Makefile target, kind-install-scheduler, because testing changes through Ansible was taking a long time. The new target builds the scheduler Docker image, tags it with the same tag that is currently deployed in Kind, and then restarts the scheduler. To speed up the build further, I removed the step that runs the tests before building. IMHO this isn't needed and shouldn't be part of the build process, as this is what the pipeline is for.

To build and deploy the scheduler, you can now run:

make -C scheduler kind-install-scheduler

We should probably change the builds of all other Docker images to skip the tests as well, which would speed up builds and deployments and give quicker testing feedback.

Checklist

  • [ ] Added/updated unit tests
  • [ ] Added/updated documentation
  • [ ] Checked for typos in variable names, comments, etc.
  • [ ] Added licences for new files

Testing

domsolutions, Oct 30 '25 10:10