Zombie Workers with AWS Auto Scaler
Describe the bug
I am running a self-hosted ClearML server.
When using the ClearML AWS AutoScaler, I encounter "zombie workers": if the AutoScaler scales down an instance, the worker that was running on that instance does not disappear, i.e. it still shows up in the ClearML UI.
Also, the Auto Scaler still recognizes these zombie workers when getting all workers:
for worker in self.api_client.workers.get_all():
At this point, the instance that worker was running on has already been terminated for ~5-10 minutes. After ~10 minutes, the zombie workers usually start to disappear.
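For reference, the same stale entries are visible from a standalone script using the public APIClient. This is only a minimal sketch; the last_activity_time field is taken from the workers.get_all response, and the ~10 minute threshold is just what I observe, not a documented constant:

```python
from clearml.backend_api.session.client import APIClient

client = APIClient()

# workers.get_all returns every worker the server still considers registered,
# including entries whose backing EC2 instance has already been terminated.
for worker in client.workers.get_all():
    # Entries that have not reported for ~10 minutes are the "zombies"
    # described above.
    print(worker.id, getattr(worker, "last_activity_time", None))
```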
This has several adverse effects:
- If the worker's queue is empty, the Auto Scaler will continuously try to scale down the instance associated with that worker. However, that instance was terminated long ago, so there is nothing left to shut down (see the sketch after this list). You start getting output like this:
2024-02-08 13:14:24,222 - clearml.auto_scaler - INFO - up machines: defaultdict(<class 'int'>, {'m5xlarge': -23, 'g4dnxlarge_gpu': -19, 'g4dn2xlarge_gpu': -10})
- The Auto Scaler always subtracts one instance when it issues a terminate call, even though there is nothing left to terminate, which is how the counts above go negative.
- If the queue gets refilled with experiments, the Auto Scaler does not launch a new instance, because it thinks there is still a functioning worker.
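A possible mitigation I am considering, sketched below (this is not the autoscaler's actual code; the region and the mapping from worker to instance id are assumptions): cross-check each worker against EC2 before counting it as alive or issuing a terminate call.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="eu-central-1")  # assumed region


def instance_is_alive(instance_id: str) -> bool:
    """Return True only if EC2 still reports the instance as pending/running."""
    try:
        resp = ec2.describe_instances(InstanceIds=[instance_id])
    except ClientError:
        # Unknown or expired instance id -> certainly not a live worker
        return False
    return any(
        inst["State"]["Name"] in ("pending", "running")
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    )
```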
To reproduce
Honestly, not sure. I am using a self-hosted ClearML server, version: WebApp: 1.14.0-431 • Server: 1.14.0-431 • API: 2.28.
Expected behaviour
Workers should be de-registered when the AWS instance is terminated. Or, it should be possible to de-register workers via an API call.
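Regarding the second point: the REST API does list a workers.unregister endpoint (as far as I understand, this is what the agent calls on a clean shutdown), so a manual de-registration could look like the sketch below. I have not verified whether it works when called with credentials other than the agent's own, so treat the call and the worker id as assumptions:

```python
from clearml.backend_api.session.client import APIClient

client = APIClient()

# Hypothetical worker id, as displayed in the Workers & Queues page of the UI.
zombie_worker_id = "aws_m5xlarge:i-0123456789abcdef0"

# workers.unregister should remove the stale entry immediately instead of
# waiting for the ~10 minute report timeout.
client.workers.unregister(worker=zombie_worker_id)
```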
Environment
- Server type self hosted
- ClearML SDK Version 1.13.1
- ClearML Server Version WebApp: 1.14.0-431 • Server: 1.14.0-431 • API: 2.28
- Python Version 3.8
- OS Linux
Hi @ngessert, what you're seeing is the remnant of the worker that used to run on the machine - the last report has a timeout of 10 minutes (which is why it goes away after some time). The question is why that worker did not shut down correctly (a proper shutdown should unregister the worker) - perhaps a full log of the autoscaler will help in understanding this, or the system log of the related cloud machine.
The last line of the EC2 instance's system log is just this (after the instance was terminated):
[ 88.024534] cloud-init[1966]: + python -m clearml_agent --config-file /home/ec2-user/clearml.conf daemon --queue aws_m5xlarge
How does the unregister process work? Does the agent somehow recognize that the system it is running on is being shut down? Is it perhaps dependent on the OS? I am using Amazon Linux 2023.
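To make the question concrete, here is a rough illustration of the shutdown pattern as I understand it (this is not clearml-agent's actual code; the worker id and the unregister call are assumptions): the stale entry only disappears immediately if the process receives a termination signal and has time to run its cleanup before the instance is gone.

```python
import signal
import sys

from clearml.backend_api.session.client import APIClient

WORKER_ID = "aws_m5xlarge:i-0123456789abcdef0"  # assumed id for illustration

client = APIClient()


def unregister_and_exit(signum, frame):
    # If the instance is terminated hard (no graceful OS shutdown, or the
    # process is SIGKILLed), this handler never runs and the server only
    # notices the worker is gone via the report timeout.
    client.workers.unregister(worker=WORKER_ID)
    sys.exit(0)


signal.signal(signal.SIGTERM, unregister_and_exit)
signal.signal(signal.SIGINT, unregister_and_exit)

# ... the agent's main loop would run here; for the sketch, just wait.
signal.pause()  # POSIX only
```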
@jkhenning any suggestion on where I could start looking, from a code point of view, for what is going on with worker shutdown/de-registration? (The logs don't show anything.)