
Zombie Workers with AWS Auto Scaler

Open · ngessert opened this issue on Feb 08 '24 · 3 comments

Describe the bug

I am running a self-hosted ClearML server.

When using the ClearML AWS AutoScaler, I encounter "zombie workers": if the AutoScaler scales down an instance, the worker that was running on that instance does not disappear, i.e. it still shows up in the ClearML UI. Example:

[screenshot: the terminated instance's worker still listed in the ClearML UI]

Also, the Auto Scaler still sees these zombie workers when listing all workers via `for worker in self.api_client.workers.get_all():` (see the sketch below).
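
For reference, this is roughly how I am listing the workers - a minimal sketch using the ClearML `APIClient`; that the entries expose `id` and `last_activity_time` is my assumption:

```python
from clearml.backend_api.session.client import APIClient

client = APIClient()  # reads credentials from clearml.conf
for worker in client.workers.get_all():
    # workers on terminated instances keep showing up here for ~10 minutes
    print(worker.id, worker.last_activity_time)
```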

At that point, the instance the worker was running on has already been terminated for ~5-10 minutes. After ~10 minutes, the zombie workers usually start to disappear.

This has several adverse effects:

  • If the worker's queue is empty, the Auto Scaler continuously tries to scale down the instance associated with that worker, even though that instance was terminated long ago. You start getting responses like this: `2024-02-08 13:14:24,222 - clearml.auto_scaler - INFO - up machines: defaultdict(<class 'int'>, {'m5xlarge': -23, 'g4dnxlarge_gpu': -19, 'g4dn2xlarge_gpu': -10})` - the Auto Scaler always subtracts one instance when it makes a terminate call, but there is nothing left to terminate (see the sketch after this list).
  • If the queue gets refilled with experiments, the Auto Scaler does not launch new instances, because it thinks there is still a functioning worker.
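
To illustrate the accounting in the first point - this is only a sketch of the observed behavior, not the actual autoscaler code:

```python
from collections import defaultdict

# Sketch of the observed behavior: every terminate call decrements a
# per-type counter, so repeatedly "terminating" an instance that is
# already gone drives the count negative.
up_machines = defaultdict(int)

def scale_down(instance_type: str) -> None:
    # nothing checks whether the instance still exists
    up_machines[instance_type] -= 1

for _ in range(23):
    scale_down("m5xlarge")

print(up_machines)  # defaultdict(<class 'int'>, {'m5xlarge': -23})
```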

To reproduce

Honestly, I am not sure how to reproduce this reliably. I am using a self-hosted ClearML server: WebApp: 1.14.0-431 • Server: 1.14.0-431 • API: 2.28.

Expected behaviour

Workers should be de-registered when the AWS instance is terminated. Alternatively, it should be possible to de-register workers via an API call (see the sketch below).
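
For example, something along these lines - a sketch only; whether `workers.unregister` is meant to be called this way, and the worker id format, are my assumptions:

```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
# "aws_m5xlarge:i-0abc123" is a made-up worker id for illustration
client.workers.unregister(worker="aws_m5xlarge:i-0abc123")
```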

Environment

  • Server type: self-hosted
  • ClearML SDK Version: 1.13.1
  • ClearML Server Version: WebApp: 1.14.0-431 • Server: 1.14.0-431 • API: 2.28
  • Python Version: 3.8
  • OS: Linux


ngessert · Feb 08 '24 13:02

Hi @ngessert, what you're seeing is the remnant of the worker that used to run on the machine - the last report has a timeout of 10 minutes (which is why the entry goes away after some time). The question is why that worker did not shut down correctly (a proper shutdown should unregister the worker) - perhaps a full log of the autoscaler, or the system log of the related cloud machine, would help in understanding this.
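
In the meantime, stale entries can be filtered out client-side when polling - a minimal sketch, assuming `last_activity_time` is exposed as an ISO-8601 string or datetime:

```python
from datetime import datetime, timedelta, timezone

from clearml.backend_api.session.client import APIClient

STALE_AFTER = timedelta(minutes=10)  # matches the server-side report timeout

def live_workers(client):
    """Yield only workers that reported recently enough to be trusted."""
    now = datetime.now(timezone.utc)
    for worker in client.workers.get_all():
        last = worker.last_activity_time  # assumption: ISO-8601 string or datetime
        if isinstance(last, str):
            # Python 3.8's fromisoformat() does not accept the 'Z' suffix
            last = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if last.tzinfo is None:
            last = last.replace(tzinfo=timezone.utc)
        if now - last < STALE_AFTER:
            yield worker

for worker in live_workers(APIClient()):
    print(worker.id)
```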

jkhenning · Feb 08 '24 14:02

The last line of the EC2 instance's system log (after the instance was terminated) is just this: `[ 88.024534] cloud-init[1966]: + python -m clearml_agent --config-file /home/ec2-user/clearml.conf daemon --queue aws_m5xlarge`

How does the unregister process work? Does the agent somehow recognize that the system it is running on is being shut down? Is that perhaps OS-dependent? I am using Amazon Linux 2023.
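
As a possible workaround, I was considering trapping the termination signal myself and unregistering explicitly - a rough sketch only; the worker id and the assumption that a SIGTERM is delivered at all during an EC2 terminate are both guesses:

```python
import signal
import sys

from clearml.backend_api.session.client import APIClient

WORKER_ID = "aws_m5xlarge:i-0abc123"  # made-up id; would be the agent's registered id

def unregister_on_term(signum, frame):
    # best effort: tell the server the worker is gone before the OS kills us
    APIClient().workers.unregister(worker=WORKER_ID)
    sys.exit(0)

# only helps if the shutdown sequence actually delivers SIGTERM and gives the
# process time to handle it, which may not happen on an abrupt EC2 terminate
signal.signal(signal.SIGTERM, unregister_on_term)
signal.pause()  # keep the process alive; in practice the agent loop runs here
```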

ngessert · Feb 09 '24 10:02

@jkhenning any suggestion where I could start looking, code-wise, for what is going on with worker shutdown/de-registration? (The logs don't show anything.)

ngessert · Feb 27 '24 13:02