pangeo-cloud-federation

zombie clusters

Open chiaral opened this issue 4 years ago • 14 comments

I have noticed for a while now that when my computations don't work (e.g. a worker dies), oftentimes, regardless of my best efforts to kill a cluster, I keep seeing zombie clusters on my machine.

For example:

  1. I do cluster.close()
  2. I restart the kernel
  3. I run in another notebook:
from dask_gateway import Gateway
g = Gateway()
g.list_clusters()

and the cluster is still there. Usually I try to scale it down to 0, so that at least I am not using any resources, but the cluster itself stays there.

Today I kept having issues with my clusters - probably due to what I was trying to do - and I ended up with 4 zombie clusters (which I managed to scale down to 0 by connecting to each of them through cluster = g.connect(g.list_clusters()[i].name)), so I decided to restart my server entirely.
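
For reference, a rough sketch of that cleanup loop, using only the Gateway calls shown above (scaling to 0 stops the workers, but as described below the scheduler pod sticks around):

from dask_gateway import Gateway

g = Gateway()
for report in g.list_clusters():       # every cluster the gateway still knows about
    cluster = g.connect(report.name)   # attach to the lingering cluster
    cluster.scale(0)                   # drop all workers; the cluster entry itself remains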

I went to my home page, pressed stop server, and restarted it. On the new server I could still list the zombie clusters with g.list_clusters(). They all have 0 workers and cores, but they are still there, and I think they can still consume memory just by existing.

After a while - I guess after whatever timeout limit is in place - they disappear.

chiaral avatar Jan 14 '21 20:01 chiaral

Can anyone help me with this or tell me how to completely kill these zombie clusters?

chiaral avatar Jan 28 '21 02:01 chiaral

Small note: I was using cluster.adapt() a lot, and now that I use cluster.scale() things seem to be more stable. Maybe the adaptive deployment was particularly unstable, creating all of these issues. I will monitor and see if these instances of persistent zombie clusters diminish now.

chiaral avatar Jan 28 '21 03:01 chiaral

Like - this cluster [ClusterReport<name=prod.9b358575a5c24bf68fb4569dda44c14e, status=RUNNING>] has been hanging out on my server for 2 days. I scaled it down to 0 workers - but is there a way to kill it? This is a bit worrisome, because other people might not be as careful as I am in keeping clusters under control, and we might end up with an army of zombie clusters eating up space and money.

chiaral avatar Jan 30 '21 20:01 chiaral

There is a lingering scheduler pod, so these clusters are consuming some resources:

distributed.core - INFO - Removing comms to tls://10.39.51.3:34977
distributed.scheduler - INFO - Closing worker tls://10.39.19.4:42565
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.39.19.4:42565', name: dask-worker-9b358575a5c24bf68fb4569dda44c14e-4dfvp, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.39.19.4:42565
distributed.scheduler - INFO - Lost all workers

Just to confirm, does doing this do anything?

cluster = g.connect(cluster_name)   # from g.list_clusters()
cluster.close()

There is an idle timeout of 1800 seconds, but that's apparently not kicking in.
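
(For context, that 1800 seconds is a setting on the dask-gateway server side. A minimal sketch of how such a timeout is typically configured, assuming the standard ClusterConfig.idle_timeout option - the file name and value here are illustrative, not necessarily what this deployment uses:)

# dask_gateway_config.py - illustrative sketch, not the actual deployment config
c.ClusterConfig.idle_timeout = 1800   # seconds of cluster inactivity before automatic shutdown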

TomAugspurger avatar Jan 30 '21 22:01 TomAugspurger

It does not work - interestingly though, I can do

cluster = g.connect(cluster_name)   # from g.list_clusters()
cluster.scale(2)

and it will scale! Which is the really confusing thing!

And that cluster has been around for way longer than 1800 seconds - more like 2 days! Does the 1800 seconds start from when I log out / close my server, or is it independent of that?

thanks so much Tom - nice to see you here :)

chiaral avatar Jan 31 '21 00:01 chiaral

Weird... So you can connect to the cluster? What if you do...

cluster = g.connect(cluster_name)
client = cluster.get_client()
client.shutdown()

In theory that's supposed to tell the scheduler and workers to shut down too.

And that cluster has been around for way longer than 1800 seconds - more like 2 days! is 1800 seconds starting from when I log out/close my server?

So that's 1800 seconds of "inactivity". I was poking around to see what that is: https://github.com/dask/distributed/blob/08ea96890674d48b90f4e1f92959957e5e362a18/distributed/scheduler.py#L6362-L6379. Basically, it checks whether any of the workers have tasks they're working on. It'd be useful to surface this through the dashboard, but I don't think it is.

It is independent of when you log out / close your server. Normally, the lifetime of the Dask cluster is tied to the lifetime of your Python kernel. We register some functions to run when the kernel is shut down telling the cluster to shut down. But if the kernel doesn't exit cleanly then those functions may not run. That's what the idle timeout is for, but it's apparently not working in this case.
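
(If you want to be defensive about unclean kernel exits, a minimal sketch of doing the cleanup explicitly at the end of a notebook, using only the calls already shown in this thread:)

# Explicit cleanup at the end of a session, so nothing depends on the kernel
# exiting cleanly (a sketch; client/cluster are whatever was created above).
try:
    client.close()    # disconnect this notebook from the scheduler
finally:
    cluster.close()   # ask the gateway to shut down scheduler and workers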

If you're able to, it'd be interesting to check the state of the scheduler. Maybe something like

client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)

and compare that to time.time(). I see that the code references "unrunnable" tasks. I wonder if that could be the culprit.
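
Something like this sketch, building on the snippet above:

import time

# How long ago the scheduler last considered itself busy (in seconds).
idle_since = client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)
print(time.time() - idle_since)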

thanks so much Tom - nice to see you here :)

I'm always following along :)

TomAugspurger avatar Jan 31 '21 02:01 TomAugspurger

Ok - lol - it is still there!

client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)

gave me 1611770554.0601828

which sounds about right 🤣

client = cluster.get_client()
client.shutdown()

It's hanging... I will report back if it gets anywhere. I also have a tangential question.

What is the difference between

from dask.distributed import Client
client = Client(cluster)

and

client = cluster.get_client()

I think both connect the cluster to my notebook, but in a different way. Also, as of today, when I do

from dask.distributed import Client

I get this warning - it did not happen on Friday:

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py:1136: VersionMismatchWarning: Mismatched versions found

+-------------+-----------+-----------+---------+
| Package     | client    | scheduler | workers |
+-------------+-----------+-----------+---------+
| blosc       | 1.10.2    | 1.9.2     | None    |
| dask        | 2021.01.1 | 2.30.0    | None    |
| distributed | 2021.01.1 | 2.30.1    | None    |
| lz4         | 3.1.3     | 3.1.1     | None    |
| msgpack     | 1.0.2     | 1.0.0     | None    |
+-------------+-----------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

chiaral avatar Feb 01 '21 17:02 chiaral

what is the difference between...

Those should be about the same. I think there's an issue somewhere about standardizing them (dask-gateway used to require cluster.get_client(), but now it's not required).
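
As far as I know they end up attached to the same scheduler; roughly (a sketch, with the address comparison purely for illustration):

from dask.distributed import Client

# `cluster` is any dask-gateway cluster object, e.g. from g.connect(...)
client_a = Client(cluster)          # explicit Client constructor
client_b = cluster.get_client()     # convenience method on the cluster object
# Both should point at the same scheduler (illustrative check):
assert client_a.scheduler.address == client_b.scheduler.address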

I think the warning is safe to ignore... It's only happening right now because the zombie cluster was created a while ago and we updated the image in the meantime. Your client is on the new image and your zombie cluster is on the old one.

TomAugspurger avatar Feb 01 '21 17:02 TomAugspurger

It's been 24 min and client.shutdown() is still hanging, so I am not sure that will work. I am happy to let it keep going and see if it eventually shuts things down, but it doesn't look promising.

chiaral avatar Feb 01 '21 17:02 chiaral

Still hanging after 2 hours; I would say that client.shutdown() doesn't work. Feel free to kill it!

chiaral avatar Feb 01 '21 19:02 chiaral

OK, I killed it.

TomAugspurger avatar Feb 02 '21 00:02 TomAugspurger

Not sure if this is still coming up, but I've noticed on the AWS hub that in a clean session I often see a pending cluster still around:

from dask_gateway import Gateway
gateway = Gateway()
gateway.list_clusters()

[ClusterReport<name=icesat2-staging.45dc9798954c48fca2b3580b6a6104ea, status=PENDING>]

The following command gets rid of it: gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')
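
(And if several are hanging around, a small sketch that stops everything the gateway still lists - use with care, since it kills every cluster registered to you, not just zombies:)

from dask_gateway import Gateway

gateway = Gateway()
for report in gateway.list_clusters():    # every cluster still registered, zombie or not
    gateway.stop_cluster(report.name)     # ask the gateway to tear it down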

scottyhq avatar Mar 03 '21 03:03 scottyhq

The following command gets rid of it: gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')

Thanks @scottyhq !!! Next time I will try this. If it consistently works we should add it to the notes. I am keeping the issue open because I plan to add some text about this issue in the documentation... at some point!

chiaral avatar Mar 05 '21 15:03 chiaral

That command killed a zombie cluster right away! Great!

chiaral avatar Mar 10 '21 22:03 chiaral