pangeo-cloud-federation
zombie clusters
I have noticed for a while now that when my computations don't work (i.e. a worker dies), often, regardless of my best efforts to kill a cluster, I keep seeing zombie clusters on my machine.
For example:
- I do
cluster.close()
- I restart the kernel
- I run in another notebook:
from dask_gateway import Gateway
g = Gateway()
g.list_clusters()
and the cluster is still there. Usually I try to scale it down to 0, so at least I am not using anything, but the cluster stays there.
Today I kept having issues with my clusters - probably due to what I am trying to do - and I ended up with 4 zombie clusters (which I managed to scale down to 0 by connecting to each of them through cluster = g.connect(g.list_clusters()[i].name); see the loop sketched below), so I decided to restart my server entirely.
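For what it's worth, the clean-up loop I ran was roughly along these lines (just the same Gateway calls as above, iterated over every listed cluster):
from dask_gateway import Gateway

g = Gateway()
# connect to each lingering cluster and scale it down to zero workers
for report in g.list_clusters():
    cluster = g.connect(report.name)
    cluster.scale(0)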
I went on my home page, pressed stop server, and restarted it.
And on the new server I could still list the zombie clusters with
g.list_clusters()
They all have 0 workers and cores, but they are still there, and I think they can still consume memory just by existing.
After a while - I guess after whatever timeout limit is in place - they disappear.
Can anyone help me with this or tell me how to completely kill these zombie clusters?
Small note: I was using cluster.adapt() a lot, and now that I use scale things seem to be more stable. Maybe the adapt deployment was particularly unstable and was creating all of these issues. I will monitor and see if these instances of persistent zombie clusters diminish now.
Like - this cluster [ClusterReport<name=prod.9b358575a5c24bf68fb4569dda44c14e, status=RUNNING>] has been hanging out in my server for 2 days. I scaled it down to 0 workers - but is there a way to kill it? This is a bit worrisome, because other people might not be as careful as I am about keeping clusters under control, and we might end up with an army of zombie clusters eating up space and money.
There is a lingering scheduler pod, so they are consuming some resources:
distributed.core - INFO - Removing comms to tls://10.39.51.3:34977
distributed.scheduler - INFO - Closing worker tls://10.39.19.4:42565
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.39.19.4:42565', name: dask-worker-9b358575a5c24bf68fb4569dda44c14e-4dfvp, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.39.19.4:42565
distributed.scheduler - INFO - Lost all workers
Just to confirm, does doing this do anything?
cluster = g.connect(cluster_name) # from g.list_clusters()
cluster.close()
There is an idle timeout of 1800 seconds, but that's apparently not kicking in.
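For reference, that timeout is configured on the gateway side. A rough sketch of what the setting looks like in the gateway's server config - the exact traitlet name depends on the backend (e.g. KubeClusterConfig on Kubernetes), so treat this as an approximation, not our hub's actual settings:
# sketch of a dask-gateway server config snippet
c.KubeClusterConfig.idle_timeout = 1800  # seconds of scheduler inactivity before shutdown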
It does not work - interestingly, though, I can do
cluster = g.connect(cluster_name) # from g.list_clusters()
cluster.scale(2)
and it will scale, which is the very confusing thing!
And that cluster has been around for way longer than 1800 seconds - more like 2 days! Does the 1800 seconds start counting from when I log out / close my server, or independently of that?
thanks so much Tom - nice to see you here :)
Weird... So you can connect to the cluster? What if you do...
cluster = g.connect(cluster_name)
client = cluster.get_client()
client.shutdown()
In theory that's supposed to tell the scheduler and workers to shut down too.
And that cluster has been around for way longer than 1800 seconds - more like 2 days! Does the 1800 seconds start counting from when I log out / close my server?
So that's 1800 seconds of "inactivity". I was poking around to see what that is: https://github.com/dask/distributed/blob/08ea96890674d48b90f4e1f92959957e5e362a18/distributed/scheduler.py#L6362-L6379. Basically, checks if any of the workers have tasks they're working on. It'd be useful to surface this through the dashboard, but I don't think it is.
It is independent of when you log out / close your server. Normally, the lifetime of the Dask cluster is tied to the lifetime of your Python kernel. We register some functions to run when the kernel is shut down telling the cluster to shut down. But if the kernel doesn't exit cleanly then those functions may not run. That's what the idle timeout is for, but it's apparently not working in this case.
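Very roughly, the pattern is something like the following - this is just an illustration of the general idea, not dask-gateway's actual code:
import atexit

def _close_cluster():
    # tell the gateway to tear the cluster down when the Python process exits
    cluster.close()

atexit.register(_close_cluster)  # never runs if the kernel dies or is killed uncleanly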
If you're able to, it'd be interesting to check the state of the scheduler. Maybe something like
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)
and compare that to time.time(). I see that that code references "unrunnable" tasks. I wonder if that would be the culprit.
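Putting those two pieces together, something like this (assuming client is connected to the zombie cluster):
import time

# timestamp of when the scheduler last became idle; None means it doesn't consider itself idle
idle_since = client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)
if idle_since is not None:
    print(f"scheduler idle for {time.time() - idle_since:.0f} seconds")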
thanks so much Tom - nice to see you here :)
I'm always following along :)
Ok - lol - it is still there!
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)
gave me
1611770554.0601828
which sounds about right 🤣
client = cluster.get_client()
client.shutdown()
It's hanging... I will report back if it gets anywhere. I also have a tangential question.
What is the difference between
from dask.distributed import Client
client = Client(cluster)
and
client = cluster.get_client()
I think both connect the cluster to my notebook, but in a different way. Also, since today, when I do
from dask.distributed import Client
I get this warning - it did not happen on Friday:
/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py:1136: VersionMismatchWarning: Mismatched versions found
+-------------+-----------+-----------+---------+
| Package | client | scheduler | workers |
+-------------+-----------+-----------+---------+
| blosc | 1.10.2 | 1.9.2 | None |
| dask | 2021.01.1 | 2.30.0 | None |
| distributed | 2021.01.1 | 2.30.1 | None |
| lz4 | 3.1.3 | 3.1.1 | None |
| msgpack | 1.0.2 | 1.0.0 | None |
+-------------+-----------+-----------+---------+
Notes:
- msgpack: Variation is ok, as long as everything is above 0.6
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
what is the difference between...
Those should be about the same. I think there's an issue somewhere about standardizing them (dask-gateway used to require cluster.get_client(), but now it's not required).
I think the warning is safe to ignore... It's only happening right now because the zombie cluster was created a while ago and we updated the image in the meantime. Your client is on the new image and your zombie cluster is on the old one.
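If you want to double-check what's installed on each side, I think something like this works (the nested dict layout may vary a bit between distributed versions):
# compare the package versions reported by the client, scheduler, and workers
versions = client.get_versions(check=False)
print(versions["client"]["packages"]["distributed"])
print(versions["scheduler"]["packages"]["distributed"])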
It's been 24 min and client.shutdown() is still hanging, so I am not sure that will work. I am happy to let it keep going and see if it eventually shuts the cluster down, but I don't think it's promising.
Still hanging after 2 hours, so I would say that client.shutdown() doesn't work. Feel free to kill it!
OK, I killed it.
Not sure if this is still coming up, but I've noticed on the AWS hub that in a clean session I often see a pending cluster still around:
from dask_gateway import Gateway
gateway = Gateway()
gateway.list_clusters()
[ClusterReport<name=icesat2-staging.45dc9798954c48fca2b3580b6a6104ea, status=PENDING>]
The following command gets rid of it:
gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')
The following command gets rid of it:
gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')
Thanks @scottyhq !!! Next time I will try this. If it consistently works, we should add it to the notes. I am keeping the issue open because I plan to add some text about this issue to the documentation... at some point!
That command killed a zombie cluster right away! Great!
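For anyone else hitting this, a minimal sweep along the same lines - it force-stops every cluster the gateway still lists, so only run it if you really want them all gone:
from dask_gateway import Gateway

gateway = Gateway()
# stop every cluster the gateway still knows about
for report in gateway.list_clusters():
    gateway.stop_cluster(report.name)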