Repeated calls to `memory_color` take around 12% of the scheduler's CPU time
Describe the issue:
Some follow-up to: https://github.com/dask/distributed/issues/8761
After the above issue was already fixed in https://github.com/dask/distributed/pull/8762, the next big consumer of CPU time on a scheduler with lots of workers (>2000) is the calls to _cluster_memory_color, more specifically _memory_color.
https://github.com/dask/distributed/blob/782050a3a4cf2abd450caa8adfaa912c22829e78/distributed/dashboard/components/scheduler.py#L391
As far as I can see, this is about coloring the memory bar of a specific worker depending on whether it's deemed "good", "almost full" or "full".
Again, some speedscope output (this was captured without the fix from PR 8762):
Is this something that could be solved by binning the memory load & size (surely the coloring doesn't have to be so exact that it has to be based on the exact number of bytes) and caching the result of this memory coloring too?
Surely one doesn't have to recalculate hundreds of times per second which color a worker process with, for example, 1024/4096 MiB RAM should get, especially since the coloring result doesn't change at all.
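Roughly what I have in mind (just a sketch; the thresholds, colors, and helper names below are made up for illustration and are not the actual `_memory_color` logic):

```python
from functools import lru_cache

# Illustrative thresholds/colors only -- the real ones live in
# distributed/dashboard/components/scheduler.py.
_BUCKETS = 100  # 1% resolution should be plenty for coloring a bar


@lru_cache(maxsize=_BUCKETS + 1)
def _color_for_bucket(bucket: int) -> str:
    frac = bucket / _BUCKETS
    if frac >= 0.95:
        return "red"      # "full"
    if frac >= 0.70:
        return "orange"   # "almost full"
    return "blue"         # "good"


def memory_color(nbytes: int, limit: int) -> str:
    """Quantize the memory fraction so repeated calls with nearly
    identical byte counts become cheap cache hits."""
    if limit <= 0:
        return "blue"
    bucket = min(_BUCKETS, int(nbytes / limit * _BUCKETS))
    return _color_for_bucket(bucket)
```

With 2000+ workers all hitting the same ~100 buckets, almost every call would be a cache lookup instead of a recomputation.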
Environment:
- Dask version: 2024.7.0
- Python version: 3.10
- Operating System: Linux, Debian
- Install method (conda, pip, source): poetry/pip
Ah, so what takes time here is asking the scheduler for the current memory situation across all workers. The same happens in the next biggest call, which takes around 7.0% of CPU time: https://github.com/dask/distributed/blob/782050a3a4cf2abd450caa8adfaa912c22829e78/distributed/dashboard/components/scheduler.py#L413
So there are at least two separate calls that end up measuring the total cluster memory situation, is that correct? Is that something one could cache at least per tick, or even compute only once every 10 ticks or so?
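For example, something like the following could cap how often the aggregation runs (a hypothetical helper, not existing distributed API; the names `cached_for` and `cluster_memory_totals` are invented, and the real aggregation would go where the placeholder is):

```python
import time
from functools import wraps


def cached_for(seconds: float):
    """Cache a zero-argument function's result for a fixed time window
    (hypothetical helper, not part of distributed)."""
    def decorator(func):
        state = {"expires": 0.0, "value": None}

        @wraps(func)
        def wrapper():
            now = time.monotonic()
            if now >= state["expires"]:
                state["value"] = func()
                state["expires"] = now + seconds
            return state["value"]

        return wrapper
    return decorator


# Hypothetical usage: aggregate cluster-wide memory at most once per second,
# no matter how many widgets/callbacks ask for it during a dashboard update.
@cached_for(1.0)
def cluster_memory_totals():
    ...  # placeholder for the real aggregation over the scheduler's workers
```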
Other tangential question: Can the tick rate of the dashboard be adjusted somehow? I would be totally fine with an update every 10s or so, I'm mostly using internal Grafana dashboards fueled from the Prometheus metrics anyways. On top of that, we have very long running tasks, so frequent updates are not even helping.
There is --no-show for entirely disabling the dashboard, but this parameter is apparently unused :(
https://github.com/dask/distributed/blob/782050a3a4cf2abd450caa8adfaa912c22829e78/distributed/cli/dask_scheduler.py#L127
Using --no-dashboard will disable metrics too, which is undesirable :/ We'd want Prometheus metrics that we can look at asynchronously.
> Other tangential question: Can the tick rate of the dashboard be adjusted somehow? I would be totally fine with an update every 10s or so, I'm mostly using internal Grafana dashboards fueled from the Prometheus metrics anyways. On top of that, we have very long running tasks, so frequent updates are not even helping.
This is not currently exposed. It's also a bit messy since the update interval for every widget is controlled individually; see, for example, https://github.com/dask/distributed/blob/8564dc79a1b9902eb0320d51b034e0623a2afe8b/distributed/dashboard/scheduler.py#L74-L116
However, some docs hard-code this, e.g. system_monitor here: https://github.com/dask/distributed/blob/8564dc79a1b9902eb0320d51b034e0623a2afe8b/distributed/dashboard/components/scheduler.py#L4543-L4551
> Using --no-dashboard will disable metrics too, which is undesirable :/ We'd want Prometheus metrics that we can look at asynchronously.
No, metrics are still available. At least they are when I try. You can check by opening http://localhost:<dashboard_port>/metrics
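For example (assuming the default scheduler HTTP port 8787; this is just a quick check, curl works equally well):

```python
# Quick check that the Prometheus endpoint responds even with --no-dashboard;
# 8787 is the default scheduler dashboard/HTTP port, adjust to your setup.
from urllib.request import urlopen

with urlopen("http://localhost:8787/metrics") as resp:
    text = resp.read().decode()

print("\n".join(text.splitlines()[:10]))  # show the first few exported lines
```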
> This is not currently exposed. It's also a bit messy since the update interval for every widget is controlled individually
Hm, it would be nice if that could be controlled somehow, but yeah, one would have to spend a bit of time here to make all of these values configurable, right? Maybe one could introduce a single configurable factor that all of these update intervals are multiplied/divided by?
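Roughly what I'm imagining (the config key and `scaled` helper below are invented for illustration, not an existing distributed option):

```python
import dask.config

# Hypothetical: "distributed.dashboard.update-interval-factor" is an invented
# config key, not something distributed currently reads.
FACTOR = float(
    dask.config.get("distributed.dashboard.update-interval-factor", default=1.0)
)


def scaled(interval_ms: int) -> int:
    """Scale a widget's hard-coded update interval by one global factor."""
    return max(1, int(interval_ms * FACTOR))


# At registration time, e.g. instead of
#     doc.add_periodic_callback(update, 100)
# one would write
#     doc.add_periodic_callback(update, scaled(100))
```

That way the per-widget intervals could stay as they are, and users who don't need frequent updates could just set the factor to 100 or so.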
> No, metrics are still available.
Ah, sorry! I just started a pipeline with --no-dashboard, and the metrics indeed seem to be there. I recalled this from the worker dashboards, where it seemed the metrics only became available once the worker dashboards were enabled.
I can keep you updated on what scheduler health looks like and where it spends most of its compute as soon as I start a really large pipeline again.
Hey @fjetter! We tried a bunch of even larger-scale workloads (>3k workers), optimized our stack a bit by not submitting too many tasks at once, disabled the dashboard (--no-dashboard), and got the clusters to be quite stable :)
With these measures, we saw scheduler tick durations of ~3-30ms and scheduler CPU load of ~5-35%, with the latter more often on the lower side.
That is quite okay for us right now. I have to say, though, that enabling the dashboard and actually using it/having it open in a browser leads to the scheduler being pegged at 100% almost immediately, and the dashboard quite often losing its connection/not updating anymore.
EDIT: Nevertheless, I'd still be interested in either slowing down dashboard updates via a config option or improving the performance of the compute-heavy parts of the scheduler. Let me know if you know a good place to start improving something!