
excessive CPU time spent on gc (even after manually adjusting gc thresholds)

Open olly-writes-code opened this issue 6 years ago • 23 comments

During the process I'm running, I very quickly get this warning (for pretty much every worker):

distributed.utils_perf - WARNING - full garbage collections took 36% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)

However, the percentage of memory used by each worker is low:

[screenshot: per-worker memory usage]

I have plenty of memory to work with, so my question is: is it possible that Dask is doing GC aggressively? If so, is it possible to change the GC threshold / collect less aggressively?

I've tried to manually adjust the GC thresholds, but this seems to have no effect:

import gc

g0, g1, g2 = gc.get_threshold()
gc.set_threshold(g0 * 10, g1 * 10, g2 * 10)  # raise each generation's collection threshold 10x
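
If the adjustment needs to happen inside each worker process (a change made only in the client process won't be seen by the workers), a rough sketch - just an assumption on my part, not something I've verified - would be to push it there with client.run:

import gc
from dask.distributed import Client

client = Client()  # or connect to the existing scheduler

def relax_gc_thresholds(factor=10):
    # runs inside each worker process
    g0, g1, g2 = gc.get_threshold()
    gc.set_threshold(g0 * factor, g1 * factor, g2 * factor)
    return gc.get_threshold()

client.run(relax_gc_thresholds)  # apply on every worker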

(using distributed 1.28.1 and dask 1.2.2)

olly-writes-code avatar Jun 25 '19 04:06 olly-writes-code

I'm curious, what kind of workload are you running? That's a large amount of time spent in garbage collection.

If so is it possible to change the gc threshold/collect less aggressively?

I personally don't know much about how we handle GC, but you might want to look through the following file, which I think includes the code that's active here.

https://github.com/dask/distributed/blob/master/distributed/utils_perf.py

mrocklin avatar Jun 25 '19 06:06 mrocklin

@mrocklin - sorry for the slow response. I'm running lots of parallel simulations. Each task loops through a large array of objects, adjusting the objects and storing the simulation state. At the end of the simulation, an overall metric is stored and the large dataset is discarded (and presumably garbage collected).

olly-writes-code avatar Aug 15 '19 03:08 olly-writes-code

In "distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)", is the threshold the warning threshold or the threshold at which GC occurs?

olly-writes-code avatar Aug 15 '19 03:08 olly-writes-code

The warning threshold. Python's GC runs fairly frequently.

mrocklin avatar Aug 15 '19 12:08 mrocklin

I am experiencing the same on my workstation:

distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
[... the same warning repeated dozens of times ...]
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://127.0.0.1:58788 remote=tcp://127.0.0.1:33619>
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
[... and so on ...]

It's been at least 20 minutes of printing these warnings, and no computation has been done as far as I can tell.

Edit 1: After about 30 min it started doing stuff:

[screenshot of the dashboard]

Edit 2: I forgot to mention that I am creating a list of delayed objects that are then passed to dask.persist. See here.
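
Roughly, the pattern looks like this sketch (the function and variable names here are placeholders, not my actual code):

import dask
from dask import delayed

# build a list of lazy tasks, then materialize them on the cluster
tasks = [delayed(process)(item) for item in items]  # placeholder names
results = dask.persist(*tasks)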

muammar avatar Sep 28 '19 01:09 muammar

You have lots of tasks relative to how many workers you have. You might want to look through https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs

mrocklin avatar Sep 28 '19 12:09 mrocklin

It's also worth noting that this warning message

distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)

is most often (but not exclusively) the fault of the code that you're running, and not anything to do with Dask. Dask is just in a nice position to let you know if things like that are going on.

mrocklin avatar Sep 28 '19 12:09 mrocklin

Hi,

I am starting out with dask on my laptop, going through the dask-tutorial on GitHub, but I have a similar issue with "distributed.utils_perf - WARNING - full garbage collections took xxx% CPU time recently (threshold: 10%)" whenever I start a local cluster with the distributed scheduler.

  • It occurs every time I start a local cluster, even when I submit simple calculations
  • Different settings when starting the cluster, such as n_workers, memory_limit, threads_per_worker, and processes, do not seem to have any effect
  • I use a conda environment with Python 3.7, on a laptop with 4 cores, 8 logical processors and 16 GB RAM
  • I tried it on Windows and on another machine with Linux; no difference
  • The % CPU time in the warning slowly climbs up, and I get the warning almost every second
  • client.restart() has no effect on this; the warnings show up again immediately
  • It occurs only with the distributed scheduler
  • Once these warnings start to show up, if I shut the cluster down with client.shutdown() and start it again without restarting the IPython kernel, the warnings reappear without my submitting any calculation

Is there any progress on this please, or any way to solve it?

jklen avatar Dec 28 '20 13:12 jklen

@jklen Which version of dask are you using? Is it 2020.12.0? If so, could you please downgrade to 2.30.0?

I was encountering this issue after upgrading to the latest version (2020.12.0), but after downgrading, these warning messages didn't appear. But, of course, I am not sure.

arnabbiswas1 avatar Jan 11 '21 05:01 arnabbiswas1

Hi, I have a similar problem - I use a cluster with daskdev/dask:2.30.0, 1 scheduler and 3 workers (deployed via helm chart). The cluster is currently idle - just some test tasks from time to time. Within a few days the scheduler starts using a lot of resources (on an idle cluster...) and starts spamming warnings:

distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.38s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.11s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)

I also use web checks for the dashboard - they too start to fail a few days after the scheduler starts.

scheduler:
  image:
    repository: "daskdev/dask"
    tag: 2.30.0
  resources:
    limits:
      cpu: 1.8
      memory: 6G
    requests:
      cpu: 1.8
      memory: 6G

Scheduler utilization in idle state [screenshot]. Scheduler CPU usage graph [screenshot]. Scheduler memory usage graph [screenshot].

andrii29 avatar Jan 12 '21 07:01 andrii29

Hi, just to add, I'm getting the same warnings while repartitioning a parquet file. I suspect it has more to do with the computation being very quick, so garbage collection simply ends up being the more burdensome task. Maybe the solution is to only warn when GC takes a large fraction of a long task? That seems to be the idea behind the first comment in _gc_callback in distributed.utils_perf, which emits this warning:

    def _gc_callback(self, phase, info):
        # Young generations are small and collected very often,
        # don't waste time measuring them
        if info["generation"] != 2:
            return

For reference, my code is:

from dask.distributed import Client
client = Client()
# outputs: <Client: 'tcp://127.0.0.1:45451' processes=7 threads=28, memory=48.32 GB>
import dask.dataframe as dataframe
df = dataframe.read_parquet('data_simulated_partitioned.parquet')
df.npartitions
# 3941
df = df.repartition(partition_size='100MB')
# This is where hundreds of warnings arise
df.npartitions
# 137
df = df.persist()
# Again, hundreds of warnings

benjaminvatterj avatar Jan 12 '21 16:01 benjaminvatterj

I get the same messages, and I don't think it could be the fault of my code, as it happens when just using .unique() on a dask.dataframe.core.Series object. All of a sudden I'm continually bombarded with this message, and over time the CPU time percentage slowly trickles up.

my_df["my_column"].unique().compute()

gives this output:

full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)

The task doesn't seem difficult, as it completes in a matter of seconds.

mjspeck avatar Apr 30 '21 15:04 mjspeck

I have hit the same problem while processing large sets of data. In my case, the warning did not affect my output.

I somehow got around it by allocating more workers to the job (it used to be only one worker).

I am not sure about the implications and consequences (I also haven't tested the speed before and after).

Hope that helps.

RayAtUofT avatar May 20 '21 00:05 RayAtUofT

Same problem

LiutongZhou avatar Apr 27 '22 19:04 LiutongZhou

I'm running into the same problem whilst running modin.pandas. I've installed dask version 2022.1.1 for Python 3.9. It seems like dask grinds my calculations down to a slow trickle.

acnash avatar May 03 '22 13:05 acnash

Just disable garbage collection. You may need to restart workers as memory fills

gc.disable()
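
For the workers specifically, a sketch (assuming a running Client; untested) could look like:

import gc
from dask.distributed import Client

client = Client()          # or connect to an existing cluster
client.run(gc.disable)     # turn off cyclic GC in every worker process

# ... run the workload ...

client.restart()           # recycle workers once their memory fills up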

kahemker avatar Oct 27 '22 19:10 kahemker

Hi,

I'm having the same problem. I'm on a system with 64 cores and around 1 TB of memory. I'm only using 32 cores, but I'm working with a VERY LARGE dataset (~1.3 TB) of text and doing some preprocessing using spaCy. My dask version is 2022.7.0. I'm working in a Jupyter notebook, and I just get this warning; even after more than an hour, I don't see any progress on the dashboard. I tried disabling garbage collection, but that didn't work for me. Any other workarounds or possible solutions?

sudarshan85 avatar Feb 01 '23 02:02 sudarshan85

I don't see any progress on the dashboard.

@sudarshan85 It's very likely that you are encountering a different issue and the GC warning log is unrelated. I suggest to try a more recent version of dask and open a new issue if this still persists. If there is no progress at all, this is a critical issue we'd like to get to the bottom of

fjetter avatar Feb 09 '23 15:02 fjetter

Is there a way to change the threshold? Or just disable this specific warning? I don't want to set logging to errors only (because other warnings might be relevant)
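
One untested idea is to raise the level of just that logger - a sketch; since the warnings also come from the scheduler and worker processes, it would presumably need to be applied there as well (e.g. via client.run or distributed's logging configuration):

import logging

# hide only the GC-time warning; other distributed warnings stay visible
logging.getLogger("distributed.utils_perf").setLevel(logging.ERROR)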

Danferno avatar Feb 23 '23 08:02 Danferno

The solution I found is to restart the client before every compute.

# Clean up old garbage
client.restart()
# ... your code

DeusNexus avatar Oct 07 '23 05:10 DeusNexus

I have been bitten by this as well. My use case is processing pairs of ICESat-2 and CryoSat-2 data files to extract overlapping sensor data that occurs within a narrow time window. The task is treated as embarrassingly parallel: the processing function is given one CryoSat-2 filename and one or more ICESat-2 filenames to process, but only one IS2 file and one CS2 file are open at any given time.

One thing I have done to help is to set threads_per_worker=1. With threads_per_worker=2 it gives me these warnings after about 2% of the processing. Even with threads_per_worker=1 it still happens, but it appears to happen much further along.
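
For reference, a sketch of the kind of client setup I mean (the worker count is just an example):

from dask.distributed import Client

# one thread per worker process; adjust n_workers to the machine
client = Client(n_workers=8, threads_per_worker=1)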

It would be nice to find a proper resolution for this.

I will also add that I just created a new, clean environment and reinstalled dask, etc., which installed version 2023.12.1.

ebo avatar Dec 23 '23 09:12 ebo

Same issue here

DanyaLearning avatar Feb 06 '24 16:02 DanyaLearning

The solution I found is to restart the client before every compute.

# Clean up old garbage
client.restart()
# ... your code

@DeusNexus, are you saying that you are doing this inside the function you map over futures? Is that safe? I think I am misunderstanding your solution.

ebo avatar Feb 06 '24 19:02 ebo