
Garbage collection issue with GRNBoost2

Open cflerin opened this issue 6 years ago • 2 comments

I'm running Arboreto's implementation of GRNBoost2 via the pySCENIC command line, but I figured this issue probably belongs here. I get the following warning, which repeats a number of times throughout the run. GRNBoost2 does seem to complete successfully most of the time, but it would be nice to avoid the performance hit this warning seems to imply. Any ideas on how to solve this? I've noticed that it may occur more often with larger expression matrices (10,000 cells, 20,000 genes). I'm using dask v1.0.0, if that helps.

distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
[... the 11% warning repeats many more times ...]
distributed.utils_perf - WARNING - full garbage collections took 12% CPU time recently (threshold: 10%)

Thanks for any help you can provide. Chris

cflerin avatar Jun 24 '19 11:06 cflerin

I have encountered this issue as well.

rjb67 avatar Jul 01 '19 18:07 rjb67

Just to follow up on this: I've tried experimenting with the `repartition_multiplier` in the `create_graph` function. It appears that the dask graph is currently repartitioned to the number of client cores: https://github.com/tmoerman/arboreto/blob/3ff7b6f987b32e5774771751dea646fa6feaaa52/arboreto/core.py#L442-L444
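For reference, a minimal sketch of how that partition count would be derived from the client's core count and the multiplier (the function name `compute_npartitions` is my own, hypothetical, and only mirrors the logic at the linked lines, not the actual arboreto code):

```python
def compute_npartitions(client_core_count, repartition_multiplier=1):
    """Hypothetical sketch: target partition count for the dask graph,
    derived from the number of client cores times a multiplier, as the
    linked create_graph lines appear to do."""
    return client_core_count * repartition_multiplier

# A client with 20 cores and the default multiplier would yield 20
# partitions, consistent with the npartitions=20 in the graph below.
print(compute_npartitions(20))
```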

This is my dask graph after repartitioning:

>>> graph
Dask DataFrame Structure:
                    TF  target importance
npartitions=20
                object  object    float64
                   ...     ...        ...
...                ...     ...        ...
                   ...     ...        ...
                   ...     ...        ...
Dask Name: repartition, 80874 tasks

I then tried a range of partition settings, from no repartitioning at all up to a few thousand partitions, but no matter the setting I would always get these garbage collection warnings. Perhaps it has something to do with the number of tasks instead?

cflerin avatar Jul 10 '19 12:07 cflerin