
Dask leaks memory with Batched Kafka and cudf

Open skmatti opened this issue 5 years ago • 5 comments

Dask workers' memory grows gradually on long-running jobs, and the job eventually crashes once worker memory exceeds roughly 80%. Refer to the image below:

[screenshot: Dask dashboard showing worker memory climbing over the course of the job]

Dask is able to process the data at the input rate (600 MB/s), and it is clearly not holding processed futures in memory, as the image shows. I am using a 10-second window for reading messages from Kafka, so each batch is ~6 GB. But the workers appear to use much more memory than that.
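For reference, here is a minimal sketch of the kind of batched-Kafka-to-cudf pipeline being described. The topic name, broker address, JSON message format, partition count, and the groupby aggregation are all placeholders, not details from the actual workload:

```python
import json

import cudf
import pandas as pd
from dask.distributed import Client
from streamz import Stream

client = Client("scheduler-host:8786")  # connect to the existing Dask cluster (placeholder address)

consumer_conf = {
    "bootstrap.servers": "kafka-broker:9092",   # placeholder broker
    "group.id": "streamz-cudf-demo",
    "session.timeout.ms": 60000,
}

def to_cudf(messages):
    # Placeholder parsing: decode each raw Kafka message (JSON bytes)
    # and move the batch onto the GPU as a cudf DataFrame.
    records = [json.loads(m) for m in messages]
    return cudf.DataFrame.from_pandas(pd.DataFrame(records))

source = Stream.from_kafka_batched(
    "my-topic",               # placeholder topic
    consumer_conf,
    poll_interval="10s",      # the 10-second window mentioned above
    npartitions=16,
    dask=True,                # scatter each batch to the Dask workers
)

(source.map(to_cudf)
       .map(lambda gdf: gdf.groupby("key").agg({"value": "sum"}))
       .gather()
       .sink(print))

source.start()
```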

Dask configuration: 2 nodes (24-core CPU and 1 T4 GPU each), 8 workers and 3 threads on each node.

Versions: dask 1.2.2, distributed 1.28.1, cudf 0.8
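For completeness, workers with that layout would be launched roughly like this; the scheduler address is a placeholder, and the exact invocation is an assumption about how the cluster was started rather than the reporter's actual command:

```bash
# Run on each of the two nodes; scheduler-host is a placeholder.
# --nprocs 8 --nthreads 3 matches the "8 workers and 3 threads" above.
dask-worker scheduler-host:8786 --nprocs 8 --nthreads 3
```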

skmatti avatar Jun 12 '19 18:06 skmatti

Does this happen with Pandas? Have you been able to narrow it down to a specific function in streamz / cudf that causes this?

kkraus14 avatar Jun 13 '19 17:06 kkraus14

No, it does not happen with Pandas.

We haven't tried all the functions yet, but the simple aggregations (max, sum, count, etc.) and single-index groupbys work well on both single and multiple GPUs. With multi-index groupbys, the Dask workers' memory keeps growing until they crash; a sketch of the two patterns follows.
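A minimal sketch contrasting the two groupby patterns, with placeholder column names and data (not taken from the actual workload):

```python
import cudf

gdf = cudf.DataFrame({
    "a": [1, 2, 1, 2],
    "b": ["x", "x", "y", "y"],
    "value": [10.0, 20.0, 30.0, 40.0],
})

# Single-index groupby: reported to behave well.
single = gdf.groupby("a").agg({"value": "sum"})

# Multi-index groupby (grouping on several columns): the case where
# worker memory was reported to grow until the crash.
multi = gdf.groupby(["a", "b"]).agg({"value": "sum"})
```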

chinmaychandak avatar Jun 13 '19 22:06 chinmaychandak

cc @thomcom sounds like we have a memory leak somewhere in the MultiIndex / groupby codebase.

kkraus14 avatar Jun 15 '19 17:06 kkraus14

I'm no longer seeing this issue with the latest cudf release. Should we close this now?

jdye64 avatar Sep 04 '19 21:09 jdye64

Yes, we should!

chinmaychandak avatar Sep 04 '19 22:09 chinmaychandak