streamz
Dask leaks memory with Batched Kafka and cudf
Dask workers' memory grows gradually during long-running jobs, and the job eventually crashes once worker memory exceeds roughly 80%. Refer to the image below:
Dask is able to process the data at the input rate (600 MB/s), and it is clearly not keeping processed futures in memory, as we can infer from the image. I am using a 10-second window for reading messages from Kafka, so each batch is ~6 GB. The workers, however, use far more memory than that.
Dask configuration: 2 nodes (24-core CPU and 1 T4 GPU each), with 8 workers and 3 threads on each node.
Versions: dask 1.2.2, distributed 1.28.1, cudf 0.8
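For illustration, the pipeline looks roughly like the sketch below. The broker address, topic name, message format, and column names are placeholders rather than the actual job, and the exact options accepted by `from_kafka_batched` depend on the streamz version in use:

```python
# Sketch of a batched-Kafka -> Dask -> cudf pipeline (placeholder names throughout).
import io

import cudf
from dask.distributed import Client
from streamz import Stream

client = Client("scheduler-address:8786")  # connect to the existing Dask cluster

consumer_conf = {
    "bootstrap.servers": "kafka-broker:9092",  # placeholder broker
    "group.id": "streamz-cudf-demo",
    "auto.offset.reset": "latest",
}

def batch_to_cudf(messages):
    """Parse one batch of raw Kafka message values (bytes) into a cudf DataFrame."""
    # Assumes each message is a CSV line with columns a, b, x.
    data = b"\n".join(messages)
    return cudf.read_csv(io.BytesIO(data), names=["a", "b", "x"], header=None)

source = Stream.from_kafka_batched(
    "input-topic",         # placeholder topic
    consumer_conf,
    poll_interval="10s",   # the 10-second window mentioned above
    npartitions=8,
    dask=True,             # scatter each batch to the Dask workers
)

(source
    .map(batch_to_cudf)
    .map(lambda df: df.groupby("a").sum())  # aggregation per window
    .gather()
    .sink(print))

source.start()
```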
Does this happen with Pandas? Have you been able to narrow it down to a specific function in streamz / cudf that causes this?
No, it does not happen with Pandas.
We haven't tried all of the functions yet, but the simple aggregations (max, sum, count, etc.) and single-index groupbys work well on both single and multiple GPUs. With multi-index groupbys, the Dask workers' memory keeps increasing until the crash.
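For reference, the difference can be shown with a standalone sketch (column names and data sizes are made up, not the actual workload), running the multi-index path repeatedly to mimic successive Kafka windows:

```python
import cudf
import numpy as np

n = 1_000_000
df = cudf.DataFrame({
    "a": np.random.randint(0, 100, n),
    "b": np.random.randint(0, 100, n),
    "x": np.random.random(n),
})

# Single-index groupby: behaves well.
single = df.groupby("a").sum()

# Multi-index groupby: the case where worker memory keeps climbing;
# repeated here to mimic one aggregation per 10-second window.
for _ in range(100):
    multi = df.groupby(["a", "b"]).sum()

print(multi.head())
```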
cc @thomcom, sounds like we have a memory leak somewhere in the MultiIndex / groupby codebase.
I'm no longer seeing this issue with the latest cudf release. Should we close this now?
Yes, we should!