NeMo-Curator

Unmanaged memory is high and execution freezes

Open • pappagari opened this issue 1 year ago • 3 comments

Describe the bug

The following warning is logged:

2024-10-11 00:04:31,529 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 16.47 GiB -- Worker memory limit: 23.34 GiB
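For context, the two numbers in the warning can be compared with Dask's documented default worker-memory thresholds (target 0.60, spill 0.70, pause 0.80 and terminate 0.95 of the memory limit). The sketch below assumes those defaults were not overridden in this run; `memory_thresholds` is an illustrative helper, not a Dask API. It shows that 16.47 GiB is about 70.6% of the 23.34 GiB limit, just past the spill fraction. If memory kept climbing to the pause fraction (about 18.67 GiB), the worker would stop starting new tasks, which could look like a freeze. This is one possible reading, not a confirmed diagnosis.

```python
# Dask's documented default worker-memory fractions (assumed unchanged here):
#   distributed.worker.memory.target    = 0.60
#   distributed.worker.memory.spill     = 0.70
#   distributed.worker.memory.pause     = 0.80
#   distributed.worker.memory.terminate = 0.95
DEFAULT_FRACTIONS = {"target": 0.60, "spill": 0.70, "pause": 0.80, "terminate": 0.95}


def memory_thresholds(limit_gib: float, fractions: dict = DEFAULT_FRACTIONS) -> dict:
    """Hypothetical helper: convert each fraction into GiB for a given memory limit."""
    return {name: round(limit_gib * frac, 2) for name, frac in fractions.items()}


limit_gib = 23.34      # "Worker memory limit" from the warning
unmanaged_gib = 16.47  # "Unmanaged memory" from the warning

usage = unmanaged_gib / limit_gib
print(f"unmanaged memory = {usage:.1%} of the limit")  # 70.6% of the limit
for name, gib in memory_thresholds(limit_gib).items():
    print(f"{name:>9}: {gib:6.2f} GiB")
```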

Although it is only a warning, execution freezes after it appears. I am running the tinystories tutorial on 8 CPU workers, and the freeze happens after the tutorial's clean_and_unify step. After the freeze, top still shows 8 active processes.
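The Dask page linked in the warning describes manually asking glibc to return freed memory to the OS with `malloc_trim`. A minimal sketch of that approach, assuming Linux with glibc (`libc.so.6`); the function name `trim_memory` is illustrative. It may lower unmanaged memory if the cause is heap fragmentation, but it will not help if the memory is still referenced by live objects.

```python
import ctypes
import sys


def trim_memory() -> int:
    """Ask glibc to release free heap memory back to the OS.

    Returns 1 if some memory was released, 0 otherwise.
    Assumes Linux with glibc; other platforms lack libc.so.6.
    """
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


if __name__ == "__main__":
    if sys.platform.startswith("linux"):
        print("malloc_trim returned", trim_memory())
    # With a dask.distributed Client connected to the cluster, the same
    # function can be run once on every worker:
    #     client.run(trim_memory)
```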

Steps/Code to reproduce bug

I am running the tinystories tutorial on the C4 realnewslike dataset.

Download the dataset as follows (commands from https://huggingface.co/datasets/allenai/c4):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "realnewslike/*"

The dataset is 37G in size and contains 513 files, each with 26953 entries. The tutorial runs without issues on a smaller version of the dataset (2G), so I suspect the warning is related to the size of the dataset.
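For scale, the figures above work out as follows. The sketch treats the reported 37G and 2G as gigabytes; whether peak memory in this pipeline grows in proportion to input size is an assumption, not something measured here.

```python
files = 513               # number of files in realnewslike, as reported
entries_per_file = 26953  # entries per file, as reported
full_size_gb = 37         # full dataset size (37G, read as GB)
small_size_gb = 2         # smaller dataset that ran without problems (2G, read as GB)

total_entries = files * entries_per_file
scale = full_size_gb / small_size_gb

print(f"total entries: {total_entries:,}")  # 13,826,889
print(f"size ratio: {scale:.1f}x")          # 18.5x
print(f"average file size: {full_size_gb / files * 1000:.0f} MB")
```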

Expected behavior

I expected execution to finish and the processed data to be written.

Environment overview (please complete the following information)

OS version -- Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)
Python version -- 3.10.15
pip version -- 24.2
dask version -- 2024.7.1
dask_cuda version -- 24.08.02
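When reporting the environment, it may also help to include the `distributed` version, since the warning is emitted by the `distributed.worker.memory` logger, along with the NeMo Curator version. A small helper using the standard-library `importlib.metadata`; the distribution names in `PACKAGES` (for example `nemo-curator`) are assumptions about how the packages were installed, and `installed_versions` is an illustrative name.

```python
import platform
from importlib import metadata

# Assumed pip distribution names; adjust to match the actual install.
PACKAGES = ["dask", "distributed", "dask-cuda", "nemo-curator", "pandas"]


def installed_versions(packages=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(f"python -- {platform.python_version()}")
    for name, version in installed_versions().items():
        print(f"{name} -- {version or 'not installed'}")
```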

pappagari, Oct 11 '24 17:10