NeMo-Curator
Unmanaged memory is high and frozen execution
Describe the bug
The warning
2024-10-11 00:04:31,529 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 16.47 GiB -- Worker memory limit: 23.34 GiB
Even though this is only a warning, execution freezes after it appears. I am running the tinystories tutorial on 8 CPU workers, and the freeze happens after the clean_and_unify step of the tutorial.
After the freeze, I checked top, and it still shows 8 active processes.
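For context on the figures in the warning, here is a minimal Python sketch that parses the two values from the message and compares their ratio with Dask's documented default worker memory thresholds (target 0.60, spill 0.70, pause 0.80, terminate 0.95). It assumes the cluster in this report did not override those defaults, which I have not verified. The warning reports unmanaged memory only, so total process memory on the worker is at least this high.

```python
import re

# Dask's documented default worker memory thresholds, as fractions of the
# worker memory limit. Assumption: this cluster uses the defaults.
DEFAULT_THRESHOLDS = {"target": 0.60, "spill": 0.70, "pause": 0.80, "terminate": 0.95}

# Tail of the warning quoted in this report.
WARNING = "Unmanaged memory: 16.47 GiB -- Worker memory limit: 23.34 GiB"


def parse_gib(line):
    """Return (unmanaged_gib, limit_gib) extracted from the warning text."""
    match = re.search(
        r"Unmanaged memory:\s*([\d.]+) GiB -- Worker memory limit:\s*([\d.]+) GiB",
        line,
    )
    if match is None:
        raise ValueError("line does not look like the unmanaged-memory warning")
    return float(match.group(1)), float(match.group(2))


def crossed_thresholds(fraction, thresholds=DEFAULT_THRESHOLDS):
    """List the threshold names whose value is at or below the given fraction."""
    return [name for name, value in thresholds.items() if fraction >= value]


unmanaged, limit = parse_gib(WARNING)
fraction = unmanaged / limit
print(f"unmanaged / limit = {fraction:.1%}")
print("thresholds crossed:", crossed_thresholds(fraction))
```

Under those assumed defaults, unmanaged memory alone is already past the spill threshold and roughly ten percentage points below the pause threshold, so a modest amount of additional managed memory could be enough to pause the worker.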
Steps/Code to reproduce bug
I am trying the tinystories tutorial on the c4 realnewslike dataset.
Download the dataset as follows (obtained from https://huggingface.co/datasets/allenai/c4)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "realnewslike/*"
The dataset is 37G in size and contains 513 files, each with 26953 entries. I have no issues running this tutorial on a smaller (2G) version of the dataset, so I suspect the warning is related to the size of the dataset.
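A rough back-of-the-envelope calculation of the scale involved, using only the figures quoted in this report. It assumes the data is split evenly across the 8 workers, which Dask does not guarantee, and it says nothing about the in-memory footprint, which can be considerably larger than the on-disk size.

```python
# Figures quoted in this report.
NUM_FILES = 513
ENTRIES_PER_FILE = 26953
DATASET_SIZE_G = 37       # on-disk size as reported ("37G")
NUM_WORKERS = 8
WORKER_LIMIT_GIB = 23.34  # from the warning message

total_entries = NUM_FILES * ENTRIES_PER_FILE

# Assumption: an even split across workers (not guaranteed in practice).
entries_per_worker = total_entries / NUM_WORKERS
disk_share_per_worker = DATASET_SIZE_G / NUM_WORKERS

print(f"total entries:          {total_entries:,}")
print(f"entries per worker:     {entries_per_worker:,.0f}")
print(f"on-disk share / worker: {disk_share_per_worker:.3f}G")
print(f"worker memory limit:    {WORKER_LIMIT_GIB} GiB")
```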
Expected behavior
I expected it to finish execution and write the processed data.
Environment overview (please complete the following information)
OS version -- Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)
Python version -- 3.10.15
pip version -- 24.2
dask version -- 2024.7.1
dask_cuda version -- 24.08.02