NeMo-Curator
[BUG] download process has memory leak during extraction to jsonl
Describe the bug
Whenever I run the download_common_crawl.py script in the examples folder, it starts extracting the data once the shards have been downloaded. During extraction, warnings appear saying that memory is not being freed, and after a while the process is killed.
The second problem is that extraction is very slow for me: it extracts 10 shards in about 1 hour and 45 minutes. Is there any extra configuration that I have missed?
Here is a screenshot of the problem.
Environment overview
- Environment location: local server
- Method of NeMo-Curator install: pip install --extra-index-url https://pypi.nvidia.com .
Environment details
- OS version: Debian GNU/Linux 11 (bullseye)
- Dask version: 2024.1.1
- Python version: 3.10.14
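
For anyone trying to reproduce this, the environment details above can be gathered with a short script. This is only a sketch using the Python standard library. The helper names `package_version` and `environment_report` are made up for illustration, and the distribution name `nemo_curator` is an assumption that may need adjusting to match the installed package.

```python
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report() -> dict:
    """Collect the fields listed under 'Environment details'."""
    return {
        "OS version": platform.platform(),
        "Dask version": package_version("dask"),
        "Python version": platform.python_version(),
        # Assumed distribution name; change it if the package is registered differently.
        "NeMo-Curator version": package_version("nemo_curator"),
    }


if __name__ == "__main__":
    # Print the report in the same dash-list style used in this issue.
    for field, value in environment_report().items():
        print(f"- {field}: {value}")
```

Running the script prints one line per field, which can be pasted directly into a report like this one.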