[BUG] download process has memory leak during extraction to jsonl

Open zahramahani opened this issue 10 months ago • 0 comments

Describe the bug

Whenever I run the download_common_crawl.py script in the examples folder, it starts extracting the data once the shards have been downloaded. During extraction, warnings appear saying that the code is not freeing memory, and after a while the process is killed.

The second problem is that the extraction is very slow for me: 10 shards take about 1 hour and 45 minutes. Is there any extra configuration that I have missed?

Here is a screenshot of the problem:

Screenshot from 2024-04-19 00-17-05

Environment overview

  • Environment location: local server
  • Method of NeMo-Curator install: pip install --extra-index-url https://pypi.nvidia.com .

Environment details

  • OS version: Debian GNU/Linux 11 (bullseye)
  • Dask version: 2024.1.1
  • Python version: 3.10.14

zahramahani, Apr 20 '24 09:04