NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
**Describe the bug** NeMo Curator does not work correctly in a Docker environment. The image pulled with `docker pull nvcr.io/nvidia/nemo:24.01.01.framework` does not include NeMo Curator. After cloning the NeMo Curator repository...
**Is your feature request related to a problem? Please describe.** Currently, many functionalities and tests do not work when Dask query planning is enabled (the default Dask behavior). This is an issue to...
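For context, a minimal sketch of one way to turn query planning off while such incompatibilities exist. It assumes a Dask release that still ships the legacy DataFrame implementation and relies on Dask's `DASK_*` environment-variable configuration; the helper name `disable_dask_query_planning` is hypothetical and not part of NeMo Curator.

```python
import os


def disable_dask_query_planning() -> None:
    """Request the legacy (non query-planning) Dask DataFrame backend.

    Assumption: Dask reads DASK_* environment variables when it is first
    imported, so this must run before any `import dask` in the process.
    """
    # Double underscore maps to a nested config key: dataframe.query-planning
    os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"


# Call this at the very top of a script, before importing dask.dataframe.
disable_dask_query_planning()
```

Whether this setting is honored depends on the installed Dask version, so treat it as a temporary workaround and not a fix.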
Experimental change to improve I/O performance when multiple JSON files are mapped to each dask-dataframe partition. **Context**: I was originally exploring a similar optimization to improve remote-storage performance, and found...
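To illustrate what "multiple JSON files per partition" means, here is a standard-library sketch that groups JSON Lines files into fixed-size batches and reads each batch in a single call. It is not the change proposed in the issue; `chunk_files`, `read_partition`, and the `files_per_partition` parameter are hypothetical names used only for illustration.

```python
import json
from typing import Iterator, List


def chunk_files(files: List[str], files_per_partition: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `files_per_partition` paths."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    for start in range(0, len(files), files_per_partition):
        yield files[start:start + files_per_partition]


def read_partition(paths: List[str]) -> List[dict]:
    """Read every JSON Lines file in `paths` into one list of records."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

In a Dask pipeline, each group produced by `chunk_files` would typically become one delayed `read_partition` task, so the per-file overhead is paid once per partition.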
**Is your feature request related to a problem? Please describe.** The codebase has some tutorials/examples showcasing CPU-only or GPU-only modules, but not both. It would be good to...
**Is your feature request related to a problem? Please describe.** We should document installing with RAPIDS nightlies. **Describe alternatives you've considered** We currently have to do the following: ```diff diff --git...
I'm trying to run the PII example [here](https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/find_pii_and_deidentify.py).
```
# for gpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py --device gpu
# for cpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py
```
On CPU, I get memory warnings and eventual...
In debugging a Curator pipeline, I was re-running the same stages multiple times. I was confused when FuzzyDedup succeeded the first time, but failed an assertion every time thereafter: ```...
**Is your feature request related to a problem? Please describe.** NeMo Curator supports document datasets as dataframes today and includes some helpers to read from JSON/Parquet files. **Describe the solution...
When scripts finish successfully, there are Dask "errors" that appear in proportion to the number of workers. ``` Writing to disk complete for 3 partitions 2024-03-20 10:31:01,593 - distributed.worker -...
**Is your feature request related to a problem? Please describe.** Currently, there is logic in both `get_all_files_under` and `read_json` that relies on the files being present locally and doesn't work...