NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

**Describe the bug** NeMo Curator is not functioning correctly in a Docker environment. The image pulled with docker pull nvcr.io/nvidia/nemo:24.01.01.framework does not include NeMo Curator. After cloning the NeMo Curator repository...

bug
documentation

**Is your feature request related to a problem? Please describe.** Currently many functionalities/tests do not work when Dask query planning is enabled (the default Dask behavior). This is an issue to...

enhancement

Experimental change to improve IO performance when multiple json files are mapped to each dask-dataframe partition. **Context**: I was originally exploring a similar optimization to improve remote-storage performance, and found...

**Is your feature request related to a problem? Please describe.** The codebase has some tutorials/examples showcasing CPU only or GPU only modules, but not both. It would be good to...

documentation
enhancement
jira

**Is your feature request related to a problem? Please describe.** We should document installing with rapids nightlies. **Describe alternatives you've considered** We currently have to do below: ```diff diff --git...

documentation
enhancement

I'm trying to run the PII example [here](https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/find_pii_and_deidentify.py).
```
# for gpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py --device gpu
# for cpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py
```
On CPU, I get memory warnings and eventual...

bug

In debugging a Curator pipeline, I was re-running the same stages multiple times. I was confused when FuzzyDedup succeeded the first time, but failed an assertion every time thereafter: ```...

enhancement
jira

**Is your feature request related to a problem? Please describe.** NeMo Curator supports document datasets as dataframes today and includes some helpers to read from JSON/Parquet files. **Describe the solution...

enhancement
jira

When scripts finish successfully, there are Dask "errors" that appear in proportion to the number of workers. ``` Writing to disk complete for 3 partitions 2024-03-20 10:31:01,593 - distributed.worker -...

jira

**Is your feature request related to a problem? Please describe.** Currently there is logic in both `get_all_files_under` and `read_json` that relies on the files being present locally and doesn't work...

enhancement
jira