NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

**Describe the bug** When attempting to run fuzzy deduplication on a dataset that has no duplicates, the code errors out. **Steps/Code to reproduce bug** 1) Clone the repo 2) Run...

bug

**Describe the bug** Some modules in Curator only support working with CPU datasets, and others only support working on GPU ones. Right now if users accidentally pass in the wrong...

enhancement
jira

**Is your feature request related to a problem? Please describe.** **separate_by_metadata.py** script reads all the files at once, and distributes them through the different Dask workers. That could lead to...

enhancement
jira

Right now, `DocumentDataset` has a couple of `read_*` functions:

(1)
```
def read_json(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)
```
(2)
```
def read_parquet(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)
```
...

enhancement
jira

As I have been following our Jupyter Notebook tutorials (such as [single_gpu_tutorial.ipynb](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)), I have noticed a lot of small grammar, spelling, and punctuation errors. At some point, I would like...

documentation
good first issue

PR https://github.com/NVIDIA/NeMo-Curator/pull/235 skips `test_uneven_common_crawl_range` because of how flaky it is. In the future, we may want to debug and re-add it.

```
def test_uneven_common_crawl_range(self):
    start_snapshot = "2021-03"
    end_snapshot = "2021-11"
    ...
```

**Is your feature request related to a problem? Please describe.** #77 adds support for longer strings and as a part of those discussions it makes sense to expose some advanced...

enhancement

**Is your feature request related to a problem? Please describe.** Our current Slurm scripts are a combination of two Bash scripts that might be difficult to understand and customize in...

enhancement
jira

**Describe the bug** `DocumentDataset.read_parquet` and `DocumentDataset.read_json` fail with unrelated errors when reading directories that also contain files other than JSONL or Parquet. For example, Apache Spark jobs that write data...

bug
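
One possible workaround for the `read_parquet` / `read_json` issue above is to build the file list yourself and keep only files with the expected extension before handing it to the reader. The sketch below is illustrative only: `list_data_files` is a hypothetical helper, not part of NeMo-Curator, and the default extensions and the `_SUCCESS` example are assumptions about what such a directory might contain.

```python
from pathlib import Path


def list_data_files(directory, extensions=(".jsonl", ".parquet")):
    """Return sorted paths of regular files in `directory` with a wanted suffix.

    Hypothetical helper (not a NeMo-Curator API). Files such as `_SUCCESS`
    or `*.crc` are skipped because their final suffix is not in `extensions`.
    """
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


# Assumed usage: pass the filtered list as `input_files`, e.g.
# files = list_data_files("/path/to/output", extensions=(".parquet",))
# dataset = DocumentDataset.read_parquet(files)
```

Filtering on the final suffix only looks at the top level of the directory; nested layouts would need a recursive walk instead.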

**Is your feature request related to a problem? Please describe.** We need to align character pruning with sequence-length-based pruning for our models. We need to ensure our...

enhancement
jira