NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
**Describe the bug** When attempting to run fuzzy deduplication on a dataset that has no duplicates, the code errors out. **Steps/Code to reproduce bug** 1) Clone the repo 2) Run...
**Describe the bug** Some modules in Curator only support CPU datasets, and others only support GPU ones. Right now, if users accidentally pass in the wrong...
**Is your feature request related to a problem? Please describe.** The **separate_by_metadata.py** script reads all the files at once and distributes them across the Dask workers. That could lead to...
Right now, `DocumentDataset` has a couple of `read_*` functions:

(1)
```
def read_json(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)
```
(2)
```
def read_parquet(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)
```
...
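As a rough illustration of the `files_per_partition` parameter shared by both signatures, here is a minimal sketch. It assumes the parameter controls how many input files are grouped into a single partition. `group_files` is a hypothetical helper written for this example and is not part of `DocumentDataset`.

```python
def group_files(input_files, files_per_partition=1):
    """Split a list of file paths into consecutive groups.

    Hypothetical helper: assumes `files_per_partition` means
    "number of input files read into one partition".
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    files = list(input_files)
    # Slice the list into chunks; the last chunk may be shorter.
    return [
        files[i : i + files_per_partition]
        for i in range(0, len(files), files_per_partition)
    ]
```

With three files and `files_per_partition=2`, this yields two groups, the second holding the single leftover file.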
As I have been following our Jupyter Notebook tutorials (such as [single_gpu_tutorial.ipynb](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)), I have noticed a lot of small grammar, spelling, and punctuation errors. At some point, I would like...
PR https://github.com/NVIDIA/NeMo-Curator/pull/235 skips `test_uneven_common_crawl_range` because of how flaky it is. In the future, we may want to debug and re-add it.
```
def test_uneven_common_crawl_range(self):
    start_snapshot = "2021-03"
    end_snapshot = "2021-11"
    ...
```
**Is your feature request related to a problem? Please describe.** #77 adds support for longer strings, and as part of those discussions it makes sense to expose some advanced...
**Is your feature request related to a problem? Please describe.** Our current Slurm scripts are a combination of two Bash scripts that might be difficult to understand and customize in...
**Describe the bug** `DocumentDataset.read_parquet` and `DocumentDataset.read_json` fail with unrelated errors when reading directories that also contain files other than JSONL or Parquet. For example, Apache Spark jobs that write data...
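One possible workaround for the mixed-directory problem is to build the file list yourself and keep only files with the expected extension. The sketch below uses only the standard library. `list_data_files` and its default extensions are assumptions made for illustration; this is not how Curator resolves input paths.

```python
from pathlib import Path


def list_data_files(directory, extensions=(".jsonl", ".parquet")):
    """Return sorted paths of regular files whose suffix is in `extensions`.

    Hypothetical helper: everything else in the directory is ignored.
    """
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in allowed
    )
```

The resulting list could then be passed as `input_files` in place of the directory path, assuming the reader accepts an explicit list of files.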
**Is your feature request related to a problem? Please describe.** We need to align character-based pruning with sequence-length-based pruning for our models. We need to ensure our...