NeMo-Curator issues

[BUG] Fuzzy deduplication fails on datasets with no duplicates

4

**Describe the bug** When attempting to run fuzzy deduplication on a dataset that has no duplicates, the code errors out. **Steps/Code to reproduce bug** 1) Clone the repo 2) Run...

Maghoumi

bug

[BUG] Better error/checks around input types being CPU/GPU

1

**Describe the bug** Some modules in Curator only support working with CPU datasets, and others only support working on GPU ones. Right now if users accidentally pass in the wrong...

ayushdg

enhancement

jira

[FEA] Add batched files reading to separate_by_metadata.py

2

**Is your feature request related to a problem? Please describe.** **separate_by_metadata.py** script reads all the files at once, and distributes them through the different Dask workers. That could lead to...

miguelusque

enhancement

jira
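
The batched reading requested in the issue above could look roughly like the following. This is a minimal sketch of the general idea only, not the script's actual code: the helper name `batched_files` and its `batch_size` parameter are hypothetical, and how each batch would be handed to Dask workers is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_files(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `batch_size` file paths.

    Hypothetical helper: processing one batch at a time avoids
    handing every input file to the workers at once.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Illustrative file names only.
    for batch in batched_files(["a.jsonl", "b.jsonl", "c.jsonl"], batch_size=2):
        print(batch)
```

Each yielded batch can then be read and processed before the next one is requested, so the number of files in flight is bounded by the batch size.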

Better mimic DocumentDataset's `read_` functions to Dask's `read_` functions

6

Right now, `DocumentDataset` has a couple of `read_*` functions: (1) ``` def read_json( cls, input_files, backend="pandas", files_per_partition=1, add_filename=False, ) ``` (2) ``` def read_parquet( cls, input_files, backend="pandas", files_per_partition=1, add_filename=False, )...

sarahyurick

enhancement

jira

Grammar and punctuation nits in Jupyter Notebooks

3

As I have been following our Jupyter Notebook tutorials (such as [single_gpu_tutorial.ipynb](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb)), I have noticed a lot of small grammar, spelling, and punctuation errors. At some point, I would like...

sarahyurick

documentation

good first issue

Re-add `test_uneven_common_crawl_range` PyTest

PR https://github.com/NVIDIA/NeMo-Curator/pull/235 skips `test_uneven_common_crawl_range` because of how flaky it is. In the future, we may want to debug and re-add it. ``` def test_uneven_common_crawl_range(self): start_snapshot = "2021-03" end_snapshot = "2021-11"...

sarahyurick

Make `max_text_bytes_per_part` configurable

**Is your feature request related to a problem? Please describe.** #77 adds support for longer strings and as a part of those discussions it makes sense to expose some advanced...

ayushdg

enhancement

Explore Dask-Jobqueue's Slurm runner for multi-node Slurm setups

1

**Is your feature request related to a problem? Please describe.** Our current Slurm scripts are a combination of 2 bash scripts that might be difficult to understand and customize in...

ayushdg

enhancement

jira

DocumentDataset read errors when other files are present in directory

**Describe the bug** `DocumentDataset.read_parquet` and `DocumentDataset.read_json` fail with unrelated errors when reading directories that also contain files other than JSONL or Parquet. For example, Apache Spark jobs that write data...

ronjer30

bug
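
One possible workaround for the issue above is to select the data files by extension before passing them to a reader. The sketch below is an assumption about how that filtering might be done, not a fix taken from the repository: `list_data_files` is a hypothetical helper, and the extra file names in the usage example (such as a `_SUCCESS` marker) are illustrative.

```python
import os
from typing import List, Sequence


def list_data_files(
    directory: str,
    extensions: Sequence[str] = (".jsonl", ".parquet"),
) -> List[str]:
    """Return sorted paths of regular files in `directory` with a wanted extension.

    Hypothetical helper: anything else in the directory (marker files,
    checksum files, subdirectories) is skipped.
    """
    wanted = tuple(ext.lower() for ext in extensions)
    selected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.lower().endswith(wanted):
            selected.append(path)
    return selected


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        for name in ["part-0.parquet", "_SUCCESS", "notes.txt"]:
            open(os.path.join(tmp, name), "w").close()
        print(list_data_files(tmp))
```

The resulting list of paths contains only the matching files, so it could be passed as the `input_files` argument instead of the directory itself.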

[FEA] Align the character pruning vs sequence length based pruning for our models.

**Is your feature request related to a problem? Please describe.** We need to align the character pruning vs sequence length based pruning for our models. We need to ensure our...

VibhuJawa

enhancement

jira
