NeMo-Curator issues

Remove dependency on `convert_str_id_to_int` in FuzzyDedup Scripts

**Is your feature request related to a problem? Please describe.** In the minhash script we implicitly convert each string ID into two integer IDs (`doc_id` + `dataset_id`). This is different from...

praateekmahajan

enhancement

jira
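To make the `convert_str_id_to_int` issue above more concrete, the implicit conversion can be pictured as follows. This is only a minimal sketch and not NeMo-Curator's implementation: the `<dataset>-<number>` ID layout, the function name `split_str_id`, and the choice of hash are all assumptions made for illustration.

```python
import hashlib


def split_str_id(str_id: str) -> tuple[int, int]:
    """Split a string ID such as 'wiki-0000042' into (dataset_id, doc_id).

    Assumed (hypothetical) layout: '<dataset name>-<zero-padded number>'.
    """
    dataset_name, _, number = str_id.rpartition("-")
    if not dataset_name or not number.isdigit():
        raise ValueError(f"unexpected ID format: {str_id!r}")
    # Hash the dataset name to a stable 32-bit integer (illustrative choice).
    digest = hashlib.md5(dataset_name.encode("utf-8")).digest()
    dataset_id = int.from_bytes(digest[:4], "big")
    doc_id = int(number)
    return dataset_id, doc_id


pairs = [split_str_id(s) for s in ["wiki-0000042", "wiki-0000043", "news-0000042"]]
```

A scheme like this only works when every ID follows the assumed layout, which is one reason a script might want to avoid depending on it.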

Fuzzy Duplicates Identification fails on batched_merge_and_write when document dataset is read with blocksize

1

When reading a dataset with `DocumentDataset.read_parquet(..., blocksize=???, files_per_partition=None)` and running fuzzy dedup with `protocol=ucx` and `false positive=on`, we run into an error during the `shuffle_docs_on_buckets` -> `_batched_merge_and_write` step:
```python
Stage3 (False Postive Check):...
```

praateekmahajan

bug

jira

Post to internal slack if nightly tests fail

**Is your feature request related to a problem? Please describe.** If the nightly scheduled tests fail, we would like to be notified on Slack. **Describe the solution you'd like** Code...

praateekmahajan

enhancement

jira

[FEA] Enable Best Fit Packing

We should look into enabling best-fit packing as a dataset curation feature. This was used by DeepSeek, and it seems like we can use our existing bin packing features to enable it...

VibhuJawa

enhancement

jira
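As background for the best-fit packing request above, the generic best-fit-decreasing heuristic can be sketched as below. This is an illustrative sketch under stated assumptions, not the method DeepSeek used and not an existing NeMo-Curator API: the function name, the `capacity` parameter, and the requirement that over-long documents are split beforehand are all hypothetical.

```python
import bisect


def best_fit_decreasing(lengths: list[int], capacity: int) -> list[list[int]]:
    """Pack item lengths into bins of size `capacity` (best-fit decreasing)."""
    bins: list[list[int]] = []
    # Sorted list of (remaining_space, bin_index) pairs, one per open bin.
    free: list[tuple[int, int]] = []
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError("split items longer than `capacity` first")
        # First entry with remaining_space >= length is the tightest fit.
        pos = bisect.bisect_left(free, (length, -1))
        if pos < len(free):
            remaining, idx = free.pop(pos)
        else:
            # No open bin can hold this item, so open a new one.
            idx = len(bins)
            bins.append([])
            remaining = capacity
        bins[idx].append(length)
        bisect.insort(free, (remaining - length, idx))
    return bins


packed = best_fit_decreasing([5, 4, 3, 3, 2, 1], capacity=8)
# packed == [[5, 3], [4, 3, 1], [2]]
```

Here the lengths stand in for document token counts and `capacity` for the training sequence length; each inner list is one packed sequence.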

Refactor separate_by_metadata and Partition On to use the same code paths.

**Is your feature request related to a problem? Please describe.** We are adding `partition_on` (https://github.com/NVIDIA/NeMo-Curator/pull/519), which is very similar to `separate_by_metadata`; we should try to refactor `separate_by_metadata` and make...

VibhuJawa

enhancement

jira

Extend support to non-English languages for PII Deidentifier

1

**Is your feature request related to a problem? Please describe.** My team is currently working on removing PII from text data in Southeast Asian languages. When...

hamsarajan

enhancement

jira

Add Regex Modifier

1

## Description
Add a modifier that performs regex replacements.

## Usage
```
regex_params = [
    {"pattern": "ö", "repl": "o"},
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]
modifier = RegexModifier(regex_params)...
```

shuoyangd
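The behaviour described in the Regex Modifier entry above (applying a list of pattern/replacement pairs in order) can be approximated with the standard `re` module. The class below is a standalone, hypothetical sketch; `SimpleRegexModifier` and `modify_document` are names chosen here for illustration and are not claimed to match the `RegexModifier` added in that pull request.

```python
import re


class SimpleRegexModifier:
    """Apply a list of {"pattern", "repl"} rules to a text, in order."""

    def __init__(self, regex_params: list[dict[str, str]]):
        # Compile once so repeated calls do not re-parse the patterns.
        self._rules = [(re.compile(p["pattern"]), p["repl"]) for p in regex_params]

    def modify_document(self, text: str) -> str:
        for pattern, repl in self._rules:
            text = pattern.sub(repl, text)
        return text


regex_params = [
    {"pattern": "ö", "repl": "o"},        # normalise one accented letter
    {"pattern": "[^A-Za-z ]", "repl": ""},  # then drop everything but letters/spaces
]
modifier = SimpleRegexModifier(regex_params)
cleaned = modifier.modify_document("Schön, oder?")
```

Rule order matters in this sketch: replacing `ö` first keeps the letter, whereas running the second rule first would delete it.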

Add a way to pass expected language to FastTextLangId filter

## Description Currently, the FastTextLangId filter only supports filtering by a language ID filter, but sometimes we know what language the data is supposed to be, and it would be...

shuoyangd

Create `Cache` class for exact, fuzzy, and semantic deduplication

2

TODO:
- [x] Exact deduplication files
- [x] Semantic deduplication files
- [x] Fuzzy deduplication files
- [x] Tutorials folder

sarahyurick

gpuci

Hard negative mining for Retriever fine-tuning

## Description
Provides functionality to create training datasets for retriever customization.

## Usage
1. Semantically cluster documents into partitions:
```bash
python3 repartition.py --input-dir= --hard-negative-mining-config= --output-dir= --api-key=
```
2. Mine hard...

vinay-raman
