Sarah Yurick issues

Results: 62 issues of Sarah Yurick

Pandas and cuDF DataFrames in `DocumentDataset`

Right now, there is some confusion around which DataFrame types can be passed into `DocumentDataset`. Currently, we expect them to be Dask or Dask-cuDF DataFrames, so we should add stronger type checking...

bug

jira
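As a rough illustration of the stronger type checking requested in the `DocumentDataset` issue above, the sketch below rejects non-Dask DataFrames by inspecting the package that defines the object's class. The helper names (`is_dask_backed`, `validate_dataframe`) and the set of accepted package names are assumptions made for this example; they are not NeMo Curator's API. A real check would more likely use `isinstance` against the Dask DataFrame class.

```python
# Top-level packages whose DataFrame collections count as Dask-backed.
# This set is an assumption for illustration, not NeMo Curator's actual check.
DASK_PACKAGES = {"dask", "dask_expr", "dask_cudf"}


def is_dask_backed(df) -> bool:
    """Return True if df is a class named DataFrame defined in a Dask package."""
    cls = type(df)
    top_level_package = cls.__module__.split(".")[0]
    return cls.__name__ == "DataFrame" and top_level_package in DASK_PACKAGES


def validate_dataframe(df):
    """Hypothetical guard: fail early on pandas or cuDF objects with a clear message."""
    if not is_dask_backed(df):
        cls = type(df)
        raise TypeError(
            "Expected a Dask or Dask-cuDF DataFrame, "
            f"got {cls.__module__}.{cls.__qualname__}"
        )
    return df
```

Checking the module name avoids importing GPU-only packages just to run the check, at the cost of being less strict than an `isinstance` test.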

[FEA] Add support for Multiple Model Quality Classification

In previous versions of NeMo Curator, we supported multiple model quality classification with a combination of Slurm and Python scripts. These scripts were designed to allow the user to pass...

enhancement

Request for adding Python bindings

Hi there, I am very interested in the work being done here! I would like to propose contributing Python bindings to this repository, with the eventual goal of making it...

Multilingual Data Curation tutorial

Update deduplication scripts

When running the [extract_dedup_data.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/semdedup/extract_dedup_data.py#L61) script, the user may encounter a warning:
```
UserWarning: Insufficient elements for `head`. 10 elements requested, only 0 elements available. Try passing larger `npartitions` to `head`....
```

documentation

good first issue

Add example of how to resume an interrupted `download_common_crawl` job

Since `download_common_crawl` can be quite a large job (with ~100,000 WARC files per full snapshot), it is possible that a user's job may be interrupted, stopped, or cancelled for various...

documentation
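One generic way to resume an interrupted download, as the `download_common_crawl` issue above asks for, is to skip every URL whose output file already exists. The sketch below shows that idea only. `output_path_for` and `pending_urls` are hypothetical helpers, and the file naming scheme is an assumption; none of this reflects how `download_common_crawl` actually names or tracks its outputs.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_path_for(url: str, output_dir: Path) -> Path:
    """Map a WARC URL to a local file path (hypothetical naming scheme)."""
    return output_dir / Path(urlparse(url).path).name


def pending_urls(urls, output_dir):
    """Return the URLs whose output file is missing or empty, in input order."""
    output_dir = Path(output_dir)
    remaining = []
    for url in urls:
        target = output_path_for(url, output_dir)
        # Treat zero-byte files as incomplete so an interrupted write is retried.
        if not target.exists() or target.stat().st_size == 0:
            remaining.append(url)
    return remaining
```

A partially written, non-empty file would still be counted as complete here, so a real workflow would likely write to a temporary name and rename on success.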

Consecutive execution of fuzzy deduplication on different columns fails with errors

Python script to reproduce:
```
from functools import partial
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig, Sequential, get_client
from nemo_curator.datasets import DocumentDataset
def fuzzy_dedupe(dataset, cache_dir, id_field, text_field):
    # dataset.df.reset_index(drop=True)
    fuzzy_dedup_config = FuzzyDuplicatesConfig(...
```

bug

jira

Add more Docker instructions to README

We should add more specific instructions (i.e., docker commands) for how a user can use NeMo Curator via the NeMo Framework Container. It would also be helpful to include instructions...

documentation

Add features from Llama Nemotron tutorial to NeMo Curator modules

While some of the classes in https://github.com/NVIDIA/NeMo-Curator/pull/695 are very specific to the dataset being curated, others could be useful for a variety of datasets and should be added as NeMo...

enhancement

Multilingual PII support

Currently, a user who wants to implement non-English or multilingual PII redaction must edit our source code, which is not ideal. The solution should be fairly simple; see: https://github.com/NVIDIA/NeMo-Curator/blob/c909eb3ec43aba0472e655cc8ab473a48543ea11/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst#multilingual-pii-redaction

enhancement

jira