Sarah Yurick

Results: 62 issues by Sarah Yurick

Right now, there is some confusion around DataFrames being passed into `DocumentDataset`. For now, we expect them to be Dask or Dask-cuDF DataFrames, so we should add stronger type checking...

bug
jira

In previous versions of NeMo Curator, we supported multiple model quality classification with a combination of Slurm and Python scripts. These scripts were designed to allow the user to pass...

enhancement

Hi there, I am very interested in the work being done here! I would like to propose contributing Python bindings to this repository, with the eventual goal of making it...

When running the [extract_dedup_data.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/semdedup/extract_dedup_data.py#L61) script, the user may encounter a warning:

```
UserWarning: Insufficient elements for `head`. 10 elements requested, only 0 elements available. Try passing larger `npartitions` to `head`.
```

...

documentation
good first issue

Since `download_common_crawl` can be quite a large job (with ~100,000 WARC files per full snapshot), it is possible that a user's job may be interrupted, stopped, or cancelled for various...

documentation

Python script to reproduce:

```
from functools import partial

from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig, Sequential, get_client
from nemo_curator.datasets import DocumentDataset


def fuzzy_dedupe(dataset, cache_dir, id_field, text_field):
    # dataset.df.reset_index(drop=True)
    fuzzy_dedup_config = FuzzyDuplicatesConfig(...
```

bug
jira

We should add more specific instructions (i.e., docker commands) for how a user can use NeMo Curator via the NeMo Framework Container. It would also be helpful to include instructions...

documentation

While some of the classes in https://github.com/NVIDIA/NeMo-Curator/pull/695 are very specific to the dataset being curated, others could be useful for a variety of datasets and should be added as NeMo...

enhancement

Currently, a user wanting to implement non-English multilingual PII redaction must edit our source code. This is not ideal. The solution should be pretty simple, see: https://github.com/NVIDIA/NeMo-Curator/blob/c909eb3ec43aba0472e655cc8ab473a48543ea11/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst#multilingual-pii-redaction

enhancement
jira