NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

**Describe the bug** If users/NeMo-Curator imports spacy or a module that transitively imports `thinc` before cluster creation it might lead to situations where only 1 of all available GPUs are...

bug
wontfix
jira

Added batched files reading support to **separate_by_metadata.py**, in order to avoid OOMs. In the current implementation, all the files are read at once, and distributed to the workers. With this...
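The batching idea described in this pull request can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from `separate_by_metadata.py`: the helper name `iter_file_batches` and the `batch_size` parameter are hypothetical, and the file names are made up.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_file_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` paths (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(paths)
    while True:
        # Pull the next chunk lazily so only one batch is held at a time.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Illustrative input: five made-up shard names split into batches of two.
files = [f"shard_{i}.jsonl" for i in range(5)]
batches = list(iter_file_batches(files, batch_size=2))
```

Each batch could then be read and processed before the next one is loaded, which bounds peak memory by the batch size rather than by the total number of files.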

**Describe the bug** Calling `jaccard_shuffle` on an output directory that already contains shuffle docs from a previous run leads to errors:

```
assert bucket_part_start_offset % parts_per_bucket_batch == 0
AssertionError
```

bug
jira
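Until the behavior is addressed, a caller could guard against reusing a populated output directory before starting the shuffle. The sketch below is a caller-side workaround under assumptions, not a fix inside Curator: `ensure_empty_dir` is a hypothetical helper, and the directory and file names are invented for the demonstration.

```python
import os
import tempfile


def ensure_empty_dir(path: str) -> None:
    """Create `path` if needed and fail if it already holds files (hypothetical helper)."""
    os.makedirs(path, exist_ok=True)
    leftovers = sorted(os.listdir(path))
    if leftovers:
        raise FileExistsError(
            f"{path!r} already contains {len(leftovers)} entries, e.g. {leftovers[0]!r}"
        )


with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "shuffle_out")
    ensure_empty_dir(out_dir)  # first run: directory is created and empty

    # Simulate output left behind by a previous run.
    open(os.path.join(out_dir, "stale.parquet"), "w").close()

    try:
        ensure_empty_dir(out_dir)  # second run: should be rejected
        reused_dir_rejected = False
    except FileExistsError:
        reused_dir_rejected = True
```

Failing early with a descriptive message is easier to diagnose than the bare `AssertionError` quoted in the report.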

**Describe the bug** By default when reading from json/parquet files, unless an index is specified, Curator typically reads in each partition with an index ranging from 0->len(partition). However for dataframes...

bug
jira
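The default described above, where each partition carries its own index starting at 0, can be illustrated in plain Python. The remainder of the report is truncated, so this sketch does not reproduce the bug itself; it only shows why per-partition indices repeat across partitions and how a running offset would make them unique. The data and the `global_index` helper are hypothetical.

```python
from typing import List, Sequence

# Two made-up partitions of documents.
partitions: List[List[str]] = [["doc a", "doc b", "doc c"], ["doc d", "doc e"]]

# Per-partition indexing: every partition restarts at 0, so values repeat.
local_index = [i for part in partitions for i in range(len(part))]


def global_index(parts: Sequence[Sequence[str]]) -> List[int]:
    """Assign a unique index across partitions using a running offset (hypothetical helper)."""
    out: List[int] = []
    offset = 0
    for part in parts:
        out.extend(range(offset, offset + len(part)))
        offset += len(part)
    return out


unique_index = global_index(partitions)
```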

**Describe the bug** Whenever I run the `download_common_crawl.py` script in the examples folder, it starts extracting the data after downloading the shards. During extraction, warnings appear that say this...

bug

On CPU, when running `extract_single_partiton`, the Dask workers raise `CudaRunTimeError`. Could someone help? My code: def main(): if get_all_files_paths_under(config.cleaned_output_dir): logger.warning("Files is already exists, skipping...") sys.exit(0) with get_client(scheduler_address=config.cpu_scheduler_address) as client:...

bug

## Description Fix typo. `langauges` should be `languages`. ## Checklist - [x] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [x] New or Existing tests cover these changes. - [x]...

## Description Fixed URL for Fasttext language identification model download - [✓ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [✓] New or Existing tests cover these changes. -...

This tutorial demonstrates how to use NVIDIA's NeMo Curator library to modify text data containing Personally Identifiable Information (PII) using large language models (LLMs). We'll explore both asynchronous and synchronous...

## Description ## Usage ```python # Add snippet demonstrating usage ``` ## Checklist - [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [ ] New or Existing tests...