NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

- [x] Raise a clear and readable error if the Dask client is not GPU-based
- [x] Raise a clear and readable error if the dataset’s backend is not cuDF

gpuci

Closes https://github.com/NVIDIA/NeMo-Curator/issues/376.

gpuci

## Description

## Usage

```python
# Add snippet demonstrating usage
```

## Checklist

- [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md).
- [ ] New or Existing tests...

updates:
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.6.0...v5.0.0)
- [github.com/psf/black: 24.4.2 → 25.1.0](https://github.com/psf/black/compare/24.4.2...25.1.0)
- [github.com/PyCQA/isort: 5.13.2 → 6.0.1](https://github.com/PyCQA/isort/compare/5.13.2...6.0.1)

As part of https://github.com/NVIDIA/NeMo-Curator/issues/335 we investigated how to improve performance, and we came up with a simple broadcast merge to perform the left-anti join...
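To illustrate the idea of a broadcast-style left-anti join, here is a minimal sketch in plain pandas. It is not the implementation discussed in the issue: the function name `left_anti_join_broadcast`, the list-of-partitions layout, and the sample data are all assumptions made for illustration. The point is only that the small right-hand table is materialized once and reused for every partition of the large left-hand table, so the large side never has to be shuffled.

```python
import pandas as pd


def left_anti_join_broadcast(partitions, small_right, on):
    """Keep rows of each left partition whose key is absent from small_right.

    Hypothetical helper: `partitions` is a list of pandas DataFrames standing
    in for the partitions of a large distributed table.
    """
    # Materialize the small side once; every partition reuses the same keys.
    keys_to_drop = small_right[on].unique()
    return [part[~part[on].isin(keys_to_drop)] for part in partitions]


# Made-up data: two partitions of documents and a small table of IDs to remove.
left_partitions = [
    pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]}),
    pd.DataFrame({"id": [4, 5], "text": ["d", "e"]}),
]
ids_to_remove = pd.DataFrame({"id": [2, 5]})

kept = pd.concat(left_anti_join_broadcast(left_partitions, ids_to_remove, on="id"))
print(kept["id"].tolist())
```

This approach only makes sense when the right-hand side is small enough to be copied to every worker; for two large tables a shuffle-based join would still be needed.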

**Description** We should add an option to perform clustering based on sampling in SemDedup, considering GPU memory constraints. Specifically, if sample_for_clustering=True, the system should: 1. Perform sampling before clustering. The...
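A rough sketch of what sampling before clustering could look like, using NumPy only. The request above is truncated, so everything after the sampling step is an assumption: here the centroids are fitted on the sample and then every embedding is assigned to its nearest centroid. Only `sample_for_clustering` comes from the request; `sample_fraction`, `kmeans_fit`, `assign`, and `cluster_with_sampling` are hypothetical names, and the tiny k-means loop stands in for whatever GPU clustering SemDedup actually uses.

```python
import numpy as np


def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans_fit(points, k, n_iter=10, seed=0):
    """Very small Lloyd-style k-means; illustrative only."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids


def cluster_with_sampling(embeddings, k, sample_for_clustering=False,
                          sample_fraction=0.1, seed=0):
    """Fit centroids on a sample (assumption), then label all embeddings."""
    rng = np.random.default_rng(seed)
    fit_data = embeddings
    if sample_for_clustering:
        # Sample at least k points, and never more than are available.
        n = min(len(embeddings), max(k, int(len(embeddings) * sample_fraction)))
        fit_data = embeddings[rng.choice(len(embeddings), size=n, replace=False)]
    centroids = kmeans_fit(fit_data, k, seed=seed)
    return assign(embeddings, centroids)


# Made-up embeddings: 200 vectors of dimension 8.
embeddings = np.random.default_rng(42).normal(size=(200, 8))
labels = cluster_with_sampling(embeddings, k=4, sample_for_clustering=True)
print(labels.shape)
```

Fitting on a sample bounds the memory needed for the fitting step by the sample size rather than the dataset size, at the cost of centroids that may be slightly less representative.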

enhancement
jira

Currently the error is raised at the Dask layer, which is not helpful to the user:

```
"/opt/NeMo-Text-Curator/nemo_curator/datasets/doc_dataset.py", line 220, in read_custom
    read_data(
  File "/opt/NeMo-Text-Curator/nemo_curator/utils/distributed_utils.py", line 604, in read_data
    return read_data_files_per_partition(
...
```

Improve performance by adding model compilation:
- [ ] https://github.com/rapidsai/crossfit/issues/90
- [ ] [Possible Memory Estimation Issue Leading to OOMs and Restarts #72](https://github.com/rapidsai/crossfit/issues/72)
- [ ] [Semantic dedup (uses Crossfit...

enhancement
jira

**Is your feature request related to a problem? Please describe.** When using a `cache_dir` in modules like `FuzzyDedup`, if the user provides a `cache_dir` that was also used previously, if...

enhancement
jira

cuDF now supports long strings, so we can deprecate the `max_text_bytes_per_part` parameter.

Related:
- https://github.com/NVIDIA/NeMo-Curator/pull/77
- https://github.com/NVIDIA/NeMo-Curator/issues/233
- https://github.com/NVIDIA/NeMo-Curator/pull/314
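One common way to deprecate a keyword argument in Python is to keep accepting it for a while and emit a `DeprecationWarning` when it is passed. The sketch below shows that pattern with the standard library only. The function `process_text` and its body are hypothetical placeholders, not NeMo-Curator code; only the parameter name `max_text_bytes_per_part` comes from the issue.

```python
import warnings


def process_text(texts, max_text_bytes_per_part=None):
    """Hypothetical function that still accepts a deprecated parameter."""
    if max_text_bytes_per_part is not None:
        # Warn the caller but keep working, so existing code does not break.
        warnings.warn(
            "max_text_bytes_per_part is deprecated and ignored; "
            "it may be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
    return [t.strip() for t in texts]


# Passing the old parameter still works but triggers a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = process_text(["  hello ", "world  "], max_text_bytes_per_part=1024)

print(result, [str(w.category.__name__) for w in caught])
```

Keeping the parameter in the signature for a deprecation period lets existing callers upgrade without an immediate `TypeError`.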

enhancement
jira