NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

**Describe the bug** Semantic dedup often gets stuck when we call `semantic_cluster_dedup.extract_dedup_data`. **Steps/Code to reproduce bug** Run semantic dedup with `client = get_client(device_type='gpu', protocol='ucx')`. **Environment overview**...

bug

## Description This PR adds - [x] Docstrings for all classes in the image curation - [ ] API docs for the docstrings - [ ] Pages in the user...

documentation

Closes https://github.com/NVIDIA/NeMo-Curator/issues/70 cc @VibhuJawa

documentation

…client ## Description This PR adds an example demonstrating the usage of the Kubernetes Python client to execute NeMo modules on the scheduler pod. ## Usage ```python Updated documentation. #...

documentation

**Describe the bug** I have been using NeMo Curator to extract data from Common Crawl using the function `from nemo_curator.download import download_common_crawl`. My target language is Thai, but after running...

bug

## Description This PR adds support for parallel data curation. Namely: - A new dataset class `ParallelDataset` that supports loading and writing parallel data in simple bitext format. - A...

As we have added support for HF model translation via CrossFit, we are working towards performance improvement with ctranslate2. This work depends on adding support for ctranslate2 in CrossFit, and...

enhancement
jira

**Describe the bug** If the merge result between the text and bucket-mapping DataFrames is empty for any iteration, the logic fails. The failure is observed here but originates from https://github.com/NVIDIA/NeMo-Curator/blob/fe9fd6f46a932689ba036c623b2737298478c8ea/nemo_curator/utils/fuzzy_dedup_utils/shuffle_utils.py#L144 being...

bug

Is your feature request related to a problem? I am frustrated that the `get_word_splitter` function does not handle Japanese text correctly. For example, Japanese does not have spaces between words,...

enhancement
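
The problem described in the request above can be illustrated with a minimal sketch. It assumes a naive whitespace-based splitter; `whitespace_split` is a hypothetical helper written for this illustration and is not the `get_word_splitter` implementation from NeMo-Curator, whose behavior the excerpt does not show.

```python
def whitespace_split(text: str) -> list[str]:
    """Naive splitter that breaks on runs of whitespace.

    Hypothetical helper for illustration only; not NeMo-Curator code.
    """
    return text.split()


english = "The cat sat on the mat"
japanese = "私は学生です"  # "I am a student"; written without spaces

english_tokens = whitespace_split(english)
japanese_tokens = whitespace_split(japanese)

# English yields one token per word; the Japanese sentence stays a single token.
print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```

Because the Japanese sentence contains no whitespace, the whole sentence comes back as one "word", so any word-count-based filter built on such a splitter would likely misjudge Japanese documents.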

## Description Changed the model name from meta/llama3.1-405b-instruct and stg/meta/llama3.1-405b-instruct to meta/llama-3.1-405b-instruct. ## Usage ```python # Add snippet demonstrating usage async def generate_subtopics(client, topic, n_subtopics): prompt = TOPIC_GENERATION_PROMPT_TEMPLATE.format(topic=topic, n_subtopics=n_subtopics) response =...