datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
- SFT dataset support.
Hi, I started using datatrove for deduplication. While I managed to understand the minhash_deduplication script, I'm having difficulties understanding the outputs of sentence_deduplication.py. All I obtain are 'intermediate', 'sent_dups'...
I’ve been using HuggingFaceDatasetWriter and noticed that it seems to default to uploading to the hub when I intended to save locally only. Could we consider adding a parameter to...
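A sketch of the behavior this request describes: a writer with an explicit flag that controls whether results are uploaded or only written locally. The names here (`LocalOrHubWriter`, `push_to_hub`) are hypothetical illustrations, not datatrove's actual `HuggingFaceDatasetWriter` API.

```python
# Hypothetical writer with an opt-in upload flag; by default it only
# writes a local JSONL file. Illustrative only -- not datatrove's API.
import json
from pathlib import Path

class LocalOrHubWriter:
    def __init__(self, output_dir: str, push_to_hub: bool = False):
        self.output_dir = Path(output_dir)
        self.push_to_hub = push_to_hub

    def write(self, records: list[dict]) -> None:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        path = self.output_dir / "data.jsonl"
        with path.open("w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")
        if self.push_to_hub:
            # Placeholder: a real writer would perform the hub upload here.
            print(f"would upload {path} to the hub")

writer = LocalOrHubWriter("out", push_to_hub=False)
writer.write([{"text": "hello"}])
```

With `push_to_hub=False` as the default, saving locally becomes the unsurprising path and uploading requires an explicit opt-in.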
https://github.com/huggingface/datatrove/blob/734990228d305bdd38c2c3bab4e697d988c9ae68/src/datatrove/pipeline/readers/huggingface.py#L94 How about adding a `Dataset` type parameter? This would handle the case where the dataset is processed at runtime and passed as a `Dataset` object. 😀
Hi everyone, I want to do deduplication, so for now I'm running tests using minhash_deduplication.py. I'm using a server where I need to add account and constraint info, so I...
How about adding a custom word tokenizer class in `utils/word_tokenizers.py`? The reasons are as follows: + I want to use a tokenizer other than the predefined ones in `word_tokenizers.WORD_TOKENIZER_FACTORY`, such as [khaiii](https://github.com/kakao/khaiii). + Some...
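A custom tokenizer along these lines might look like the sketch below. It assumes an interface similar in spirit to datatrove's word tokenizers (a class exposing `word_tokenize` and `sent_tokenize`, registered in a factory dict by language code); the class and factory names here are hypothetical, and the real base class may differ.

```python
# Minimal stand-in for a pluggable word tokenizer. A real implementation
# would call an external tool such as khaiii instead of regexes.
import re
from typing import Callable

class CustomWordTokenizer:
    def word_tokenize(self, text: str) -> list[str]:
        # Split into word-character runs; punctuation is dropped.
        return re.findall(r"\w+", text)

    def sent_tokenize(self, text: str) -> list[str]:
        # Naive sentence split on ., !, ? followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical factory mapping language codes to tokenizer constructors,
# mirroring the idea of WORD_TOKENIZER_FACTORY.
TOKENIZER_FACTORY: dict[str, Callable[[], CustomWordTokenizer]] = {
    "ko": CustomWordTokenizer,
}

tok = TOKENIZER_FACTORY["ko"]()
print(tok.word_tokenize("Hello, world!"))     # ['Hello', 'world']
print(tok.sent_tokenize("One. Two! Three?"))  # ['One.', 'Two!', 'Three?']
```

Registering by language code keeps call sites unchanged: pipeline blocks look up the tokenizer for a language and never need to know which implementation backs it.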
Hi, I have implemented a pipeline to process the Common Crawl (CC) data, similar to the FineWeb example in the example folder. The main issue I'm encountering is that, when...
We do not want to store a cluster_id for the sentinel point, since it is not part of the current data being processed.
I forget where in the docs/code I saw that you should not launch a Slurm executor from an `srun` interactive session - avoiding which is not always possible....
Hi, when I'm running the minhash dedup by index, I find the cluster results produced by MinhashDedupCluster are a bit strange. ``` -rw-r--r-- 1 root root 108K Jul 12 12:40...
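Conceptually, a minhash cluster stage groups documents into connected components from the duplicate pairs found in the bucket stage. The union-find sketch below illustrates that idea only; it is not datatrove's actual MinhashDedupCluster implementation, and the `cluster` function name is made up for this example.

```python
# Union-find sketch: given (doc_a, doc_b) duplicate pairs, group the
# documents into connected components, mapping each document id to the
# smallest id in its cluster.

def cluster(pairs: list[tuple[int, int]]) -> dict[int, int]:
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id becomes the root

    # Map every document seen in the pairs to its cluster representative.
    return {x: find(x) for x in parent}

clusters = cluster([(1, 2), (2, 3), (10, 11)])
print(clusters)  # {1: 1, 2: 1, 3: 1, 10: 10, 11: 10}
```

Note that transitive links merge clusters: documents 1 and 3 end up together even though they were never paired directly, which can make the resulting cluster files larger than the raw pair counts suggest.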