datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
- SFT dataset support.
Hi, I started using datatrove for deduplication. While I managed to understand the minhash_deduplication script, I'm having difficulties understanding the outputs of sentence_deduplication.py. All I obtain are 'intermediate', 'sent_dups'...
I’ve been using HuggingFaceDatasetWriter and noticed that it seems to default to uploading to the hub when I intended to save locally only. Could we consider adding a parameter to...
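A sketch of the behavior this request describes: a writer with an explicit flag that controls whether results are uploaded or only written locally. The names here (`LocalOrHubWriter`, `push_to_hub`) are hypothetical illustrations, not datatrove's actual `HuggingFaceDatasetWriter` API.

```python
# Hypothetical writer with an opt-in upload flag; by default it only
# writes a local JSONL file. Illustrative only -- not datatrove's API.
import json
from pathlib import Path

class LocalOrHubWriter:
    def __init__(self, output_dir: str, push_to_hub: bool = False):
        self.output_dir = Path(output_dir)
        self.push_to_hub = push_to_hub

    def write(self, records: list[dict]) -> None:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        path = self.output_dir / "data.jsonl"
        with path.open("w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")
        if self.push_to_hub:
            # Placeholder: a real writer would perform the hub upload here.
            print(f"would upload {path} to the hub")

writer = LocalOrHubWriter("out", push_to_hub=False)
writer.write([{"text": "hello"}])
```

With `push_to_hub=False` as the default, saving locally becomes the unsurprising path and uploading requires an explicit opt-in.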
https://github.com/huggingface/datatrove/blob/734990228d305bdd38c2c3bab4e697d988c9ae68/src/datatrove/pipeline/readers/huggingface.py#L94 How about adding a `Dataset` type parameter? This would handle the case where the dataset is processed at runtime and passed as a `Dataset` object. 😀
Hi everyone, I want to do deduplication, so for now I'm running tests using minhash_deduplication.py. I'm using a server where I need to add account and constraint info, so I...
How about adding a custom word tokenizer class in `utils/word_tokenizers.py`? The reasons are as follows: + I want to use a tokenizer other than the predefined ones in `word_tokenizers.WORD_TOKENIZER_FACTORY`, such as [khaiii](https://github.com/kakao/khaiii). + Some...
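A custom tokenizer along these lines might look like the sketch below. It assumes an interface similar in spirit to datatrove's word tokenizers (a class exposing `word_tokenize` and `sent_tokenize`, registered in a factory dict by language code); the class and factory names here are hypothetical, and the real base class may differ.

```python
# Minimal stand-in for a pluggable word tokenizer. A real implementation
# would call an external tool such as khaiii instead of regexes.
import re
from typing import Callable

class CustomWordTokenizer:
    def word_tokenize(self, text: str) -> list[str]:
        # Split into word-character runs; punctuation is dropped.
        return re.findall(r"\w+", text)

    def sent_tokenize(self, text: str) -> list[str]:
        # Naive sentence split on ., !, ? followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical factory mapping language codes to tokenizer constructors,
# mirroring the idea of WORD_TOKENIZER_FACTORY.
TOKENIZER_FACTORY: dict[str, Callable[[], CustomWordTokenizer]] = {
    "ko": CustomWordTokenizer,
}

tok = TOKENIZER_FACTORY["ko"]()
print(tok.word_tokenize("Hello, world!"))     # ['Hello', 'world']
print(tok.sent_tokenize("One. Two! Three?"))  # ['One.', 'Two!', 'Three?']
```

Registering by language code keeps call sites unchanged: pipeline blocks look up the tokenizer for a language and never need to know which implementation backs it.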
Hi, I have implemented a pipeline to process the Common Crawl (CC) data, similar to the FineWeb example in the example folder. The main issue I'm encountering is that, when...
We do not want to store a cluster_id for the sentinel point, since it is not part of the current data being processed.
I forget where in the docs/code I saw that you should not launch a Slurm executor from an `srun` interactive session - avoiding which is not always possible....
Hi, when I'm running the minhash dedup by index, I find the cluster results produced by MinhashDedupCluster are a bit strange. ``` -rw-r--r-- 1 root root 108K Jul 12 12:40...
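Conceptually, a minhash cluster stage groups documents into connected components from the duplicate pairs found in the bucket stage. The union-find sketch below illustrates that idea only; it is not datatrove's actual MinhashDedupCluster implementation, and the `cluster` function name is made up for this example.

```python
# Union-find sketch: given (doc_a, doc_b) duplicate pairs, group the
# documents into connected components, mapping each document id to the
# smallest id in its cluster.

def cluster(pairs: list[tuple[int, int]]) -> dict[int, int]:
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id becomes the root

    # Map every document seen in the pairs to its cluster representative.
    return {x: find(x) for x in parent}

clusters = cluster([(1, 2), (2, 3), (10, 11)])
print(clusters)  # {1: 1, 2: 1, 3: 1, 10: 10, 11: 10}
```

Note that transitive links merge clusters: documents 1 and 3 end up together even though they were never paired directly, which can make the resulting cluster files larger than the raw pair counts suggest.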