datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Results 69 datatrove issues

When using the fasttext filter, I find that the fasttext model is copied by each process, which introduces significant memory overhead. However, to my knowledge, each fasttext model is read-only...

Problem: We currently use the first x bytes of a SHA-1 digest for hashing, which is a waste of resources: we don't need cryptographic guarantees, and we only take the first x bytes...
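
To make the trade-off in the issue above concrete, here is a minimal sketch. It assumes x = 8 bytes (the snippet does not state a value) and uses 64-bit FNV-1a purely as a stand-in for "some non-cryptographic hash"; the issue does not name a replacement, and both function names are hypothetical rather than part of datatrove.

```python
import hashlib

HASH_BYTES = 8  # assumed value of "x"; the issue snippet does not specify it


def sha1_prefix_hash(text: str, n_bytes: int = HASH_BYTES) -> int:
    """Compute a full 20-byte SHA-1 digest and keep only the first n_bytes."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:n_bytes], "big")  # remaining bytes are discarded


# Standard 64-bit FNV-1a parameters.
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(text: str) -> int:
    """Non-cryptographic 64-bit FNV-1a hash (illustrative stand-in only)."""
    h = FNV64_OFFSET
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep the state at 64 bits
    return h
```

Both functions return a 64-bit integer, so either could feed the same downstream comparison. A pure-Python loop like this one is not faster than `hashlib`; a real replacement would presumably rely on a native implementation of whichever non-cryptographic hash is chosen.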

Hello, I'm currently working on text processing that involves filtering (like gopher) in various languages. Currently, the default word_tokenization in datatrove filters is based on English, as shown in...

Hi @guipenedo, I used your substring dedup script to perform deduplication on a dump of cc and did some manual inspection. I find that some resulting duplicates a bit...

bug

Could we add a new argument to specify whether we want to dedup by index? In some cases, we only want to dedup by itself and construct the index (say...

I'm wondering if it is possible to add support for other popular large-scale data processing frameworks like Spark, since most operations are compatible with the map operation in Spark. This...

enhancement

Ray (https://github.com/ray-project/ray) has become a popular choice for running distributed Python ML applications. Its Python interface makes it easy to scale a workload from a local laptop to a distributed cluster. It would be...

enhancement

The current parallel strategy assigns different files in a directory to different workers. There are many situations where this may cause load imbalance, for example, when the input files are irregular...

enhancement
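
As an illustration of the imbalance described in the issue above, the sketch below assigns files to workers by total size rather than by file count, using a greedy "largest file first, least-loaded worker next" heuristic. It assumes the irregularity is in file size (the snippet is truncated) and that sizes are known up front; `assign_files_by_size` is a hypothetical helper, not a datatrove function.

```python
import heapq


def assign_files_by_size(file_sizes: dict[str, int], n_workers: int) -> list[list[str]]:
    """Greedily give each file, largest first, to the currently least-loaded worker."""
    # Min-heap of (current load in bytes, worker index).
    heap = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignments: list[list[str]] = [[] for _ in range(n_workers)]

    by_size_desc = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size_desc:
        load, worker = heapq.heappop(heap)  # least-loaded worker so far
        assignments[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return assignments


sizes = {"a": 100, "b": 10, "c": 10, "d": 10, "e": 70}
plan = assign_files_by_size(sizes, n_workers=2)
loads = [sum(sizes[name] for name in files) for files in plan]
```

With these example sizes, one worker receives the single large file and the other receives the four smaller ones, so both end up with the same total. The heuristic does not split individual files, so a single oversized file can still dominate one worker.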

Hi, thanks for your efforts in open-sourcing such an awesome library for large-scale data processing. I'm trying to reproduce the RefinedWeb dataset. I find that the pipeline for processing common...

enhancement