datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
When using the fastText filter, I find that the fastText model is copied by each process, which introduces significant memory overhead. However, to my knowledge, each fastText model is read-only...
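One common way to avoid the per-process copy of a read-only model is to load it once in the parent process and rely on copy-on-write `fork` semantics on Linux. The sketch below illustrates the idea with a placeholder object standing in for the fastText model (the `load_model` name and the dict payload are assumptions, not datatrove's or fastText's API):

```python
import multiprocessing as mp

MODEL = None


def load_model():
    # Placeholder for an expensive, read-only model load,
    # e.g. fasttext.load_model(...) (assumption for illustration).
    return {"weights": list(range(1000))}


def worker(doc):
    # Each forked worker reads the shared global instead of reloading;
    # with "fork", the parent's pages are shared copy-on-write.
    return len(MODEL["weights"]) + len(doc)


def main():
    global MODEL
    MODEL = load_model()          # load BEFORE forking the pool
    ctx = mp.get_context("fork")  # POSIX-only start method
    with ctx.Pool(2) as pool:
        return pool.map(worker, ["a", "bb"])


if __name__ == "__main__":
    print(main())
```

Note the caveat: Python's reference counting can still dirty some shared pages over time, so the savings are real but not perfect.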
### Problem

We currently use the SHA-1 of the first x bytes for hashing, which wastes resources:
- we don't need cryptographic guarantees
- we only take the first x bytes...
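Since only a cheap, stable fingerprint is needed, any fast non-cryptographic hash over the byte prefix would do. A minimal sketch using stdlib `crc32` as a stand-in (xxHash would be faster still; neither is necessarily what datatrove would adopt):

```python
import zlib


def quick_fingerprint(data: bytes, n_bytes: int = 64) -> int:
    # Non-cryptographic alternative to sha1(data[:n]): we only need a
    # cheap, deterministic fingerprint of the prefix, not resistance
    # against adversarial collisions.
    return zlib.crc32(data[:n_bytes])


print(quick_fingerprint(b"hello world" * 10))
```

Documents sharing the same first `n_bytes` map to the same fingerprint, exactly as with the current prefix-SHA-1 scheme, just without the cryptographic overhead.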
Hello, I'm currently working on text processing that involves filtering (like Gopher) across various languages. However, the default word_tokenization in datatrove's filters is English-based, as shown in...
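One way to generalize this is a per-language tokenizer registry with a sensible fallback, so filters look up the tokenizer by the document's language code. A minimal sketch (all names here are illustrative, not datatrove's actual API):

```python
import re
from typing import Callable, Dict, List

# Unicode-aware word regex as a generic fallback tokenizer.
WORD_RE = re.compile(r"\w+")


def regex_words(text: str) -> List[str]:
    return WORD_RE.findall(text)


TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "en": regex_words,
    # "zh": a character- or jieba-based tokenizer would go here
    # (assumption: languages without whitespace word boundaries
    # need their own tokenizer).
}


def tokenize(text: str, language: str = "en") -> List[str]:
    # Fall back to the generic regex for unregistered languages.
    return TOKENIZERS.get(language, regex_words)(text)


print(tokenize("Gopher-style filters count words.", "en"))
```

Filters like Gopher that count words per document could then take a `language` parameter instead of assuming English splitting.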
Hi @guipenedo, I used your substring dedup script to perform deduplication on a CC dump and did some manual inspection. I find that some of the resulting duplicates are a bit...
Could we add a new argument to specify whether we want to dedup against the index? In some cases, we only want to dedup the data against itself and construct the index (say...
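The requested switch could look like a small config flag on the dedup stage: when disabled, the run only dedups the input against itself while still emitting an index for later runs. A sketch under those assumptions (names are hypothetical, not datatrove's config):

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class DedupConfig:
    use_index: bool = True    # also dedup against a pre-built index
    write_index: bool = True  # emit this run's keys as a new index


def dedup(docs: Iterable[str], index: Iterable[str],
          cfg: DedupConfig) -> Tuple[List[str], Optional[List[str]]]:
    # Seed `seen` from the external index only when use_index is set;
    # otherwise the data is deduped purely against itself.
    seen = set(index) if cfg.use_index else set()
    kept = []
    for d in docs:
        if d not in seen:
            seen.add(d)
            kept.append(d)
    new_index = sorted(seen) if cfg.write_index else None
    return kept, new_index


print(dedup(["a", "b", "a"], ["b"], DedupConfig(use_index=False)))
```

With `use_index=False` the document "b" survives even though it exists in the external index, which matches the use case described above.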
I'm wondering if it is possible to add support for other popular large-scale data processing frameworks like Spark, since most operations are compatible with Spark's map operation. This...
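The compatibility argument can be made concrete: blocks that consume and yield an iterator of documents map naturally onto Spark's `rdd.mapPartitions`. The sketch below chains blocks over plain Python iterables to show the shape, with the Spark call only indicated in a comment (no Spark dependency; the suitability claim is the issue author's, not verified here):

```python
from typing import Callable, Iterable, List

# A "block" takes an iterator of documents and yields documents,
# mirroring the shape of a datatrove-style pipeline step.
Block = Callable[[Iterable[str]], Iterable[str]]


def lowercase(docs: Iterable[str]) -> Iterable[str]:
    for d in docs:
        yield d.lower()


def drop_short(docs: Iterable[str]) -> Iterable[str]:
    for d in docs:
        if len(d) >= 3:
            yield d


def run_pipeline(partition: Iterable[str], blocks: List[Block]) -> List[str]:
    # Chain blocks lazily over one partition of documents.
    for block in blocks:
        partition = block(partition)
    return list(partition)
    # On Spark this would be roughly:
    # rdd.mapPartitions(lambda p: run_pipeline(p, blocks))


print(run_pipeline(["Hello", "Hi", "WORLD"], [lowercase, drop_short]))
```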
Ray (https://github.com/ray-project/ray) has become a popular choice for running distributed Python ML applications. Its Python interface makes it easy to scale a workload from a local laptop to a distributed cluster. It would be...
The current parallel strategy assigns different files in a directory to different workers. There are many situations where this can cause load imbalance, for example when the input files are irregular...
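A simple size-aware alternative to per-file round-robin is greedy longest-processing-time scheduling: sort files by size descending and always hand the next file to the currently lightest worker. A sketch of such a helper (hypothetical, not datatrove's executor API):

```python
import heapq
from typing import Dict, List, Tuple


def assign_files(files: Dict[str, int], n_workers: int) -> List[List[str]]:
    # Min-heap of (current load, worker id); the lightest worker is
    # always at the top.
    heap: List[Tuple[int, int]] = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(n_workers)]
    # Largest files first (LPT greedy bin packing).
    for name, size in sorted(files.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(name)
        heapq.heappush(heap, (load + size, w))
    return assignment


print(assign_files({"a": 100, "b": 60, "c": 50, "d": 10}, 2))
```

This needs file sizes up front (one extra stat pass over the input), but keeps worker loads within a small factor of optimal even when file sizes are highly skewed.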
Hi, thanks for your efforts in open-sourcing such an awesome library for large-scale data processing. I'm trying to reproduce the RefinedWeb dataset. I find that the pipeline for processing common...