datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
When using the fastText filter, I find that the fastText model is copied by each process, which introduces significant memory overhead. However, to my knowledge, each fastText model is read-only...
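One common way to avoid the per-process copy of a read-only model is to load it once in the parent process and rely on copy-on-write `fork` semantics on Linux. The sketch below illustrates the idea with a placeholder object standing in for the fastText model (the `load_model` name and the dict payload are assumptions, not datatrove's or fastText's API):

```python
import multiprocessing as mp

MODEL = None


def load_model():
    # Placeholder for an expensive, read-only model load,
    # e.g. fasttext.load_model(...) (assumption for illustration).
    return {"weights": list(range(1000))}


def worker(doc):
    # Each forked worker reads the shared global instead of reloading;
    # with "fork", the parent's pages are shared copy-on-write.
    return len(MODEL["weights"]) + len(doc)


def main():
    global MODEL
    MODEL = load_model()          # load BEFORE forking the pool
    ctx = mp.get_context("fork")  # POSIX-only start method
    with ctx.Pool(2) as pool:
        return pool.map(worker, ["a", "bb"])


if __name__ == "__main__":
    print(main())
```

Note the caveat: Python's reference counting can still dirty some shared pages over time, so the savings are real but not perfect.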
### Problem

We currently use the SHA-1 of the first x bytes for hashing, which wastes resources:
- we don't need cryptographic guarantees
- we only take the first x bytes...
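Since only a cheap, stable fingerprint is needed, any fast non-cryptographic hash over the byte prefix would do. A minimal sketch using stdlib `crc32` as a stand-in (xxHash would be faster still; neither is necessarily what datatrove would adopt):

```python
import zlib


def quick_fingerprint(data: bytes, n_bytes: int = 64) -> int:
    # Non-cryptographic alternative to sha1(data[:n]): we only need a
    # cheap, deterministic fingerprint of the prefix, not resistance
    # against adversarial collisions.
    return zlib.crc32(data[:n_bytes])


print(quick_fingerprint(b"hello world" * 10))
```

Documents sharing the same first `n_bytes` map to the same fingerprint, exactly as with the current prefix-SHA-1 scheme, just without the cryptographic overhead.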
Hello, I'm currently working on text processing that involves filtering (like Gopher) across various languages. However, the default word_tokenization in datatrove's filters is English-based, as shown in...
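One way to generalize this is a per-language tokenizer registry with a sensible fallback, so filters look up the tokenizer by the document's language code. A minimal sketch (all names here are illustrative, not datatrove's actual API):

```python
import re
from typing import Callable, Dict, List

# Unicode-aware word regex as a generic fallback tokenizer.
WORD_RE = re.compile(r"\w+")


def regex_words(text: str) -> List[str]:
    return WORD_RE.findall(text)


TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "en": regex_words,
    # "zh": a character- or jieba-based tokenizer would go here
    # (assumption: languages without whitespace word boundaries
    # need their own tokenizer).
}


def tokenize(text: str, language: str = "en") -> List[str]:
    # Fall back to the generic regex for unregistered languages.
    return TOKENIZERS.get(language, regex_words)(text)


print(tokenize("Gopher-style filters count words.", "en"))
```

Filters like Gopher that count words per document could then take a `language` parameter instead of assuming English splitting.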
Hi @guipenedo, I used your substring dedup script to perform deduplication on a CC dump and did some manual inspection. I find that some of the resulting duplicates are a bit...
Could we add a new argument to specify whether we want to dedup against the index? In some cases, we only want to dedup the data against itself and construct the index (say...
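The requested switch could look like a small config flag on the dedup stage: when disabled, the run only dedups the input against itself while still emitting an index for later runs. A sketch under those assumptions (names are hypothetical, not datatrove's config):

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class DedupConfig:
    use_index: bool = True    # also dedup against a pre-built index
    write_index: bool = True  # emit this run's keys as a new index


def dedup(docs: Iterable[str], index: Iterable[str],
          cfg: DedupConfig) -> Tuple[List[str], Optional[List[str]]]:
    # Seed `seen` from the external index only when use_index is set;
    # otherwise the data is deduped purely against itself.
    seen = set(index) if cfg.use_index else set()
    kept = []
    for d in docs:
        if d not in seen:
            seen.add(d)
            kept.append(d)
    new_index = sorted(seen) if cfg.write_index else None
    return kept, new_index


print(dedup(["a", "b", "a"], ["b"], DedupConfig(use_index=False)))
```

With `use_index=False` the document "b" survives even though it exists in the external index, which matches the use case described above.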
I'm wondering if it is possible to add support for other popular large-scale data processing frameworks like Spark, since most operations are compatible with Spark's map operation. This...
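The compatibility argument can be made concrete: blocks that consume and yield an iterator of documents map naturally onto Spark's `rdd.mapPartitions`. The sketch below chains blocks over plain Python iterables to show the shape, with the Spark call only indicated in a comment (no Spark dependency; the suitability claim is the issue author's, not verified here):

```python
from typing import Callable, Iterable, List

# A "block" takes an iterator of documents and yields documents,
# mirroring the shape of a datatrove-style pipeline step.
Block = Callable[[Iterable[str]], Iterable[str]]


def lowercase(docs: Iterable[str]) -> Iterable[str]:
    for d in docs:
        yield d.lower()


def drop_short(docs: Iterable[str]) -> Iterable[str]:
    for d in docs:
        if len(d) >= 3:
            yield d


def run_pipeline(partition: Iterable[str], blocks: List[Block]) -> List[str]:
    # Chain blocks lazily over one partition of documents.
    for block in blocks:
        partition = block(partition)
    return list(partition)
    # On Spark this would be roughly:
    # rdd.mapPartitions(lambda p: run_pipeline(p, blocks))


print(run_pipeline(["Hello", "Hi", "WORLD"], [lowercase, drop_short]))
```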
Ray (https://github.com/ray-project/ray) has become a popular choice for running distributed Python ML applications. Its Python interface makes it easy to scale a workload from a local laptop to a distributed cluster. It would be...
The current parallel strategy assigns different files in a directory to different workers. There are many situations where this can cause load imbalance, for example when the input files are irregular...
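A simple size-aware alternative to per-file round-robin is greedy longest-processing-time scheduling: sort files by size descending and always hand the next file to the currently lightest worker. A sketch of such a helper (hypothetical, not datatrove's executor API):

```python
import heapq
from typing import Dict, List, Tuple


def assign_files(files: Dict[str, int], n_workers: int) -> List[List[str]]:
    # Min-heap of (current load, worker id); the lightest worker is
    # always at the top.
    heap: List[Tuple[int, int]] = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(n_workers)]
    # Largest files first (LPT greedy bin packing).
    for name, size in sorted(files.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(name)
        heapq.heappush(heap, (load + size, w))
    return assignment


print(assign_files({"a": 100, "b": 60, "c": 50, "d": 10}, 2))
```

This needs file sizes up front (one extra stat pass over the input), but keeps worker loads within a small factor of optimal even when file sizes are highly skewed.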
Hi, thanks for your efforts in open-sourcing such an awesome library for large-scale data processing. I'm trying to reproduce the RefinedWeb dataset. I find that the pipeline for processing common...