datatrove issues

Summary stats

### This PR adds blocks for computing summary statistics - Added a base block for creating the summary statistics - Added a merging block for merging summary statistics computed in a distributed manner....

hynky1999

URL dedup of two datasets

1

Hello, how can I use URL dedup to dedup two datasets at the URL level? Basically, I want to know which documents in dataset A are not...

basma-b
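The question in "URL dedup of two datasets" amounts to a set difference on URLs. A minimal sketch in plain Python, assuming each document is a dict with a `url` key; the function names and the normalization rule are hypothetical and this is not datatrove's URL dedup block:

```python
from typing import Dict, Iterable, Iterator


def normalize_url(url: str) -> str:
    # Assumed normalization: trim whitespace and drop a trailing slash.
    return url.strip().rstrip("/")


def docs_only_in_a(
    dataset_a: Iterable[Dict[str, str]],
    dataset_b: Iterable[Dict[str, str]],
) -> Iterator[Dict[str, str]]:
    # Collect every normalized URL seen in dataset B.
    urls_in_b = {normalize_url(doc["url"]) for doc in dataset_b}
    # Yield the documents of A whose URL never appears in B.
    for doc in dataset_a:
        if normalize_url(doc["url"]) not in urls_in_b:
            yield doc
```

This keeps all URLs of dataset B in memory, so it only fits datasets whose URL set is small enough for one machine.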

Migrate word tokenizer download functions to process locked download

## Problem #187 introduced new tokenizer libraries, which often need to download several files to work. This can, however, introduce problems, as the downloads are not interlocked. ##...

hynky1999
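To illustrate what an interlocked download could look like, here is a standard-library sketch that uses an atomically created lock file. The `locked_download` helper and its parameters are hypothetical and do not reflect the approach taken in the PR:

```python
import os
import time
from pathlib import Path
from typing import Callable


def locked_download(
    target: Path,
    download: Callable[[Path], None],
    poll_s: float = 0.1,
) -> Path:
    """Run download(target) at most once, even if several processes race."""
    lock_path = target.with_name(target.name + ".lock")
    while not target.exists():
        try:
            # O_EXCL makes creation atomic: only one process gets the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Another process holds the lock; wait, then re-check the target.
            time.sleep(poll_s)
            continue
        try:
            # Re-check under the lock in case another process just finished.
            if not target.exists():
                download(target)
        finally:
            os.close(fd)
            os.remove(lock_path)
    return target
```

The sketch has known limits: a process killed while holding the lock leaves a stale lock file behind, and a partially written `target` is treated as complete. Writing to a temporary name and renaming it at the end would address the second point.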

Fastwarc reader

1

Can we add a new WARC reader using [fastwarc](https://resiliparse.chatnoir.eu/en/latest/man/fastwarc.html)? It is said to be much more [efficient](https://arxiv.org/abs/2112.03103) than warcio.

jordane95

Speedup json writer

2

This pull request replaces the `json` standard library with the much faster `orjson` library, speeding up JsonlWriter by about 5x on my machine, and about 2x for JsonlReader, roughly...

its5Q

Dependency resolving issue installing from source

4

Hi, I have an issue installing this library from source on Windows. I've cloned the repo, created a Python 3.11 venv and ran `pip install -e ".[dev]"`, but that basically...

its5Q

Periodical logging of stats

2

For long-running tasks it would be useful to emit statistics every once in a while, say every 60s. It's frustrating to have to wait for the pipeline to...

rantav
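The request in "Periodical logging of stats" can be sketched as a counter that emits a snapshot whenever a time interval has elapsed. The `PeriodicStatsLogger` class below is a hypothetical illustration, not part of datatrove, and it only checks the clock when a stat is updated:

```python
import time
from collections import Counter
from typing import Callable


class PeriodicStatsLogger:
    def __init__(
        self,
        interval_s: float = 60.0,
        emit: Callable[[str], None] = print,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval_s = interval_s
        self.emit = emit
        self.clock = clock
        self.counts: Counter = Counter()
        self._last_emit = clock()

    def update(self, name: str, value: int = 1) -> None:
        # Accumulate the stat, then emit a snapshot if the interval has passed.
        self.counts[name] += value
        now = self.clock()
        if now - self._last_emit >= self.interval_s:
            snapshot = ", ".join(f"{k}={v}" for k, v in sorted(self.counts.items()))
            self.emit(snapshot)
            self._last_emit = now
```

Because emission is driven by `update`, a stage that stalls for a long time emits nothing; a background timer thread would be needed to cover that case.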

OpensearchWriter

2

Similar to #165, I've implemented an `OpensearchWriter` in my project (which might be compatible with Elastic). Would you like me to contribute it back to this project? Where in the...

rantav

PineconeWriter

I've implemented a `PineconeWriter` in my project. Would you like me to contribute it back to this project? If so, where in the source tree would it reside? (perhaps create...

rantav

Support int32 in substring dedup

4

I'm using a tokenizer with a vocab size > 100k, so the token IDs overflow because they are stored as uint16. I'm wondering if we can add support for int32? Is...

jordane95
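The overflow described in "Support int32 in substring dedup" comes from uint16 holding values only up to 65535. A small standard-library sketch of choosing the token width from the vocabulary size follows; the helper names are hypothetical, and it uses an unsigned 32-bit format rather than the signed int32 named in the issue:

```python
import struct


def bytes_per_token(vocab_size: int) -> int:
    # The largest token id is vocab_size - 1; uint16 holds ids up to 65535.
    return 2 if vocab_size <= 2**16 else 4


def pack_tokens(token_ids, vocab_size: int) -> bytes:
    # "H" is unsigned 16-bit, "I" is unsigned 32-bit; "<" means little-endian.
    fmt = "H" if bytes_per_token(vocab_size) == 2 else "I"
    return struct.pack(f"<{len(token_ids)}{fmt}", *token_ids)
```

With a 100k vocabulary, `pack_tokens` uses 4 bytes per token, whereas `struct.pack("<H", 99_999)` raises `struct.error` because the id does not fit in 16 bits.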
