datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
### This PR adds blocks for computing summary statistics - Added a base block for creating the summary statistics - Added a merging block for merging summary statistics computed in a distributed manner....
Hello, how can I use URL dedup to dedup two datasets at the URL level? Basically I want to know which documents in dataset A are not...
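The question above amounts to a set difference keyed on URL. A minimal sketch in plain Python, assuming each document is a dict with a `url` field; `normalize_url` and `docs_only_in_a` are hypothetical helpers for illustration, not part of datatrove's API, and the trailing-slash normalization is an assumption about what counts as the same URL:

```python
from typing import Dict, Iterable, Iterator


def normalize_url(url: str) -> str:
    # Hypothetical normalization: trim whitespace and a trailing slash.
    # Real pipelines may need stricter rules (scheme, query strings, etc.).
    return url.strip().rstrip("/")


def docs_only_in_a(
    docs_a: Iterable[Dict[str, str]], docs_b: Iterable[Dict[str, str]]
) -> Iterator[Dict[str, str]]:
    # Collect every URL seen in dataset B, then stream dataset A and keep
    # only the documents whose URL never appeared in B.
    urls_in_b = {normalize_url(doc["url"]) for doc in docs_b}
    for doc in docs_a:
        if normalize_url(doc["url"]) not in urls_in_b:
            yield doc


dataset_a = [{"url": "https://example.com/1"}, {"url": "https://example.com/2/"}]
dataset_b = [{"url": "https://example.com/2"}]
only_in_a = list(docs_only_in_a(dataset_a, dataset_b))
```

This holds all of dataset B's URLs in memory, so it only suits cases where that set fits on one machine; larger datasets would need a hashed or sorted, distributed approach.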
## Problem #187 introduced new tokenizer libraries, which often need to download several files to work. This can, however, introduce problems, as the downloads are not interlocked. ##...
Can we add a new WARC reader using [fastwarc](https://resiliparse.chatnoir.eu/en/latest/man/fastwarc.html)? It is said to be much more [efficient](https://arxiv.org/abs/2112.03103) than warcio.
This pull request replaces the `json` standard library with the much faster `orjson` library, speeding up JsonlWriter by about 5x on my machine, and about 2x for JsonlReader, roughly...
Hi, I have an issue installing this library from source on Windows. I've cloned the repo, created a Python 3.11 venv and ran `pip install -e ".[dev]"`, but that basically...
For long running tasks it would be useful to emit statistics every once in a while, let's say every 60s. It's frustrating to have to wait for the pipeline to...
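One way the periodic emission requested above could work is a small time-gated reporter that the processing loop calls on every iteration. This is a sketch only: `PeriodicStatsEmitter` and its parameters are hypothetical names, not an existing datatrove class, and the injectable `clock` is there purely to make the behavior testable.

```python
import time
from typing import Callable, Dict


class PeriodicStatsEmitter:
    """Emit a stats snapshot at most once per `interval_s` seconds."""

    def __init__(
        self,
        interval_s: float = 60.0,
        emit: Callable[[Dict[str, int]], None] = print,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval_s = interval_s
        self.emit = emit
        self.clock = clock
        self._last_emit = clock()

    def maybe_emit(self, stats: Dict[str, int]) -> bool:
        # Cheap enough to call once per document: it only emits when the
        # interval has elapsed since the previous emission.
        now = self.clock()
        if now - self._last_emit < self.interval_s:
            return False
        self.emit(dict(stats))  # copy so later mutation does not alter the snapshot
        self._last_emit = now
        return True


# Usage with a fake clock so the example does not actually wait 60 seconds.
fake_time = [0.0]
snapshots = []
emitter = PeriodicStatsEmitter(60.0, emit=snapshots.append, clock=lambda: fake_time[0])
first = emitter.maybe_emit({"documents": 1})
fake_time[0] = 61.0
second = emitter.maybe_emit({"documents": 5})
```

Using a monotonic clock rather than wall-clock time keeps the interval stable if the system time is adjusted during a long run.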
Similar to #165, I've implemented an `OpensearchWriter` in my project (which might be compatible with Elastic). Would you like me to contribute it back to this project? Where in the...
I've implemented a `PineconeWriter` in my project. Would you like me to contribute it back to this project? Where in the source tree should it reside? (perhaps create...
I'm using a tokenizer with > 100k vocab size, so the token IDs overflow, as they are stored in uint16. I'm wondering if we can add support for int32? Is...
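The overflow arises because uint16 can only hold values up to 65535. A minimal sketch of choosing the storage width from the vocabulary size, using only the standard library's `struct` module; `token_type_code` and `pack_tokens` are hypothetical helpers, and the choice of unsigned 32-bit (rather than the int32 mentioned in the issue) is an assumption made here because token IDs are non-negative:

```python
import struct
from typing import List


def token_type_code(vocab_size: int) -> str:
    # "H" is an unsigned 16-bit integer (IDs 0..65535), "I" an unsigned 32-bit one.
    return "H" if vocab_size <= 2**16 else "I"


def pack_tokens(token_ids: List[int], vocab_size: int) -> bytes:
    # Little-endian, fixed-width encoding; struct raises struct.error if an
    # ID does not fit the chosen width instead of silently wrapping around.
    code = token_type_code(vocab_size)
    return struct.pack(f"<{len(token_ids)}{code}", *token_ids)


def unpack_tokens(data: bytes, vocab_size: int) -> List[int]:
    code = token_type_code(vocab_size)
    count = len(data) // struct.calcsize(f"<{code}")
    return list(struct.unpack(f"<{count}{code}", data))


small = pack_tokens([1, 2], vocab_size=50_000)       # 2 bytes per token
large = pack_tokens([1, 100_000], vocab_size=128_000)  # 4 bytes per token
roundtrip = unpack_tokens(large, vocab_size=128_000)
```

The reader must know which width was used when the file was written, so in practice the width (or the vocabulary size) would need to be recorded alongside the data.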