datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
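The typical usage pattern is to compose these blocks into a pipeline and hand it to an executor. A minimal sketch (the folder paths, length threshold, and task count below are illustrative, not part of the project description):

```python
# Minimal local pipeline: read JSONL documents, keep the longer ones, write them back out.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline = [
    JsonlReader("data/input/"),                     # illustrative input folder
    LambdaFilter(lambda doc: len(doc.text) > 100),  # illustrative length threshold
    JsonlWriter("data/output/"),                    # illustrative output folder
]

LocalPipelineExecutor(pipeline=pipeline, tasks=4).run()
```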
Support for running the core row-level components with Apache Beam could be extremely beneficial, as: * Apache Beam is quite widely used in the community and has a...
I am using 4xH100 GPUs, 100 CPU cores, and 1000 RAM to filter 1TB of Japanese data. Although the GPU is at 50% utilization and the CPU is running at 100%, only 3MB of...
Hi, could you add an example to show how to use the decontamination pipeline? Thanks
Added the expand_metadata option to JsonlWriter, available in HuggingfaceWriter and ParquetWriter. This enables consistent metadata handling across different writer types.
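A hedged sketch of how the option might be used, assuming `expand_metadata` is exposed as a constructor argument on each writer (the output folders below are placeholders):

```python
from datatrove.pipeline.writers import JsonlWriter, ParquetWriter

# With expand_metadata enabled, each metadata key is written as a top-level
# field/column instead of being nested under a single "metadata" entry.
jsonl_writer = JsonlWriter("output/jsonl/", expand_metadata=True)
parquet_writer = ParquetWriter("output/parquet/", expand_metadata=True)
```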
Hi everyone, I've recently started using Datatrove for one of my personal projects and have been going through the documentation to understand it better. However, I'm having trouble understanding what...
Hi, after running `tokenize_from_hf_to_s3.py`, I would like to inspect the resulting data, but I find that it is stored in a binary file (`.ds`). Is there a way to...
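One way to peek at the token ids without extra tooling is to read the `.ds` file directly. This is only a sketch, assuming the default layout where the file is a flat sequence of uint16 token ids (use uint32 for larger vocabularies) and that you know which tokenizer was used; the shard path and tokenizer name below are placeholders:

```python
import numpy as np
from transformers import AutoTokenizer

tokens = np.fromfile("tokenized/000_shard.ds", dtype=np.uint16)  # hypothetical shard path
tokenizer = AutoTokenizer.from_pretrained("gpt2")                # tokenizer used during tokenization
print(tokenizer.decode(tokens[:200].tolist()))                   # decode the first 200 tokens
```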
**Description:** When running a **SlurmPipelineExecutor** pipeline on my HPC cluster, I encounter dependency issues that result in a failed execution. The problem arises during the stats collection step after a...
Support for zstd compression in both JSONL and Parquet file formats. Parquet files:
- The implementation applies compression directly within the internal write function (`pq.ParquetWriter`) using the compression option.
- ...
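A hedged usage sketch, assuming the new compression option is exposed as a constructor argument on both writers (the exact accepted string, e.g. "zstd" vs "zst", and the output folders are assumptions here):

```python
from datatrove.pipeline.writers import JsonlWriter, ParquetWriter

jsonl_writer = JsonlWriter("output/jsonl/", compression="zstd")        # zstd-compressed .jsonl output
parquet_writer = ParquetWriter("output/parquet/", compression="zstd")  # passed through to pq.ParquetWriter
```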
Hello Datatrove enthusiasts, nice to meet you all. Recently I've been working with the Datatrove library, and I'm trying to run a sample script, `process_common_crawl_dump.py`, from the following link: [Datatrove...
TL;DR: the primary pain point here is huge row groups (in terms of total uncompressed byte size); writing the PageIndex or reducing row group sizes, perhaps both, would help...
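For reference, both mitigations can be expressed directly with pyarrow: `write_page_index=True` on `pq.ParquetWriter` emits the column/offset indexes, and `row_group_size` on `write_table` caps the rows per row group. A sketch with illustrative data and sizes:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["example document"] * 100_000})

with pq.ParquetWriter(
    "output.parquet",
    table.schema,
    compression="zstd",
    write_page_index=True,   # write the PageIndex so readers can skip pages
) as writer:
    writer.write_table(table, row_group_size=10_000)  # cap rows per row group
```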