datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
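The typical usage pattern is to compose these blocks into a pipeline and hand it to an executor. A minimal sketch (the folder paths, length threshold, and task count below are illustrative, not part of the project description):

```python
# Minimal local pipeline: read JSONL documents, keep the longer ones, write them back out.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline = [
    JsonlReader("data/input/"),                     # illustrative input folder
    LambdaFilter(lambda doc: len(doc.text) > 100),  # illustrative length threshold
    JsonlWriter("data/output/"),                    # illustrative output folder
]

LocalPipelineExecutor(pipeline=pipeline, tasks=4).run()
```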
Support for running the core row-level components with Apache Beam could be extremely beneficial, as: * Apache Beam is quite widely used in the community and has a...
I am using 4xH100 GPUs, 100 CPU cores, and 1000 RAM to filter 1TB of Japanese data. Although the GPU is at 50% utilization and the CPU is running at 100%, only 3MB of...
Hi, could you add an example to show how to use the decontamination pipeline? Thanks
Added the expand_metadata option to JsonlWriter, available in HuggingfaceWriter and ParquetWriter. This enables consistent metadata handling across different writer types.
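A hedged sketch of how the option might be used, assuming `expand_metadata` is exposed as a constructor argument on each writer (the output folders below are placeholders):

```python
from datatrove.pipeline.writers import JsonlWriter, ParquetWriter

# With expand_metadata enabled, each metadata key is written as a top-level
# field/column instead of being nested under a single "metadata" entry.
jsonl_writer = JsonlWriter("output/jsonl/", expand_metadata=True)
parquet_writer = ParquetWriter("output/parquet/", expand_metadata=True)
```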
Hi everyone, I've recently started using Datatrove for one of my personal projects and have been going through the documentation to understand it better. However, I'm having trouble understanding what...
Hi, after running `tokenize_from_hf_to_s3.py`, I would like to inspect the resulting data, but I find that it is stored in a binary file (`.ds`). Is there a way to...
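One way to peek at the token ids without extra tooling is to read the `.ds` file directly. This is only a sketch, assuming the default layout where the file is a flat sequence of uint16 token ids (use uint32 for larger vocabularies) and that you know which tokenizer was used; the shard path and tokenizer name below are placeholders:

```python
import numpy as np
from transformers import AutoTokenizer

tokens = np.fromfile("tokenized/000_shard.ds", dtype=np.uint16)  # hypothetical shard path
tokenizer = AutoTokenizer.from_pretrained("gpt2")                # tokenizer used during tokenization
print(tokenizer.decode(tokens[:200].tolist()))                   # decode the first 200 tokens
```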
**Description:** When running a **SlurmPipelineExecutor** pipeline on my HPC cluster, I encounter dependency issues that result in a failed execution. The problem arises during the stats collection step after a...
Support for zstd compression in both JSONL and Parquet file formats. Parquet files:
- The implementation applies compression directly within the internal write function (`pq.ParquetWriter`) using the compression option.
- ...
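A hedged usage sketch, assuming the new compression option is exposed as a constructor argument on both writers (the exact accepted string, e.g. "zstd" vs "zst", and the output folders are assumptions here):

```python
from datatrove.pipeline.writers import JsonlWriter, ParquetWriter

jsonl_writer = JsonlWriter("output/jsonl/", compression="zstd")        # zstd-compressed .jsonl output
parquet_writer = ParquetWriter("output/parquet/", compression="zstd")  # passed through to pq.ParquetWriter
```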
Hello Datatrove enthusiasts, nice to meet you all. Recently I've been working with the Datatrove library, and I'm trying to run a sample script, `process_common_crawl_dump.py`, from the following link: [Datatrove...
TL;DR: the primary pain point here is huge row groups (in terms of total uncompressed byte size); writing the PageIndex or reducing row group sizes, perhaps both, would help...
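For reference, both mitigations can be expressed directly with pyarrow: `write_page_index=True` on `pq.ParquetWriter` emits the column/offset indexes, and `row_group_size` on `write_table` caps the rows per row group. A sketch with illustrative data and sizes:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["example document"] * 100_000})

with pq.ParquetWriter(
    "output.parquet",
    table.schema,
    compression="zstd",
    write_page_index=True,   # write the PageIndex so readers can skip pages
) as writer:
    writer.write_table(table, row_group_size=10_000)  # cap rows per row group
```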