RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
One more question, please: using the provided command, how long does it take to finish each step (e.g., quality filtering, deduplication, quality classifier) when processing a single index of Common Crawl (e.g., 2023-06...
Hi, thank you in advance. I am facing the following error while using the same command for processing Common Crawl as in the README. `python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l...
Hello there. I want to get the zh (Chinese) data of one dump. How much disk space will be occupied during data download and processing, and what is the final data size?
Thank you very much for your work in providing such rich data to the open-source community. I was wondering if there are any plans for releases in other languages,...
I am working on ingesting the RPV2 dataset onto GCS buckets using GCP storage transfer jobs. Speeds seem to be incredibly slow (on the order of 100KB/s - 1MB/s), and...
Is the 1T version basically V1? If so, is the HF version of V1 (1T) already deduplicated and ready to be used?
When trying to work with these data via Dataflow, I noticed a few things: the ID field key is inconsistent between files; it is `id` in minhash and signals,...
When running the 'run_prep_artifacts.sh' script for 'es' there is an error when getting the Wikipedia dataset. Hugging Face does not have a prebuilt dataset for Spanish, and when line 53 fails...
Remove flags in the code snippet in step 2
```
bash scripts/apptainer_run_quality_signals.sh \
  --config configs/rp_v2.0.conf \
  --dump_id "2022-49" \
  --input_base_uri "file:///path/to/data/root" \
  --output_base_uri "file:///path/to/outout/data/root" \
  --max_docs -1
```
Invalid option: ---input_base_uri
Usage: apptainer_run_quality_signals.sh [ -c | --config ] [...