RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
One more question, please: using the provided command, how long does it take to finish each step (e.g., quality filtering, deduplication, quality classifier) when processing a single index of Common Crawl (e.g., 2023-06...
Hi, thank you in advance. I am facing the following error while using the same command for processing Common Crawl as in the README. `python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l...
Hello there. I want to get the zh (Chinese) data of one dump. How much disk space will be occupied during data download and processing, and what is the final data size?
Thank you very much for your work in providing such rich data to the open-source community. I was wondering if there are any plans for releases in other languages,...
I am working on ingesting the RPV2 dataset onto GCS buckets using GCP storage transfer jobs. Speeds seem to be incredibly slow (on the order of 100KB/s - 1MB/s), and...
Is the 1T version basically V1? If so, is the HF version of V1 (1T) already deduplicated and ready to be used?
When trying to work with these data via Dataflow, I noticed a few things: the ID field key is inconsistent between files; it is `id` in minhash and signals,...
When running the 'run_prep_artifacts.sh' script for 'es' there is an error when getting the Wikipedia dataset. Hugging Face does not have a prebuilt dataset for Spanish, and when line 53 fails...
Remove flags in the code snippet in step 2
```
bash scripts/apptainer_run_quality_signals.sh \
  --config configs/rp_v2.0.conf \
  --dump_id "2022-49" \
  --input_base_uri "file:///path/to/data/root" \
  --output_base_uri "file:///path/to/outout/data/root" \
  --max_docs -1
```
Invalid option: ---input_base_uri
Usage: apptainer_run_quality_signals.sh [ -c | --config ] [...