RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
I think there are two main problems in the current `clean_copyright_comments` function: https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L27. First, it cannot remove the copyright notice in the following C-style code because of the early return in...
Hi, I'm trying to run this test case: `python3 -m cc_net --config config/test_segment.json` but encountered the following error: ``` Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('test_data2'), mined_dir='mined_by_segment',...
If the program terminates due to a power outage while I am running the cc-net data preparation pipeline, how can it resume from where it stopped when restarted?
Hi there! I looked through the corpora and found that sometimes they are not 100% downloaded. I'm not sure if the issue is with the downloading scripts. Below are some examples...
Thank you for your work! I am preprocessing for another language (zh). I have some questions regarding the provided instructions: > In `extracted_urls.txt`, we provide 38M URLs that are processed from...
Hello there, thanks for your good work. I want to download a small portion of CC (to run through the whole process first). When I run the code 'python -m cc_net...
1. Install sentencepiece from the GitHub repo; I cannot run the .zip version on macOS.
2. Make some necessary directories during make.
3. Cache the wiki json.gz if has...
The common crawl data entries have a source like this: `"source":"cc/2023-06/en_head_0000.json.gz/line401859"` What's the right way to map that back to metadata where the entry came from? In particular I'd like...
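As a starting point for that mapping, the `source` string can be split into its visible parts. The following is only a minimal sketch: the names `CCSource` and `parse_cc_source` are hypothetical and not part of the repository, and the reading of the components (a snapshot ID, a shard file name, and a line index) is an assumption based on the two example strings shown on this page. Whether the line index is 0-based or 1-based is not stated here, so the sketch returns it unchanged.

```python
import re
from typing import NamedTuple


class CCSource(NamedTuple):
    """Hypothetical container for the parts of a `source` value."""
    snapshot: str      # e.g. "2023-06" (assumed to be the Common Crawl snapshot ID)
    shard_file: str    # e.g. "en_head_0000.json.gz"
    line_number: int   # index after "line"; 0- or 1-based is not confirmed


# Pattern inferred from the example strings only; other layouts may exist.
_SOURCE_RE = re.compile(
    r"^cc/(?P<snapshot>\d{4}-\d{2})/(?P<shard>[^/]+\.json\.gz)/line(?P<line>\d+)$"
)


def parse_cc_source(source: str) -> CCSource:
    """Split a `source` string into snapshot, shard file, and line index."""
    match = _SOURCE_RE.match(source)
    if match is None:
        raise ValueError(f"unrecognized source format: {source!r}")
    return CCSource(match["snapshot"], match["shard"], int(match["line"]))
```

For example, `parse_cc_source("cc/2023-06/en_head_0000.json.gz/line401859")` yields the snapshot `2023-06`, the shard file `en_head_0000.json.gz`, and the line index `401859`. Mapping those back to the original crawl metadata would still require the corresponding cc_net output shard, which this sketch does not attempt.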
## Current State

Currently, the data in the commoncrawl slice contains the following fields in addition to the `text` field:

```
"pred_label": "__label__cc",
"pred_label_prob": XXX,
"wiki_prob": XXX,
"source": "cc/2019-30/en_middle_0053.json.gz/line1"
```
...
In this PR, we can run `python -m cc_net --config config/test_segment.json` successfully in the following directory. data_prep/cc/cc_net/cc_net depends on #36