
Questions about the free-law data used in the paper "Adapt LLM to domains"

WUHU-G opened this issue 1 year ago • 2 comments

Dear authors, you have undoubtedly done excellent work (domain-specific continued pre-training). I have a small question about the size of the FreeLaw data used in the paper. I downloaded the data from https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train, but it appears much smaller than the 35 GB (16B tokens) reported in Table 7 of the paper: after processing with the LLaMA tokenizer, it amounts to only 1.4B tokens. Could you tell me whether you used the data from this link or from a different one?

WUHU-G commented Feb 11 '24 03:02

Hi,

The raw data size mentioned in Table 7 in our paper (51.2 GiB) is copied from Table 1 in the Pile paper: https://arxiv.org/pdf/2101.00027.pdf.

I downloaded the data from Hugging Face: https://huggingface.co/datasets/EleutherAI/pile, which should be the same as yours. However, when looking at your link:

https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train

I noticed the term "partial" in the link. Could it be that you only downloaded a partial subset of the dataset 😂?

I downloaded the data using the following Python code. Perhaps you can try this to download the full dataset:

from datasets import load_dataset

free_law_data = load_dataset('EleutherAI/pile', 'free_law')
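
For reference, here is a minimal sketch of how one could count LLaMA tokens over the downloaded split to check the size. The tokenizer checkpoint ('meta-llama/Llama-2-7b-hf'), the use of streaming mode, and the 'text' field name are assumptions for illustration, not necessarily the exact setup used for the paper:

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: any LLaMA-family tokenizer gives comparable counts; swap in the checkpoint you actually use.
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

# Stream the split so the full dataset does not need to be materialized on disk at once.
free_law = load_dataset('EleutherAI/pile', 'free_law', split='train', streaming=True)

total_tokens = 0
for example in free_law:
    # Each Pile record stores the document under the 'text' field.
    total_tokens += len(tokenizer(example['text'], add_special_tokens=False)['input_ids'])

print(f'FreeLaw train split: {total_tokens:,} LLaMA tokens')

If the count comes out far below ~16B, that would suggest the download was indeed only a partial copy of the subset.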

cdxeve commented Feb 11 '24 05:02


Thank you very much for your reply. I'll try again with your method.

WUHU-G commented Feb 11 '24 05:02