
Processing the dataset

Esmail-ibraheem opened this issue 1 year ago · 3 comments

I have a 96 GB Arabic dataset that I want to use for pretraining with LitGPT. However, the image provided [link to the image] mentions that if the dataset is large, we should use LitData. But when I checked the LitData README, I found no clear instructions on how to do this.

[image attachment: big_data]

Here is the dataset I want to use: https://huggingface.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset

Thank you.

Esmail-ibraheem · Jul 03 '24

Good point. Does the LitData section here help? https://github.com/Lightning-AI/litdata?tab=readme-ov-file#1-prepare-your-data
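
For reference, the gist of that section is a single `optimize()` call; here is a minimal sketch adapted for text (names, paths, and sizes are illustrative, not taken from the LitData docs):

```python
# Minimal sketch of LitData's "prepare your data" step: optimize()
# calls `fn` once per input and writes the results as chunked files
# that can later be streamed during training.
from litdata import optimize

def process(index):
    # Return (or yield) one sample per input; any picklable object works.
    # Real pipelines usually take a file path and return its contents.
    return {"text": f"sample {index}"}

if __name__ == "__main__":
    optimize(
        fn=process,
        inputs=list(range(1000)),      # any list, e.g. file paths
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",            # target size of each chunk file
    )
```

The resulting directory can then be loaded with `litdata.StreamingDataset` at training time.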

rasbt · Jul 05 '24

No, the LitData README did not make it clear to me how I can convert or process my custom dataset so I can use it in LitGPT.

Esmail-ibraheem · Jul 06 '24

Personally, I use the TextFiles approach that I've implemented in LitGPT. But going back to your earlier comment (and the phrase in the docs), my colleagues don't recommend it for very large datasets, since it starts from plain text files (rather than tokenized text), and plain text can be inefficient to store.
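
For context, the TextFiles route just needs a directory of plain `.txt` files. A rough sketch (paths and model name are placeholders; check `litgpt pretrain --help` for the exact flags in your version):

```python
# Sketch: LitGPT's TextFiles data module reads a directory of plain
# .txt files and tokenizes them as part of pretraining setup.
from pathlib import Path
from litgpt.data import TextFiles

data = TextFiles(train_data_path=Path("custom_texts/"))  # placeholder path
# Roughly equivalent CLI (flag names may differ between versions):
#   litgpt pretrain pythia-160m --data TextFiles --data.train_data_path custom_texts/
```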

Personally, I don't have much experience with LitData, but if I ever prepare a large custom dataset, I'll amend the docs. In the meantime, the best way is perhaps to look at how it's done in prepare_slimpajama.py and prepare_starcoder.py in https://github.com/Lightning-AI/litgpt/tree/main/litgpt/data, which are used in the Pretrain TinyLlama tutorial. Thomas Chaton, the developer of LitData, also has a tutorial on the dataset prep here, which could be helpful.
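
To give a rough idea of the pattern those scripts follow, here is an untested sketch; the file layout, tokenizer checkpoint, and chunk size are assumptions you'd need to adapt (the dataset above ships as parquet, so you may want to read it with `datasets`/`pyarrow` rather than plain text files):

```python
# Sketch of the prepare_*.py pattern: tokenize raw text shards into
# LitData's streaming format with optimize().
from functools import partial
from pathlib import Path

from litdata import optimize
from litgpt.tokenizer import Tokenizer

def tokenize_fn(filepath, tokenizer):
    # Assumes one document per line; adjust the parsing to your files.
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            if text:
                yield tokenizer.encode(text, eos=True)  # 1-D tensor of token ids

if __name__ == "__main__":
    files = sorted(Path("raw_arabic_data").glob("*.txt"))  # hypothetical layout
    optimize(
        fn=partial(tokenize_fn, tokenizer=Tokenizer(Path("checkpoints/my-model"))),
        inputs=files,
        output_dir="data/arabic-optimized",
        chunk_size=2049 * 8012,  # ~64 MB of tokens, per the LitData README example
    )
```

The output directory can then be streamed during pretraining with `litdata.StreamingDataset`, or via LitGPT's corresponding LitData data module.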

rasbt · Jul 06 '24