
Train/val split

Open DavidHerel opened this issue 1 year ago • 0 comments

Hi,

I want to ask how one can split a dataset into train/val splits. In tinystories.py, I don't quite understand this comment:

train/test split. let's use only shard 0 for test split, rest train

So how many tokens from the training data are selected for the validation split?

It seems that @karpathy uses 10 shards. If only shard 0 is used as the test split, does that mean 1/10 of the data is used as the test set? E.g., if I have a dataset with 10B tokens, are 1B tokens used for the test/val set?
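
To make the question concrete, here is a minimal sketch of how I read that comment. It is not the code from tinystories.py: the helper names `split_shards` and `val_fraction`, the `data00.bin`-style file names, and the assumption of 10 equally sized shards are all hypothetical and only illustrate the arithmetic.

```python
def split_shards(shard_filenames):
    """Shard 0 (first after sorting) goes to test/val, the rest to train."""
    shards = sorted(shard_filenames)
    return shards[1:], shards[:1]


def val_fraction(tokens_per_shard):
    """Fraction of all tokens that land in the val split (shard 0 only)."""
    return tokens_per_shard[0] / sum(tokens_per_shard)


# Hypothetical setup: 10 shards of 1B tokens each, 10B tokens in total.
shard_names = [f"data{i:02d}.bin" for i in range(10)]
tokens = [1_000_000_000] * 10

train_shards, val_shards = split_shards(shard_names)
print(len(train_shards), len(val_shards))  # 9 1
print(val_fraction(tokens))                # 0.1
```

If this reading is right, the val fraction is 1/N for N equal shards, and it would differ from 1/10 whenever the shard count or the shard sizes differ from my assumption.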

DavidHerel, Feb 06 '24 11:02