
Newbie Q: does the training data (train_ids) have to be consecutive? Can I inject -1 as the integer marker id into train_ids?

Open mw66 opened this issue 9 months ago • 3 comments

Hi,

I'm thinking about adding a special END OF TEXT token to my data (to separate different articles), e.g.:

https://github.com/karpathy/nanoGPT/issues/244

I checked here:

https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/data/shakespeare_char/prepare.py#L51

and I'm wondering if the training data (train_ids) has to be consecutive?

E.g. can I use np.iinfo(np.uint16).max as my special marker token, to avoid conflicts with any future dataset as much as possible?

Thanks.

mw66, Sep 18 '23 23:09
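
To make the question concrete, here is a minimal sketch (not taken from the linked prepare.py) that compares the two candidate markers on a toy corpus. The `articles` list, the `encode` helper, and the names `max_marker` and `next_marker` are hypothetical and only for illustration. It assumes, as in typical GPT implementations, that token ids are used as row indices into an embedding table, so the table needs at least `max id + 1` rows regardless of whether the ids in between are used.

```python
import numpy as np

# Hypothetical toy corpus: two "articles" to be separated by a marker.
articles = ["hello", "world"]

# Character-level vocabulary with consecutive ids 0..len(chars)-1.
chars = sorted(set("".join(articles)))
stoi = {ch: i for i, ch in enumerate(chars)}

# Option A: the largest value uint16 can hold (a non-consecutive id).
max_marker = int(np.iinfo(np.uint16).max)

# -1 is not representable in uint16; casting wraps it to the same value.
wrapped = int(np.array([-1], dtype=np.int64).astype(np.uint16)[0])

# Option B: the next free id, which keeps the ids consecutive.
next_marker = len(chars)


def encode(marker):
    """Encode every article and append the marker id after each one."""
    ids = []
    for text in articles:
        ids.extend(stoi[ch] for ch in text)
        ids.append(marker)
    return np.array(ids, dtype=np.uint16)


ids_a = encode(max_marker)
ids_b = encode(next_marker)

# Assumption: ids index an embedding table, so it needs max id + 1 rows.
min_vocab_a = int(ids_a.max()) + 1
min_vocab_b = int(ids_b.max()) + 1

print("uint16 max marker:", max_marker, "| -1 wrapped:", wrapped)
print("rows needed with max marker:", min_vocab_a)
print("rows needed with next free id:", min_vocab_b)
```

Under that assumption, both markers can be stored in a uint16 array, but the max-value marker forces a much larger vocab_size than the next free id, with most embedding rows never trained.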