nanoGPT
Newbie Q: does the training data (train_ids) have to be consecutive? Can I inject -1 as the integer marker id into train_ids?
Hi,
I'm thinking about adding a special END OF TEXT token to my data (to separate different articles), e.g.:
https://github.com/karpathy/nanoGPT/issues/244
I checked here:
https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/data/shakespeare_char/prepare.py#L51
and I'm wondering whether the ids in the training data (train_ids) have to be consecutive.
E.g. can I use np.iinfo(np.uint16).max as my special marker token, to avoid conflicts with any future dataset as much as possible?
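To make the question concrete, here is a minimal sketch of the two marker choices. The toy corpus, `encode_with_marker`, and the `EOT_*` names are hypothetical and only for illustration; they are not taken from prepare.py. It assumes a character-level vocabulary with ids 0..len(chars)-1 and that the ids are stored as uint16.

```python
import numpy as np

# Hypothetical toy corpus: two "articles" that should be separated by a marker.
articles = ["ab", "ba"]
chars = sorted(set("".join(articles)))
stoi = {ch: i for i, ch in enumerate(chars)}  # consecutive ids 0..len(chars)-1

# Option A: the largest uint16 value, far away from the real character ids.
EOT_MAX = int(np.iinfo(np.uint16).max)
# Option B: the next unused consecutive id.
EOT_NEXT = len(chars)


def encode_with_marker(docs, marker):
    """Encode each document and append the marker id after it."""
    ids = []
    for doc in docs:
        ids.extend(stoi[c] for c in doc)
        ids.append(marker)
    return np.array(ids, dtype=np.uint16)


ids_max = encode_with_marker(articles, EOT_MAX)
ids_next = encode_with_marker(articles, EOT_NEXT)

# Casting -1 to uint16 wraps around to the same value as Option A.
wrapped = int(np.array([-1], dtype=np.int64).astype(np.uint16)[0])

# A lookup table indexed by token id needs at least max_id + 1 rows.
rows_needed_max = int(ids_max.max()) + 1
rows_needed_next = int(ids_next.max()) + 1

print(ids_max, ids_next, wrapped, rows_needed_max, rows_needed_next)
```

So in a uint16 array, -1 and np.iinfo(np.uint16).max end up as the same id. The ids do not need to be consecutive to be stored, but any table indexed by token id would have to cover the largest id, which is the cost of Option A compared with Option B.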
Thanks.