nanoGPT

My own tokenizer

Open spcrobocar opened this issue 1 year ago • 1 comments

I am working on using nanoGPT to solve a geometry problem. I would like to use the gpt2 network structure but with my own tokenizer. My vocabulary size is 1500, and I have my own encode/decode code to convert my data into a uint16 array. I am currently using the config/train_gpt2.py configuration file. When I started training, I saw it print something like "Defaulting to vocab_size of GPT2 to 50000". I do not need such a large vocabulary size. How can I change the config file to use my own tokenizer and vocabulary?

spcrobocar avatar Jan 16 '24 01:01 spcrobocar

I believe nanoGPT supports a meta.pkl (meta pickle) file for encodings; you could train a tokenizer with SentencePiece and store its vocabulary there.
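As a rough sketch of that approach: nanoGPT's train.py looks for a meta.pkl in the dataset directory and, if found, uses its vocab_size instead of the GPT-2 default. The prepare script below is a hypothetical example, not your actual pipeline; encode stands in for your own tokenizer, and the filenames train.bin / meta.pkl follow the convention of nanoGPT's bundled prepare.py scripts.

```python
# Hypothetical prepare script for nanoGPT with a custom tokenizer.
# Assumes your tokenizer produces ids < 1500, which fit in uint16.
import pickle
import numpy as np

vocab_size = 1500  # your custom vocabulary size

def encode(text):
    # Placeholder for your own tokenizer's encode function;
    # replace with your real geometry-problem encoder.
    return [ord(c) % vocab_size for c in text]

# Encode the training data and write it as a uint16 binary file,
# the format nanoGPT's data loader memory-maps at train time.
train_ids = np.array(encode("example training text"), dtype=np.uint16)
train_ids.tofile("train.bin")

# Write meta.pkl so train.py picks up vocab_size=1500 instead of
# defaulting to the GPT-2 vocabulary size.
meta = {"vocab_size": vocab_size}
with open("meta.pkl", "wb") as f:
    pickle.dump(meta, f)
```

With this in your dataset directory, you should see training initialize the model with a 1500-entry embedding table rather than the GPT-2 default.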

VatsaDev avatar Jan 22 '24 21:01 VatsaDev