nanoGPT
load generated text locally
I recently started working with nanoGPT (about a week ago) and so far I am very satisfied with the results. However, I would really like to save all of the generated text to a local file, because the results I get are "cut off" mid-sentence. I would like to do some statistics and would love to generate the same amount of data as my training dataset. Does anybody know how?
In the Colab notebook there is this line of code:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
which generates x tokens depending on the value you pass as max_new_tokens. What I would like is to generate the "max" number of tokens and have my sentences complete rather than cut off.
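For the file part, I assume I can just write the decoded output to a file instead of printing it, with the same m, decode and device as above (samples.txt is just a placeholder name):
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# samples.txt is an arbitrary output path
with open('samples.txt', 'w', encoding='utf-8') as f:
    f.write(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
The cut-off sentences are the real problem.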
Two tricks are required to achieve this:
- Train your model to emit <|endoftext|> when it's done speaking; this is a special token at the very end of the GPT-2 vocabulary (id=50256).
make sure in prepare.py you encode with special tokens allowed:
# enc is the tiktoken GPT-2 encoding already created in prepare.py
train_ids = enc.encode(train_data, allowed_special="all")
val_ids = enc.encode(val_data, allowed_special="all")
- Update generate in model.py to use idx_next == 50256 as a break condition (see the sketch below).
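A minimal sketch of that break condition, assuming the sampling loop in nanoGPT's model.py generate() (idx_next is the freshly sampled token; .item() assumes a single sequence, as in the sampling example):
# at the bottom of the sampling loop in generate()
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
# stop as soon as the model emits <|endoftext|> (GPT-2 token id 50256)
if idx_next.item() == 50256:
    break
idx = torch.cat((idx, idx_next), dim=1)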
@houda-w
While what @the-crypt-keeper said is correct, the issue of incomplete sentences can also have a different cause. GPT-2 is an autocomplete model; it will just keep going. It will finish a sentence, then write half of another sentence just to fill its token budget.
I had the same problem with my repo NanoChatGPT, where the model would generate the bot response, then continue writing a whole human response too, just to fill its limit. What I did was consistently put my own <endOfText> token everywhere, and then use a partition to strip any text after it, and the results were pretty good.
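Something along these lines, assuming the decoded sample lives in a string called text (str.partition keeps everything before the first occurrence of the separator):
# keep only the text before the first <endOfText> marker
text = text.partition('<endOfText>')[0]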