nanoGPT
load generated text locally
I recently started working with nanoGPT (about a week ago) and so far I am very satisfied with the results. However, I would really like to save all of the generated text to a local file, because the results I get are "cut off" mid-sentence. I would like to do some statistics and would love to generate the same amount of data as my training dataset. Does anybody know how?
In the Colab notebook there is this line of code:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
which generates x tokens depending on the value you pass as max_new_tokens. What I would like is to generate the "max" number of tokens and have my sentences complete rather than cut off.
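For the file part, I assume I can just write the decoded output to a file instead of printing it, with the same m, decode and device as above (samples.txt is just a placeholder name):
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# samples.txt is an arbitrary output path
with open('samples.txt', 'w', encoding='utf-8') as f:
    f.write(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
The cut-off sentences are the real problem.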
Two tricks are required to achieve this:
- Train your model to emit <|endoftext|> when it's done speaking; this is a special token at the very end of the GPT-2 vocabulary (id=50256).
make sure in prepare.py you encode with special tokens allowed:
# enc is the tiktoken GPT-2 encoding already created in prepare.py
train_ids = enc.encode(train_data, allowed_special="all")
val_ids = enc.encode(val_data, allowed_special="all")
- Update generate in model.py to use idx_next == 50256 as a break condition (see the sketch below).
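A minimal sketch of that break condition, assuming the sampling loop in nanoGPT's model.py generate() (idx_next is the freshly sampled token; .item() assumes a single sequence, as in the sampling example):
# at the bottom of the sampling loop in generate()
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
# stop as soon as the model emits <|endoftext|> (GPT-2 token id 50256)
if idx_next.item() == 50256:
    break
idx = torch.cat((idx, idx_next), dim=1)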
@houda-w
While what @the-crypt-keeper said is correct, the issue of incomplete sentences can also have a different cause. GPT-2 is an autocomplete model; it will just keep going. It will finish a sentence, then write half of another sentence just to fill its token budget.
I had the same problem with my repo NanoChatGPT, where the model would generate the bot response, then continue writing a whole human response too, just to fill its limit. What I did was consistently put my own <endOfText> token everywhere, and then use a partition to strip any text after it, and the results were pretty good.
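Something along these lines, assuming the decoded sample lives in a string called text (str.partition keeps everything before the first occurrence of the separator):
# keep only the text before the first <endOfText> marker
text = text.partition('<endOfText>')[0]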