aitextgen
generate line_by_line spits out startoftext and endoftext tokens into the output text directly
The existing documentation says that `line_by_line` exists to process a single-column CSV, treating each line as a separate entry.
So I used an input file that was simply a text file with one sentence per line, which matches the format of a single-column CSV:
some sample text example 1
some sample text example 2
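My understanding of what line-by-line encoding does conceptually (a sketch for illustration, not the library's actual code; the token strings are the GPT-2 conventions aitextgen appears to use):

```python
# Each training line gets wrapped in special start/end tokens before
# tokenization, so the model learns where short entries begin and end.
BOS, EOS = "<|startoftext|>", "<|endoftext|>"

def wrap_lines(lines):
    """Wrap each training line in start/end tokens."""
    return [f"{BOS}{line.strip()}{EOS}" for line in lines]

print(wrap_lines(["some sample text example 1"]))
# → ['<|startoftext|>some sample text example 1<|endoftext|>']
```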
However, when I ran `generate` after fine-tuning (using the Colab notebook), I got outputs that looked like:
<|startoftext|>some sample text example 1<|endoftext|>
I was under the impression that those special start and end tokens were handled automatically. Was I using the library wrong in some way? All the training and generation were done in the provided fine-tuning Colab notebook.
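As a stopgap until this is fixed, the leaked tokens can be stripped with simple post-processing (a sketch; the token strings are assumed to match the GPT-2 defaults shown in the output above):

```python
SPECIAL_TOKENS = ("<|startoftext|>", "<|endoftext|>")

def strip_special_tokens(text):
    """Remove start/end markers that leak into generated output."""
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return text.strip()

print(strip_special_tokens("<|startoftext|>some sample text example 1<|endoftext|>"))
# → some sample text example 1
```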
Yes, something changed in either transformers or tokenizers. Will fix for 0.4.0
Can confirm something is messy with how transformers builds tokenizers which is leading to a discrepancy. This may be less easy than anticipated.
![Screen Shot 2021-02-15 at 8 54 00 PM](https://user-images.githubusercontent.com/2179708/108019923-16b87480-6fd0-11eb-86e7-b9c66da97bf2.png)
This is only the case for the fast tokenizers: slow tokenizers work fine. May fall back to that for the release.
That's acceptable performance-wise, since decoding speed matters much less than encoding speed (which can still use fast tokenizers, as it isn't affected by this issue).
...although current implementations like the notebook use the quick trick of passing the tokenizer to TokenDataset for encoding. That may need a bit of refactoring to work around this issue.
Alternatively, filter out the eos/bos tokens at the PyTorch tensor level before decoding. That may be easier, and could have other perks too.
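That tensor-level approach could look roughly like this (a sketch using plain lists; the token ids are hypothetical placeholders, and real code would filter the PyTorch output tensor, looking up the actual ids via the tokenizer's `bos_token_id` / `eos_token_id` attributes in transformers):

```python
# Hypothetical special-token ids for illustration only.
BOS_ID, EOS_ID = 50257, 50256

def filter_special_ids(token_ids):
    """Drop bos/eos ids from a generated id sequence before decoding."""
    return [i for i in token_ids if i not in (BOS_ID, EOS_ID)]

print(filter_special_ids([50257, 11, 22, 50256]))
# → [11, 22]
```

Note that transformers' `decode` also accepts `skip_special_tokens=True`, which does this at decode time; filtering the ids beforehand just sidesteps the decode path entirely.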
Looking for guidance: I'm trying to train GPT-2 for short-text generation via aitextgen using a CSV and failing (long-form content generation works fine). I'm still getting long text outputs with special tokens embedded. I'm looking for behavior similar to how model="minimaxir/hacker-news" functions, but trained using aitextgen (not gpt-2-simple). Is this the same issue as above?
@channelz probably, yes.
No; the issue is different from gpt-2-simple (which uses post-processing, and that's what the solution here will require). With gpt-2-simple you need to use the truncate param.
0.4.0 is out, which should have fixed this particular weirdness. Let me know if the issue persists.