aitextgen generate line_by_line spits out startoftext and endoftext tokens into the output text directly

Open redthing1 opened this issue 3 years ago • 8 comments

The existing documentation says that line_by_line exists to process a single-column CSV, treating each line as a single entry.

So I used an input file which was simply a text file with a sentence on each line, which corresponds to the format of a single column csv:

some sample text example 1
some sample text example 2

However, when I ran generate after finetuning (using the Colab notebook), I got outputs that looked like:

<|startoftext|>some sample text example 1<|endoftext|>

I was under the impression that those special start and stop tokens were automatically handled. Was I using the library wrong in some way? All the training and the outputs were made in the provided fine-tuning Colab notebook.

Feb 06 '21 07:02 redthing1

Yes, something changed in either transformers or tokenizers. Will fix for 0.4.0

Feb 11 '21 03:02 minimaxir

Can confirm something is messy with how transformers builds tokenizers which is leading to a discrepancy. This may be less easy than anticipated.

Feb 16 '21 04:02 minimaxir

This is only the case for the fast tokenizers: slow tokenizers work fine. May fall back to that for the release.

That's acceptable for performance, since decoding speed matters much less than encoding speed (which can still use the fast tokenizers, as encoding is not affected by this issue).

...although current implementations like the Notebook use the quick trick to pass the tokenizer to TokenDataset for encoding. That may need to be refactored a bit to work around this issue.

Feb 16 '21 05:02 minimaxir

Or, alternatively, filter out the eos/bos tokens at the PyTorch tensor level before decoding. That may be easier, and have other perks too.

Feb 16 '21 05:02 minimaxir
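A minimal sketch of the filtering idea from the comment above, added for illustration and not taken from aitextgen's source. It works on a plain list of token ids (as one would get from a tensor's `tolist()`) so that it runs without dependencies. The ids `BOS_ID` and `EOS_ID` and the helper name `strip_special_ids` are hypothetical; the real ids should be read from the tokenizer in use.

```python
from typing import Iterable, List

# Hypothetical token ids, chosen only for illustration. Look up the real
# values on your tokenizer (e.g. its bos_token_id / eos_token_id attributes).
BOS_ID = 50257
EOS_ID = 50256


def strip_special_ids(token_ids: Iterable[int],
                      special_ids=(BOS_ID, EOS_ID)) -> List[int]:
    """Return token_ids with every id in special_ids removed, order preserved."""
    special = set(special_ids)
    return [t for t in token_ids if t not in special]


# A generated sequence as it might come back from the model,
# already converted from a tensor to a Python list.
generated = [BOS_ID, 11246, 6291, 2420, EOS_ID]
cleaned = strip_special_ids(generated)
# `cleaned` now holds only ordinary token ids and can be passed to decode.
```

Because the special ids are dropped before decoding, the result does not depend on how the fast or slow tokenizer renders those tokens as text.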

Looking for guidance. I'm trying to train GPT-2 on short texts and generate from it via aitextgen using a CSV, and failing (longform content generation works fine). I'm still getting a long text answer with special tokens embedded. I'm looking for behavior similar to how model="minimaxir/hacker-news" functions, but trained using aitextgen (not simple-gpt2). Is this the same issue as above?

Feb 19 '21 22:02 channelz

@channelz probably, yes.

Feb 20 '21 04:02 redthing1

No; the issue is different than gpt-2-simple (that uses post-processing, which is what the solution here will require). With gpt-2-simple you need to use the truncate params.

Feb 21 '21 19:02 minimaxir
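For readers who want a stopgap, the post-processing mentioned above can be sketched as plain string cleanup on the generated text. This is an illustrative assumption, not gpt-2-simple's truncate implementation and not the fix shipped in aitextgen; the helper name `clean_generated` is hypothetical, and the token strings are the ones shown in the original report.

```python
START = "<|startoftext|>"
END = "<|endoftext|>"


def clean_generated(text: str) -> str:
    """Keep only the first generated entry and strip the special tokens."""
    # Drop everything from the first end token onward, so any extra
    # entries the model kept generating are discarded.
    text = text.split(END, 1)[0]
    # Remove any start tokens that remain in the kept portion.
    text = text.replace(START, "")
    return text.strip()


raw = "<|startoftext|>some sample text example 1<|endoftext|>"
cleaned_text = clean_generated(raw)
```

Cutting at the first end token also addresses the "long text answer" symptom described earlier, since anything generated after the first entry is thrown away.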

0.4.0 is out, which should have fixed this particular weirdness. Let me know if the issue persists.

Feb 23 '21 04:02 minimaxir
