gpt-2

confused about vocab.bpe and encoder.json

Open weiguowilliam opened this issue 5 years ago • 2 comments

I'm reading the source code and have two questions about the vocab and the encoder. Please help me with them. Thank you in advance.

  1. In vocab.bpe, take the second row (Ġ t) as an example. "Ġ" also appears in many other rows (the third row, for example). Why isn't it a one-to-one correspondence?
  2. Are the items in encoder.json the subtokens produced by BPE? Take "\u0120regress" as an example. Why does "\u0120" appear there?

weiguowilliam avatar Sep 23 '19 19:09 weiguowilliam

@weiguowilliam see @80.
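
For context on both questions, as far as I can tell from `encoder.py`: GPT-2's BPE works on UTF-8 bytes, and each byte is first mapped to a printable Unicode character so the vocabulary files contain no raw whitespace or control characters. Under that mapping the space byte (0x20) becomes "Ġ" (U+0120). That would mean vocab.bpe is a list of merge rules rather than a symbol table, so "Ġ" shows up in every rule whose left side begins a word that follows a space, and `\u0120regress` in encoder.json is just the JSON escape of "Ġregress", i.e. the token " regress" with its leading space. The sketch below re-creates the mapping to check this. It is my own reconstruction, not a copy of the repository code, and `decode_token` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (space, control characters, ...) is moved
    # to the first unused code point at or above 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_token(token):
    """Hypothetical helper: turn an encoder.json key back into plain text."""
    return bytes(CHAR_TO_BYTE[ch] for ch in token).decode("utf-8")


# The space byte maps to U+0120, which is the "Ġ" seen in vocab.bpe.
print(repr(BYTE_TO_CHAR[ord(" ")]))
# An encoder.json key such as "\u0120regress" is " regress" in plain text.
print(repr(decode_token("\u0120regress")))
```

Note that `decode_token` only works on tokens that form complete UTF-8 sequences; a token that holds part of a multi-byte character would need to be joined with its neighbours before decoding.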

samsucik avatar Nov 03 '19 17:11 samsucik

How can I generate a custom vocab.bpe and encoder.json for a different language and make them work?

kishorekumar1505028 avatar Jun 02 '20 06:06 kishorekumar1505028
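
On generating custom files: a rough sketch of byte-level BPE training is below, in case it helps. It rests on several assumptions that should be checked against `encoder.py`: text is split on whitespace with the leading space kept on each following word (a simplification of the regex GPT-2 uses), ties between equally frequent pairs go to the pair seen first, and token ids are assigned as 256 byte symbols, then one token per merge, then `<|endoftext|>`. `train_bpe` and `build_encoder` are hypothetical names, not functions from the repository.

```python
import collections


def bytes_to_unicode():
    """Printable stand-in character for each of the 256 byte values."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values, code_points, shift = printable[:], printable[:], 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def train_bpe(texts, num_merges):
    """Learn up to num_merges merge rules from an iterable of strings."""
    byte_map = bytes_to_unicode()
    word_freqs = collections.Counter()
    for text in texts:
        for i, word in enumerate(text.split()):
            # Assumption: a leading space stays attached to the word.
            piece = word if i == 0 else " " + word
            word_freqs[tuple(byte_map[b] for b in piece.encode("utf-8"))] += 1

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = collections.Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_freqs = collections.Counter()
        for symbols, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_freqs[tuple(merged)] += freq
        word_freqs = new_freqs
    return merges


def build_encoder(merges):
    """Assign ids: byte symbols, then merged tokens, then the end marker."""
    tokens = list(bytes_to_unicode().values())
    tokens += [left + right for left, right in merges]
    tokens.append("<|endoftext|>")
    return {token: index for index, token in enumerate(tokens)}


merges = train_bpe(["the the the"], num_merges=3)
encoder = build_encoder(merges)
print(merges)
```

To produce the two files, each pair in `merges` would be written as one line (`" ".join(pair)`) after a version header line in vocab.bpe, and `encoder` would be dumped with `json.dump` into encoder.json. The released checkpoints are tied to the original vocabulary, so a new vocabulary only works with a model trained on it.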