kgt5
Tokenizer questions
In the paper you explicitly mention that you trained a BPE tokenizer for your experiments.
However, in dataset.py the code uses T5TokenizerFast, which is based on Unigram.
Moreover, the code uses a pretrained tokenizer rather than one trained from scratch.
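For reference, the pattern I am referring to looks roughly like this (an illustrative sketch, not the exact code from dataset.py; the checkpoint name and example string are just placeholders):

```python
# Illustrative sketch only - not the exact code from dataset.py.
# T5TokenizerFast wraps a pretrained SentencePiece Unigram model.
from transformers import T5TokenizerFast

# "t5-small" is an assumed checkpoint name here.
tokenizer = T5TokenizerFast.from_pretrained("t5-small")

# Arbitrary example text, just to show the Unigram subword splits.
print(tokenizer.tokenize("some entity mention | some relation"))
```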

Could you please clarify which tokenizer configuration was used in your experiments, so that they can be reproduced?
Could you also specify the vocabulary sizes for WN18RR, FB15k-237, and YAGO3-10? There is no information about these datasets in the paper.
Hi @screemix, thanks for your interest!
- The code in the main branch is old and is not the one used for the final results; please see the code in the apoorv-dump branch for that. There, we used a custom tokenizer trained with the SentencePiece library using the BPE model (see the sketch after this list).
- Unfortunately, we do not have the vocab sizes for those datasets (we did not keep a record, and the servers on which training was done are no longer accessible to me). However, my best guess is that the vocab size for WN18RR and FB15k-237 was around 10k tokens (a larger number of tokens threw some kind of BPE issue), and around 25k-30k for YAGO3-10.
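For reproducibility, training such a tokenizer looks roughly like the following (a minimal sketch, assuming a plain-text corpus of verbalized entity/relation mentions; the file names and trainer options are not the original ones):

```python
# Minimal sketch: training a BPE tokenizer with SentencePiece.
# File names and corpus format are assumptions, not the original setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="kg_text.txt",      # assumed: one entity/relation mention per line
    model_prefix="kg_bpe",    # writes kg_bpe.model and kg_bpe.vocab
    model_type="bpe",
    vocab_size=10000,         # ~10k for WN18RR / FB15k-237; try ~25k-30k for YAGO3-10
)

# Load the trained model and inspect the subword splits.
sp = spm.SentencePieceProcessor(model_file="kg_bpe.model")
print(sp.encode("some entity mention", out_type=str))
```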