kgt5
Tokenizer questions
In the paper you explicitly mention that you trained a BPE tokenizer for your experiments.
However, in dataset.py the code uses T5TokenizerFast, which is based on Unigram.
Moreover, the code uses a pretrained tokenizer rather than one trained from scratch.
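For reference, the pattern I am referring to looks roughly like this (an illustrative sketch, not the exact code from dataset.py; the checkpoint name and example string are just placeholders):

```python
# Illustrative sketch only - not the exact code from dataset.py.
# T5TokenizerFast wraps a pretrained SentencePiece Unigram model.
from transformers import T5TokenizerFast

# "t5-small" is an assumed checkpoint name here.
tokenizer = T5TokenizerFast.from_pretrained("t5-small")

# Arbitrary example text, just to show the Unigram subword splits.
print(tokenizer.tokenize("some entity mention | some relation"))
```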

Could you please clarify which tokenizer configuration was used in your experiments, so that they can be reproduced?
Could you also specify the vocabulary sizes for WN18RR, FB15k-237, and YAGO3-10? There is no information about these datasets in the paper.
Hi @screemix, thanks for your interest!
- The code in the main branch is old and is not the one used for the final results; please see the code in the apoorv-dump branch for that. There, we used a custom tokenizer trained with the SentencePiece library using the BPE model (see the sketch after this list).
- Unfortunately, we do not have the vocab sizes for those datasets (we did not keep a record, and the servers on which training was done are no longer accessible to me). However, my best guess is that the vocab size for WN18RR and FB15k-237 was around 10k tokens (a larger number of tokens threw some kind of BPE issue), and around 25k-30k for YAGO3-10.
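For reproducibility, training such a tokenizer looks roughly like the following (a minimal sketch, assuming a plain-text corpus of verbalized entity/relation mentions; the file names and trainer options are not the original ones):

```python
# Minimal sketch: training a BPE tokenizer with SentencePiece.
# File names and corpus format are assumptions, not the original setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="kg_text.txt",      # assumed: one entity/relation mention per line
    model_prefix="kg_bpe",    # writes kg_bpe.model and kg_bpe.vocab
    model_type="bpe",
    vocab_size=10000,         # ~10k for WN18RR / FB15k-237; try ~25k-30k for YAGO3-10
)

# Load the trained model and inspect the subword splits.
sp = spm.SentencePieceProcessor(model_file="kg_bpe.model")
print(sp.encode("some entity mention", out_type=str))
```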