[How to create vocab.txt file]
Can you explain how to build a vocab.txt file?
It's the same procedure you would use to create a vocab file for BERT. For example, you could use the excellent Tokenizers library from Hugging Face for that.
For my Turkish BERT model I documented the vocab generation steps, see it here.
After that, you can use this vocab for training a new model :)
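For anyone who wants a starting point, here is a minimal sketch of building a BERT-style vocab.txt with the Tokenizers library. It is not necessarily the exact procedure from the Turkish BERT write-up: the helper name `build_vocab`, the file names, the vocabulary size, and the casing/accent settings are all assumptions, so adjust them to your corpus.

```python
import os
import tempfile

from tokenizers import BertWordPieceTokenizer  # pip install tokenizers


def build_vocab(corpus_path, out_dir, vocab_size=32000, min_frequency=2):
    """Train a WordPiece vocab on a one-sentence-per-line text file.

    Hypothetical helper; all settings below are illustrative assumptions.
    Returns the path of the written vocab.txt.
    """
    tokenizer = BertWordPieceTokenizer(
        clean_text=True,
        handle_chinese_chars=False,
        strip_accents=False,  # assumption: keep accents (useful for e.g. Turkish)
        lowercase=False,      # assumption: cased vocabulary
    )
    tokenizer.train(
        files=[corpus_path],
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    os.makedirs(out_dir, exist_ok=True)
    tokenizer.save_model(out_dir)  # writes vocab.txt into out_dir
    return os.path.join(out_dir, "vocab.txt")


if __name__ == "__main__":
    # Tiny throwaway corpus, only to show the call; use your real corpus instead.
    with tempfile.TemporaryDirectory() as tmp:
        corpus = os.path.join(tmp, "corpus.txt")
        with open(corpus, "w", encoding="utf-8") as f:
            f.write("This is the first sentence.\nThis is the second sentence.\n")
        print(build_vocab(corpus, os.path.join(tmp, "out"), min_frequency=1))
```

The resulting vocab.txt has one token per line, which is the format BERT-style pretraining scripts usually expect.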
Thanks for your response. Can you give me the format of your "tr_final" file, or a small sample of it?
I used a sentence-segmented training corpus. That means, each line of the input file contains one sentence :)
Is one sentence per line enough on its own? I am asking because there is an option called blanks-separate-docs: "Whether blank lines indicate document boundaries (True by default)."
This sounds like you need blank lines to separate documents.
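To make the two layouts concrete, here is a small sketch of how a reader could interpret such a file: one sentence per line, with a blank line closing a document when the option is on. The function `split_documents` is a hypothetical illustration written for this thread, not the actual loader from the pretraining code, so treat the behaviour with the option turned off as an assumption.

```python
from typing import Iterable, List


def split_documents(lines: Iterable[str],
                    blanks_separate_docs: bool = True) -> List[List[str]]:
    """Group one-sentence-per-line text into documents (lists of sentences)."""
    docs: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line ends the current document only if the option is on;
            # otherwise it is simply skipped.
            if blanks_separate_docs and current:
                docs.append(current)
                current = []
            continue
        current.append(line)
    if current:
        docs.append(current)
    return docs


corpus = (
    "First sentence of document one.\n"
    "Second sentence of document one.\n"
    "\n"
    "First sentence of document two.\n"
)

docs = split_documents(corpus.splitlines())
merged = split_documents(corpus.splitlines(), blanks_separate_docs=False)
print(len(docs), len(merged))
```

In this sketch a corpus without any blank lines still works; it is just treated as one long document.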