
[How to create vocab.txt file]

Open Vietdung113 opened this issue 5 years ago • 4 comments

Can you explain how to build a vocab.txt file?

Vietdung113 avatar May 19 '20 01:05 Vietdung113

It's the same procedure you would use to create a vocab file for BERT. For example, you could use the excellent Tokenizers library from Hugging Face for that.

I documented the vocab generation steps for my Turkish BERT model; see them here.

After that, you can use this vocab for training a new model :)

stefan-it avatar May 19 '20 07:05 stefan-it
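
For readers who want a starting point, here is a minimal sketch of training a BERT-style WordPiece vocab with the Hugging Face Tokenizers library. It is not the exact procedure from the Turkish BERT write-up: the function name `train_wordpiece_vocab`, the vocab size, and the casing/accent settings are assumptions chosen for illustration and should be adjusted for your language and corpus.

```python
from pathlib import Path


def train_wordpiece_vocab(corpus_path, output_dir, vocab_size=32000):
    """Train a WordPiece vocab on a text file and write vocab.txt to output_dir.

    Hypothetical helper; the parameter values below are illustrative assumptions.
    """
    # Imported here so this module still loads when `tokenizers` is not installed.
    from tokenizers import BertWordPieceTokenizer

    # Cased model that keeps accents; change these for an uncased vocab.
    tokenizer = BertWordPieceTokenizer(
        clean_text=True,
        handle_chinese_chars=False,
        strip_accents=False,
        lowercase=False,
    )
    tokenizer.train(
        files=[str(corpus_path)],
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        wordpieces_prefix="##",
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # save_model writes the vocabulary as vocab.txt and returns the written paths.
    return tokenizer.save_model(str(output_dir))
```

Calling `train_wordpiece_vocab("corpus.txt", "vocab_out")` (both paths are placeholders) should leave a `vocab.txt` in `vocab_out` with one token per line, which is the format BERT-style training scripts expect.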

Thanks for your response. Can you give me the format of your "tr_final" file, or a sample of it?

Vietdung113 avatar May 21 '20 02:05 Vietdung113

I used a sentence-segmented training corpus. That means each line of the input file contains one sentence :)

stefan-it avatar May 22 '20 11:05 stefan-it

> I used a sentence-segmented training corpus. That means, each line of the input file contains one sentence :)

Is one sentence per line OK? I am asking because there is an option called blanks-separate-docs: "Whether blank lines indicate document boundaries (True by default)."

This sounds like you need blank lines to separate documents.

PhilipMay avatar Jul 17 '20 20:07 PhilipMay
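
To make the two layouts discussed above concrete, here is a small sketch of a corpus file that has one sentence per line and, in addition, a blank line between documents. The thread does not settle whether the blank lines are required; this only illustrates what such a file could look like. The helper names `write_corpus` and `read_documents` are hypothetical and not part of any library.

```python
from pathlib import Path


def write_corpus(documents, path):
    """Write one sentence per line, with a blank line between documents.

    `documents` is a list of documents, each a list of sentence strings.
    """
    blocks = ["\n".join(sentences) for sentences in documents if sentences]
    Path(path).write_text("\n\n".join(blocks) + "\n", encoding="utf-8")


def read_documents(path):
    """Split a corpus file back into documents at blank lines."""
    documents, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the current document.
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents
```

A file without any blank lines is still one sentence per line; under this reading it would simply be treated as a single document.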