PICK-pytorch
How do I choose keys.txt as a vocabulary? For example, where can I find an English vocab?
You can iterate through your whole training dataset and collect the set of characters, including the space. The text is encoded by character index, and the vocab size of the Embedding layer equals the length of keys. I think it does not compare well with newer methods like BPE encoding.
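The approach described above can be sketched roughly as follows. This is only an illustration, not the repo's own tooling: `build_charset` and `write_keys` are hypothetical helper names, the sample transcripts are made up, and it assumes keys.txt is a plain UTF-8 text file holding all characters on a single line.

```python
from pathlib import Path
from typing import Iterable


def build_charset(texts: Iterable[str]) -> str:
    """Return every distinct character found in `texts` as one sorted string."""
    chars = set()
    for text in texts:
        chars.update(text)
    # Drop line breaks so the result fits on a single line (assumed file layout).
    chars.discard("\n")
    chars.discard("\r")
    return "".join(sorted(chars))


def write_keys(keys: str, path: str) -> None:
    """Write the character set to `path` as UTF-8 (hypothetical helper)."""
    Path(path).write_text(keys, encoding="utf-8")


if __name__ == "__main__":
    # Made-up transcripts standing in for the text of a real training set.
    sample_texts = ["Total: 12.50", "Date 2021-05-26"]
    keys = build_charset(sample_texts)
    write_keys(keys, "keys.txt")
    print(f"{len(keys)} distinct characters written")
```

In practice `sample_texts` would be replaced by the transcripts read from your own annotation files, so that every character the model will see is covered.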
Hi ducviet00, is the vocabulary a set of characters or words? If it is a set of characters, does it mean we just list "abcd...zABCD...Z" plus numbers and special chars in keys.txt? Thanks!
Hi babyhockey, it's a good question because I have a problem with this vocab file too. Did you find the answer?
Yes, it's a list of English characters plus numbers and special characters.
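For an English-only dataset, such a list could be assembled from Python's `string` constants, as in the sketch below. The exact set of special characters to include, and the single-line layout of keys.txt, are assumptions here; the `char_to_index` mapping only illustrates the character-index encoding mentioned earlier in the thread.

```python
import string

# Letters, digits, ASCII punctuation, and the space character.
keys = string.ascii_letters + string.digits + string.punctuation + " "

# The vocabulary size is the number of characters in keys.
vocab_size = len(keys)

# Each character is encoded by its position in keys.
char_to_index = {ch: i for i, ch in enumerate(keys)}

encoded = [char_to_index[ch] for ch in "Total 12"]
```

Characters outside this list (accented letters or currency symbols, for example) would need to be appended if they occur in your data.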
The default vocab file (keys.txt) in this repo is in Chinese. I translated it and noticed that it contains not only characters but also words, sentences, etc., so I did not understand what this file actually is.
As far as I can tell, the default file contains a list of Chinese characters.
For a custom dataset, how should I write the keys.txt file?