PICK-pytorch

How to choose keys.txt as a vocabulary? For example, where can I find an English vocab?

Open karimcossentini opened this issue 3 years ago • 7 comments

karimcossentini avatar Apr 21 '21 14:04 karimcossentini

You can iterate through your whole training dataset and collect the set of characters, including the space. Text is encoded by character index, and the vocab size of the Embedding layer equals the length of keys. I think this approach is not as good as newer methods such as BPE encoding.
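A minimal sketch of that procedure, assuming the training text is available as an iterable of Python strings. The names `build_char_vocab`, `write_keys_file`, and `sample_texts` are hypothetical, and writing all characters on one line is an assumption about the file layout, so check how the repo's loader reads keys.txt before relying on it.

```python
from pathlib import Path
from typing import Iterable, List


def build_char_vocab(texts: Iterable[str]) -> List[str]:
    """Collect the distinct characters (space included) seen in the training text."""
    chars = set()
    for text in texts:
        chars.update(text)
    # Drop line-break characters; they are layout, not content.
    chars -= {"\n", "\r"}
    return sorted(chars)


def write_keys_file(chars: List[str], path: str) -> None:
    """Write the vocabulary to disk.

    Assumption: the file stores all characters as one run on a single line.
    """
    Path(path).write_text("".join(chars), encoding="utf-8")


# Hypothetical training strings, for illustration only.
sample_texts = ["Invoice No. 42", "Total: $10.50"]

vocab = build_char_vocab(sample_texts)

# Each character is encoded by its index; the embedding vocab size is len(vocab).
char_to_index = {ch: i for i, ch in enumerate(vocab)}
vocab_size = len(vocab)
```

Calling `write_keys_file(vocab, "keys.txt")` would then save the result. Note that the space sorts first, so a loader that strips whitespace from each line could drop it.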

ducviet00 avatar Apr 23 '21 01:04 ducviet00

Hi ducviet00, is the vocabulary a set of characters or words? If it is a set of characters, does it mean we just list "abcd...zABCD...Z" plus numbers and special chars in keys.txt? Thanks!

babyhockey avatar May 24 '21 14:05 babyhockey

Hi babyhockey, that's a good question, because I have a problem with this vocab file. Did you find the answer?

karimcossentini avatar May 26 '21 11:05 karimcossentini

Yes, it's a list of English characters plus numbers and special characters.
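As a rough sketch of such a list, the standard `string` module constants can be joined together. The variable name `english_keys` is hypothetical, and whether this exact character set suits a given dataset is an assumption; characters that appear in your data but not here (accented letters, currency symbols) would have to be added.

```python
import string

# ASCII letters, digits, punctuation, and the space character.
english_keys = (
    string.ascii_letters
    + string.digits
    + string.punctuation
    + " "
)

# Map each character to its index, as a character-level vocabulary would.
key_to_index = {ch: i for i, ch in enumerate(english_keys)}
```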

babyhockey avatar May 26 '21 11:05 babyhockey

The default vocab file (keys.txt) in this repo is in Chinese. I translated it and noticed that it contains not only characters but also sentences, words, etc., so I did not understand what this file actually is.

karimcossentini avatar May 26 '21 11:05 karimcossentini

As far as I can tell, the default file contains a list of Chinese characters.

babyhockey avatar May 26 '21 11:05 babyhockey

For a custom dataset, how should I write the keys.txt file?

arunmack789 avatar May 02 '22 12:05 arunmack789