TensorFlowASR

What should I do if I want to train a Japanese model?

Open ymzlygw opened this issue 4 years ago • 3 comments

Hi, my question is this: for English, if I understand correctly, the model directly outputs character indices, so each index can be mapped back to a character. For Japanese, what does the model output, and how do I create the mapping between indices and Japanese kanji?

ymzlygw avatar Aug 23 '21 07:08 ymzlygw

I see the english_characters, but what about Japanese? And to get the japanese_characters, should token_type be 'char' or 'bpe'? ENGLISH_CHARACTERS = [a-z],

ymzlygw avatar Aug 24 '21 07:08 ymzlygw

@ymzlygw I think for Japanese, Korean, and Chinese we should use subwords instead of characters. If you can define a vocabulary that contains all the characters of the language, as in English, then you can use character mode. As far as I know, those languages have characters that are combinations of "some characters in the alphabet", so I think it would be a lot of work to define a character vocabulary file.
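
To make the character-mode option concrete, here is a minimal sketch in plain Python of how a character vocabulary could be collected from the training transcripts rather than written by hand. It is not TensorFlowASR code: the helper names (`build_char_vocab`, `encode`, `decode`), the example sentences, and reserving index 0 for a blank token are all assumptions for illustration, so check the library's own featurizer and config for the vocabulary format it actually expects.

```python
from collections import Counter


def build_char_vocab(transcripts, min_count=1):
    """Collect every character seen in the transcripts (hypothetical helper)."""
    counts = Counter(ch for line in transcripts for ch in line.strip())
    chars = sorted(ch for ch, n in counts.items() if n >= min_count)
    # Assumption: index 0 is reserved for a blank token.
    index_to_char = ["<blank>"] + chars
    char_to_index = {ch: i for i, ch in enumerate(index_to_char)}
    return char_to_index, index_to_char


def encode(text, char_to_index):
    """Map text to indices, skipping characters missing from the vocabulary."""
    return [char_to_index[ch] for ch in text if ch in char_to_index]


def decode(indices, index_to_char):
    """Map indices back to text, dropping the blank index."""
    return "".join(index_to_char[i] for i in indices if i != 0)


# Tiny made-up corpus; a real one would be the training transcripts.
transcripts = ["今日は晴れです", "明日は雨です"]
char_to_index, index_to_char = build_char_vocab(transcripts)

ids = encode("今日は雨です", char_to_index)
print(len(index_to_char), ids, decode(ids, index_to_char))
```

With a real Japanese or Chinese corpus this vocabulary can grow to several thousand entries, which is the practical reason subwords are often preferred.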

nglehuy avatar Oct 10 '21 09:10 nglehuy

Hi, I tried to train a Chinese model and the results do not seem good. I followed the Conformer steps the same way as for English. Can you suggest how I could properly train a Chinese model? Thanks!

psyma avatar Feb 16 '22 13:02 psyma