How to create a pronunciation lexicon for Bengali?

Open Rajan-sust opened this issue 5 years ago • 6 comments

Creating a pronunciation for a word involves two tasks: finding the phonemes and splitting the word into syllables. I think splitting into syllables is the hard part. The expected format is shown in [1]. How can I do it programmatically?

[1] https://github.com/google/language-resources/blob/master/bn/data/lexicon.tsv

Rajan-sust avatar May 29 '19 09:05 Rajan-sust

Do you mean a program that can take in arbitrary words and output their transcriptions?

pasindud avatar May 29 '19 18:05 pasindud

Yes, how can I do it?

Rajan-sust avatar Jun 02 '19 07:06 Rajan-sust

The quick answer is no.

pasindud avatar Jun 02 '19 19:06 pasindud

I hope you will share it if you find an idea.

Rajan-sust avatar Jun 03 '19 05:06 Rajan-sust

We merged the lexicon words from [1] and [2]. The merged lexicon contains 64969 unique words. 4443 unique words from our corpus do not exist in the merged lexicon. What would be the best procedure for transcribing these 4443 words for the lexicon?

[1] https://github.com/google/language-resources/blob/master/bn/data/lexicon.tsv [2] https://github.com/google/language-resources/blob/master/bn/festvox/lexicon.scm
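
For reference, the missing-word check described above can be sketched roughly as follows. This is only a minimal sketch, assuming the lexicon is a tab-separated file whose first column is the word; `load_lexicon_words` and `missing_words` are hypothetical helper names, not part of language-resources, and the sample entries are made-up placeholders rather than real lexicon data.

```python
import csv
import io


def load_lexicon_words(tsv_file):
    """Collect the words found in the first column of a tab-separated lexicon.

    Assumption: one entry per line, word in the first column.
    """
    words = set()
    # QUOTE_NONE keeps quote characters in the data as literal text.
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row and row[0].strip():  # skip blank lines and empty first columns
            words.add(row[0].strip())
    return words


def missing_words(corpus_words, lexicon_words):
    """Return the sorted, de-duplicated corpus words absent from the lexicon."""
    return sorted(set(corpus_words) - lexicon_words)


if __name__ == "__main__":
    # Placeholder data standing in for a real lexicon file and corpus.
    sample_lexicon = io.StringIO("alpha\ta l . f a\nbeta\tb e . t a\n")
    lexicon = load_lexicon_words(sample_lexicon)
    for word in missing_words(["beta", "gamma", "delta", "gamma"], lexicon):
        print(word)
```

With a real file, the `io.StringIO` object would be replaced by something like `open(path, encoding="utf-8", newline="")`. The sketch only lists the words that still need a transcription; it does not produce the transcriptions themselves.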

Rajan-sust avatar Jun 11 '19 15:06 Rajan-sust

Note that [2] is generated from [1]. The only difference is the file type.

  • This conversion is done by running

    cat bn/data/lexicon.tsv | python festival_utils/festival_lexicon_from_tsv.py > bn/festvox/lexicon.scm
    

The transcription guide can be found at [3].

[1] https://github.com/google/language-resources/blob/master/bn/data/lexicon.tsv [2] https://github.com/google/language-resources/blob/master/bn/festvox/lexicon.scm [3] https://github.com/google/language-resources/blob/master/bn/transcription.md

pasindud avatar Jun 11 '19 20:06 pasindud