MeloTTS Request for PR review: Add support for Thai language

I have created the following PR to add support for Thai language. I am in the process of creating a dataset to train the model but would love a PR review of the code first to make sure I am on the right track.

Thank you!

#117

Apr 30 '24 05:04 jadechip

Great job! I planned to work on training a Thai TTS model using MeloTTS too.

Apr 30 '24 13:04 tchayintr

Hi - Thanks for the contribution. We would suggest you first train on the Thai dataset to see if the code works. We haven't attempted to train on Thai yet.

May 01 '24 01:05 Zengyi-Qin

@Zengyi-Qin Sounds good, will report back once I have proper training results.

May 01 '24 07:05 jadechip

Thank you @tchayintr, if you have any recommendations for Thai audio datasets, I would greatly appreciate it!

May 01 '24 07:05 jadechip

@jadechip Sure! There are several datasets such as TSync2, Lotus, etc. You can check several of them here: https://github.com/korakot/corpus/releases/tag/v1.0 with documentation at https://lexitron.nectec.or.th/KM_HL5001/file_HL5001/Document/krrn_14518.pdf.

There are also Thai dialects available at https://github.com/SLSCU/thai-dialect-corpus.

However, I recommend collecting clear voice clips and crafting their transcriptions with ASR tools like WhisperX. This way, you can generate a lot of samples, but you may need to fine-tune it for the Thai language 😄.

I am reviewing your commits too. They mostly look great 🎆 , but I found some points that need to be clarified. I will clarify and let you know if there is a point that may need to be adjusted in terms of Thai linguistic knowledge.

May 01 '24 08:05 tchayintr

@tchayintr this is super helpful and any feedback you have for my code will be greatly appreciated 🙏 I was also looking at this other nectec dataset: https://github.com/vistec-AI/dataset-releases/releases/tag/v1 I'll work on creating transcriptions next and report back.

May 01 '24 13:05 jadechip

@Zengyi-Qin are there any additional steps or files needed before training? I am getting the following error:

output

⚡ add-thai ~/MeloTTS/melo torchrun --nproc_per_node=1 --master_port=10902 train.py --c data/thai/config.json --model thai
2024-05-07 15:24:58.152 | INFO     | data_utils:_filter:64 - Init dataset...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141910/141910 [00:04<00:00, 32864.77it/s]
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:84 - min: 65; max: 987
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:85 - skipped: 327, total: 141910
buckets: [92994, 31326, 11604, 4350, 1068, 156, 84, 24]
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2024-05-07 15:25:02.699 | INFO     | data_utils:_filter:64 - Init dataset...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32832.13it/s]
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:84 - min: 164; max: 625
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:85 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:01<?, ?it/s]
Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 194, in __getitem__
    return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 98, in get_audio_text_speaker_pair
    bert, ja_bert, phones, tone, language = self.get_text(
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 180, in get_text
    raise
RuntimeError: No active exception to reraise

...it seems to happen around line 200 in train.py
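For context on the last line of the traceback: `RuntimeError: No active exception to reraise` is what Python raises when a bare `raise` statement runs while no exception is being handled. The sketch below reproduces that behavior. The function names and the language list are hypothetical and are not taken from MeloTTS's `data_utils.py`; that the bare `raise` sits in a fallback branch for unknown inputs is only a guess from the traceback.

```python
SUPPORTED = ("ZH", "JP", "EN")  # hypothetical list, not MeloTTS's actual one


def dispatch_buggy(language):
    """Fallback branch uses a bare raise, which hides the real cause."""
    if language in SUPPORTED:
        return f"features for {language}"
    else:
        raise  # no exception is active here, so Python raises RuntimeError


def dispatch_explicit(language):
    """Same dispatch, but the failure names the offending value."""
    if language in SUPPORTED:
        return f"features for {language}"
    raise ValueError(f"unsupported language code: {language!r}")


for dispatch in (dispatch_buggy, dispatch_explicit):
    try:
        dispatch("TH")
    except (RuntimeError, ValueError) as error:
        print(dispatch.__name__, "->", type(error).__name__, error)
```

Raising a named exception with the offending value makes the log point at the unsupported input directly, instead of at the `raise` statement itself.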

config.json

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 6,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "data/thai/train.list",
    "validation_files": "data/thai/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "TH-default": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 9,
  "num_tones": 17,
  "symbols": [
    "_",
    "\"",
    "(",
    ")",
    "*",
    "/",
    ":",
    "AA",
    "E",
    "EE",
    "En",
    "N",
    "OO",
    "Q",
    "V",
    "[",
    "\\",
    "]",
    "^",
    "a",
    "a:",
    "aa",
    "ae",
    "ah",
    "ai",
    "an",
    "ang",
    "ao",
    "aw",
    "ay",
    "b",
    "by",
    "c",
    "ch",
    "d",
    "dh",
    "dy",
    "e",
    "e:",
    "eh",
    "ei",
    "en",
    "eng",
    "er",
    "ey",
    "f",
    "g",
    "gy",
    "h",
    "hh",
    "hy",
    "i",
    "i0",
    "i:",
    "ia",
    "ian",
    "iang",
    "iao",
    "ie",
    "ih",
    "in",
    "ing",
    "iong",
    "ir",
    "iu",
    "iy",
    "j",
    "jh",
    "k",
    "ky",
    "l",
    "m",
    "my",
    "n",
    "ng",
    "ny",
    "o",
    "o:",
    "ong",
    "ou",
    "ow",
    "oy",
    "p",
    "py",
    "q",
    "r",
    "ry",
    "s",
    "sh",
    "t",
    "th",
    "ts",
    "ty",
    "u",
    "u:",
    "ua",
    "uai",
    "uan",
    "uang",
    "uh",
    "ui",
    "un",
    "uo",
    "uw",
    "v",
    "van",
    "ve",
    "vn",
    "w",
    "x",
    "y",
    "z",
    "zh",
    "zy",
    "~",
    "æ",
    "ç",
    "ð",
    "ø",
    "ŋ",
    "œ",
    "ɐ",
    "ɑ",
    "ɒ",
    "ɔ",
    "ɕ",
    "ə",
    "ɛ",
    "ɜ",
    "ɡ",
    "ɣ",
    "ɥ",
    "ɦ",
    "ɪ",
    "ɫ",
    "ɬ",
    "ɭ",
    "ɯ",
    "ɲ",
    "ɵ",
    "ɸ",
    "ɹ",
    "ɾ",
    "ʁ",
    "ʃ",
    "ʊ",
    "ʌ",
    "ʎ",
    "ʏ",
    "ʑ",
    "ʒ",
    "ʝ",
    "ʲ",
    "ˈ",
    "ˌ",
    "ː",
    "̃",
    "̩",
    "β",
    "θ",
    "ก",
    "ข",
    "ฃ",
    "ค",
    "ฅ",
    "ฆ",
    "ง",
    "จ",
    "ฉ",
    "ช",
    "ซ",
    "ฌ",
    "ญ",
    "ฎ",
    "ฏ",
    "ฐ",
    "ฑ",
    "ฒ",
    "ณ",
    "ด",
    "ต",
    "ถ",
    "ท",
    "ธ",
    "น",
    "บ",
    "ป",
    "ผ",
    "ฝ",
    "พ",
    "ฟ",
    "ภ",
    "ม",
    "ย",
    "ร",
    "ล",
    "ว",
    "ศ",
    "ษ",
    "ส",
    "ห",
    "ฬ",
    "อ",
    "ฮ",
    "ะ",
    "ั",
    "า",
    "ำ",
    "ิ",
    "ี",
    "ึ",
    "ื",
    "ุ",
    "ู",
    "เ",
    "แ",
    "โ",
    "ใ",
    "ไ",
    "ๅ",
    "็",
    "่",
    "้",
    "์",
    "๐",
    "๑",
    "๒",
    "๓",
    "๔",
    "๕",
    "๖",
    "๗",
    "๘",
    "๙",
    "ᄀ",
    "ᄁ",
    "ᄂ",
    "ᄃ",
    "ᄄ",
    "ᄅ",
    "ᄆ",
    "ᄇ",
    "ᄈ",
    "ᄉ",
    "ᄊ",
    "ᄋ",
    "ᄌ",
    "ᄍ",
    "ᄎ",
    "ᄏ",
    "ᄐ",
    "ᄑ",
    "ᄒ",
    "ᅡ",
    "ᅢ",
    "ᅣ",
    "ᅤ",
    "ᅥ",
    "ᅦ",
    "ᅧ",
    "ᅨ",
    "ᅩ",
    "ᅪ",
    "ᅫ",
    "ᅬ",
    "ᅭ",
    "ᅮ",
    "ᅯ",
    "ᅰ",
    "ᅱ",
    "ᅲ",
    "ᅳ",
    "ᅴ",
    "ᅵ",
    "ᆨ",
    "ᆫ",
    "ᆮ",
    "ᆯ",
    "ᆷ",
    "ᆸ",
    "ᆼ",
    "ㄸ",
    "!",
    "?",
    "…",
    ",",
    ".",
    "'",
    "-",
    "¿",
    "¡",
    "SP",
    "UNK"
  ]
}

May 07 '24 15:05 jadechip

Never mind, I was able to pinpoint the issue. I didn't realize you needed to add the language code here as well:

I've updated my PR with the missing code. It seems to be training correctly now, although I am still getting some warnings/exceptions:

Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:00<?, ?it/s][W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [99168, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Evaluating ...
Evauate done
  0%|▍                                                                                                                                           | 74/23601 [03:24<11:00:36,  1.68s/it]min value is  tensor(-1.1265)

Will try to run the complete training loop on some H100s 🤞

May 08 '24 06:05 jadechip

Hello @jadechip, let me know if it's working.

I'm training for the Indonesian and Malay languages,

changing the phonemes and BERT as well.

After 10 epochs the model doesn't produce any good words, only some noise and some random vowels.

My data: ~200 hours, ~500 speakers.

May 08 '24 08:05 acul3

hello @jadechip @acul3

It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text.

Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model. https://huggingface.co/transformers/v2.11.0/_modules/transformers/modeling_utils.html (_get_resized_embeddings function)

May 08 '24 15:05 jeremy110

Thank you @jeremy110. If I understand correctly, in melo/models.py we should first initialize the TextEncoder with the original 219 in order to use the pretrained weights, like this:

# models.py
        self.enc_p = TextEncoder(
            219,  # Initialize with the original symbol size
            inter_channels,
            hidden_channels,
            filter_channels,
            n_heads,
            n_layers,
            kernel_size,
            p_dropout,
            gin_channels=self.enc_gin_channels,
            num_languages=num_languages,
            num_tones=num_tones,
        )

...then right after, add a check for whether n_vocab (len(symbols)) has a different size and, if so, update self.enc_p.emb with the resized embeddings?

if n_vocab != 219:
    old_embeddings = self.enc_p.emb
    new_num_tokens = n_vocab
    self.enc_p.emb = self.get_resized_embeddings(old_embeddings, new_num_tokens)

Does that look correct to you? Note: I've updated my PR to reflect this.
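To make the resizing idea concrete, here is a framework-free sketch: rows for existing symbol ids are copied over unchanged, and rows for new ids get a fresh random initialization. It uses plain Python lists so it runs anywhere; in the model, the rows would be the weight of the embedding layer. The function name `resize_embedding_rows` and the normal initialization with standard deviation `dim ** -0.5` are assumptions for illustration, not the PR's actual code.

```python
import random


def resize_embedding_rows(old_rows, new_num_tokens, dim, seed=0):
    """Return new_num_tokens rows of width dim.

    Rows whose index exists in old_rows are copied; the rest keep
    their fresh random initialization.
    """
    rng = random.Random(seed)
    new_rows = [
        [rng.gauss(0.0, dim ** -0.5) for _ in range(dim)]
        for _ in range(new_num_tokens)
    ]
    keep = min(len(old_rows), new_num_tokens)
    for index in range(keep):
        new_rows[index] = list(old_rows[index])  # copy, do not alias
    return new_rows


# Toy "pretrained" table with 3 symbols, grown to 5 symbols.
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
resized = resize_embedding_rows(pretrained, new_num_tokens=5, dim=2)
print(len(resized), resized[:3] == pretrained)
```

Copying by index only helps if every pretrained symbol keeps its original id in the new symbol list, so the order of the symbol list matters as much as its length.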

May 09 '24 06:05 jadechip

hello~ @jadechip

Yes, it looks fine as it is.

However, in symbols.py, you'll need to make some modifications. If you place your new symbol inside the sorted list and then use the method above, it may result in some symbols having weights that don't match up with the original model. So, I suggest you do it like this.

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + new_symbols # add new symbols here
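The effect of the ordering can be checked on a toy inventory. The sketch below compares two ways of adding symbols: merging them into the sorted list versus appending them at the end. All symbol lists here are made-up stand-ins, and `shifted` is a hypothetical helper that reports which existing symbols would change id.

```python
pad = "_"
base_symbols = ["a", "k", "n"]   # stand-in for the pretrained inventory
punctuation = ["!", "?"]
new_symbols = ["b", "ก"]         # hypothetical additions

old = [pad] + sorted(set(base_symbols)) + punctuation
merged = [pad] + sorted(set(base_symbols + new_symbols)) + punctuation
appended = [pad] + sorted(set(base_symbols)) + punctuation + new_symbols


def shifted(old_list, new_list):
    """Symbols from old_list whose index differs in new_list."""
    return [s for i, s in enumerate(old_list) if new_list.index(s) != i]


print("merged into sorted list:", shifted(old, merged))
print("appended at the end:   ", shifted(old, appended))
```

Any symbol reported as shifted would be looked up in an embedding row that was trained for a different symbol, which is the mismatch described above.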

May 09 '24 07:05 jeremy110

I see, thank you for the heads up @jeremy110 🙏 I've updated my code to reflect your suggestion, so now I have:

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + th_symbols
sil_phonemes_ids = [symbols.index(i) for i in pu_symbols]

# combine all tones
num_tones = num_zh_tones + num_ja_tones + num_en_tones + num_kr_tones + num_es_tones + num_fr_tones + num_de_tones + num_ru_tones + num_th_tones

# language maps
language_id_map = {"ZH": 0, "JP": 1, "EN": 2, "ZH_MIX_EN": 3, 'KR': 4, 'ES': 5, 'SP': 5, 'FR': 6, 'TH': 7}
num_languages = len(language_id_map.keys())

I'll try running a new training job to evaluate performance with these changes.

May 09 '24 08:05 jadechip

Thanks @jadechip and @jeremy110.

I'll try it in my environment as well and see if it works.

May 09 '24 11:05 acul3

Ok, I was able to run a training job for around 9k steps yesterday. I tried running inference using the new checkpoint, but it seems to produce unintelligible sounds. I think the learning rate looks ok though? ...so I will try ramping up the batch size and training for longer on multiple GPUs and report back with my results 🤞 For reference here is my current config and Tensorboard metrics.

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 16,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "../Data/locutor/train.list",
    "validation_files": "../Data/locutor/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "locutor": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 1,
  "num_tones": 16,
  "symbols": [
...

May 11 '24 05:05 jadechip

Btw, I am currently training on a subset of Thai Common Voice 13, converted to .wav with a sample rate of 48 kHz. Edit: Happy weekend everyone 🎉

May 11 '24 05:05 jadechip

hello~ @jadechip

My config is basically the same as yours, except my batch size is 6. Perhaps you can increase your learning rate to 9e-4 and see how it performs. Also, I've added a constraint to the clip_grad_value in the code.

grad_norm_d = commons.clip_grad_value_(net_d.parameters(), 200)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), 500)
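As a rough illustration of what clipping by value does, here is a plain-Python sketch that clamps every gradient entry to a fixed range and returns the L2 norm measured before clamping. Whether `commons.clip_grad_value_` in MeloTTS behaves exactly like this is an assumption; the function below is a stand-in that works on lists of floats rather than tensors.

```python
import math


def clip_grad_value(grads, clip_value):
    """Clamp each entry to [-clip_value, clip_value] in place.

    Returns the L2 norm of all entries as measured before clamping.
    Passing clip_value=None only measures the norm.
    """
    total = 0.0
    for grad in grads:
        total += sum(g * g for g in grad)
        if clip_value is not None:
            grad[:] = [max(-clip_value, min(clip_value, g)) for g in grad]
    return math.sqrt(total)


grads = [[3.0, -4.0], [0.5]]
norm = clip_grad_value(grads, 1.0)
print(norm, grads)
```

With thresholds as large as 200 and 500, the clamp only takes effect on unusually large gradient entries, so it acts as a guard against spikes rather than as a constant rescaling.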

Finally, I'm attaching my tensorboard for reference. (https://drive.google.com/drive/folders/1xPNURmWsmJqwEDHVM8ZsK6CAbuv65ipI?usp=sharing)

Additionally, if the silence at the start and end of your audio files is shorter, your g/dur will converge to a smaller value, which also affects the length of the silence at the start and end of the synthesized audio.

I'm not sure if the Thai CommonVoice 13 dataset is suitable for training. Also, there's no need to specifically convert it to 48kHz. I remember that the code will resample it. I think you can start by testing whether it can be trained with 10 hours of data from one person.

I hope this is helpful for you.

May 11 '24 12:05 jeremy110

Thank you for sharing! Your advice has been super helpful @jeremy110 🙏

May 12 '24 06:05 jadechip

Hmm, I trained for longer with different hyperparameters, but so far the results are not much better. Something might be wrong with my code.

May 15 '24 05:05 jadechip

Yeah, me too.

With longer training, the voice is clearer and similar, but it can't pronounce a single word.

Maybe it's a phonemizer problem, I don't know.

May 15 '24 06:05 acul3

hello @jadechip @acul3 I'd like to confirm something. Are all your tones set to 0? Because I made a similar mistake before where I treated tones like ˧ ˦ as phones, but they should correspond to tones. Here's an example of what I did before.

#error
phones: ['_', 'k', 'e', 'ʔ', '˧', 'p', 'i', 'a', 'ʔ', '˧', 'ʦ', 'ʰ', 'i', 'n', '˦', '˦', 'k', 'e', '˦', '˦', ',', 'l', 'e', '˥', '˧', 's', 'ɔ', '˨', '˩', 'g', 'u', 'a', 'n', '˩', '˧', 'ʦ', 'a', 'i', '˧', '˧', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 4, 5, 6, 4, 1, 4, 4, 6, 5, 1, 1]

#correct
phones: 28 ['_', 'k', 'e', 'ʔ', 'p', 'i', 'a', 'ʔ', 'ʦ', 'ʰ', 'i', 'n', 'k', 'e', ',', 'l', 'e', 's', 'ɔ', 'g', 'u', 'a', 'n', 'ʦ', 'a', 'i', '.', '_']
tones: [0, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 0, 2, 2, 3, 3, 5, 5, 5, 5, 7, 7, 7, 0, 0]
word2ph: [1, 3, 4, 4, 2, 1, 2, 2, 4, 3, 1, 1]
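A quick sanity check along these lines can catch the mistake before training. The invariants below are inferred from the two examples above (equal lengths for phones and tones, and word2ph summing to the number of phones); they are assumptions, and `check_alignment` is a hypothetical helper, not part of MeloTTS.

```python
TONE_LETTERS = set("˥˦˧˨˩")


def check_alignment(phones, tones, word2ph):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(phones) != len(tones):
        problems.append(f"{len(phones)} phones but {len(tones)} tones")
    if sum(word2ph) != len(phones):
        problems.append(
            f"word2ph sums to {sum(word2ph)}, expected {len(phones)}"
        )
    leaked = [p for p in phones if p and all(c in TONE_LETTERS for c in p)]
    if leaked:
        problems.append(f"tone letters used as phones: {leaked}")
    if tones and not any(tones):
        problems.append("every tone is 0")
    return problems


print(check_alignment(["_", "l", "e", "_"], [0, 2, 2, 0], [1, 2, 1]))
print(check_alignment(["_", "l", "e", "˥", "_"], [0, 0, 0, 0, 0], [1, 3, 1]))
```

Running this over a few lines of the preprocessed training list is cheaper than discovering the problem after many hours of training.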

May 15 '24 09:05 jeremy110

@jeremy110 Yes, all my tones are set to 0.

Now I'm wondering how I can fix this.

May 15 '24 17:05 acul3

hello~ @acul3 @jadechip Sorry, I spent some time looking at that, but since I can't read Thai, I did some online research. I wanted to ask about the symbols on lines 266 to 339 of th_symbols. Are those symbols not IPA?

Also, I looked at the Wiktionary file and found several symbols that seem to represent tones: ˧, ˨˩, ˦˥, ˩˦, and ˥˩. It looks like there are five tones. So, you need to convert these symbols into tones and then add the corresponding number of tones to the 'tones' list based on the number of phones in your phone list.

But I'm confused about lines 5908 to 5910. Which one is correct?

May 16 '24 01:05 jeremy110

@jeremy110 you are absolutely right. My code was outputting zeroes for the tones list. I've pushed some changes to the g2p function which hopefully addresses this:

def g2p(norm_text):
    tokenized = tokenizer.tokenize(norm_text)
    phs = []
    word2ph = []
    current_word = []
    current_phonemes = []

    for token in tokenized:
        if token.startswith("▁"):  # Start of a new word
            if current_word:
                word_phonemes = " ".join(current_phonemes)
                phs.extend(word_phonemes.split())
                word2ph.append(len(current_phonemes))
                current_word = []
                current_phonemes = []
            current_word.append(token.replace("▁", ""))
        else:
            current_word.append(token)

        if token in punctuation or token in pu_symbols:
            phs.append(token)
            word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(token.replace("▁", ""))
            current_phonemes.extend(phonemes.split())

    if current_word:
        word_phonemes = " ".join(current_phonemes)
        phs.extend(word_phonemes.split())
        word2ph.append(len(current_phonemes))

    # Distribute phonemes to match the number of tokens
    distributed_word2ph = []
    for i, group in enumerate(tokenized):
        if group.startswith("▁"):
            group = group.replace("▁", "")
        if group in punctuation or group in pu_symbols:
            distributed_word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(group)
            distributed_word2ph.append(len(phonemes.split()))

    tone_markers = ['˥', '˦', '˧', '˨', '˩']
    phones = ["_"] + [re.sub(f'[{"".join(tone_markers)}]', '', p) for p in phs] + ["_"]  # Remove tone markers from phones
    tones = extract_tones(phs)  # Extract tones from the original phs list
    word2ph = [1] + distributed_word2ph + [1]

    assert len(word2ph) == len(tokenized) + 2

    return phones, tones, word2ph


def extract_tones(phones):
    tones = []
    tone_map = {
        "˥": 5,  # High tone
        "˦": 4,  # Rising tone
        "˧": 3,  # Mid tone
        "˨": 2,  # Falling tone
        "˩": 1,  # Low tone
    }

    for phone in phones:
        tone = 0
        for marker, value in tone_map.items():
            if marker in phone:
                tone = value
                break
        tones.append(tone)

    return tones

TL;DR:

It now removes the tone markers from the phonemes in phs using a regular expression and stores the result in the phones list, adding start and end markers ("_").
It then extracts the tones from the original phs list using the extract_tones function and stores them in the tones list.
It constructs the final word2ph list by adding start and end markers (1) to the distributed_word2ph list, and finally returns the phones, tones, and word2ph lists.

...I've also updated the following test case:

def test_g2p():
    text = "ฉันรักเมืองไทย"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)
    assert phones == ['_', 't͡ɕʰ', 'a', 'n', '', 'r', 'a', 'k̚', '', 'm', 'ɯa̯', 'ŋ', '', 'tʰ', 'aj', '', '.', 'j', 'a', '', '.', '_']
    assert tones == [0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 5, 0]
    assert word2ph == [1, 0, 8, 12, 1]

I think this output makes sense, as it is now similar to yours.

The phones list contains the phonemes corresponding to the input text, excluding the tone markers. The mapping of tone markers to numeric values seems accurate (4 for ˩˩˦, 5 for ˦˥, 3 for ˧).

The word2ph list represents the number of phonemes for each word in the tokenized input. The values correspond to the number of phonemes for each word:

1: Start-of-sequence token
0: No phonemes for the first token (likely punctuation or special symbol)
8: Number of phonemes for the second token ("ฉันรัก")
12: Number of phonemes for the third token ("เมืองไทย")
1: End-of-sequence token

May 16 '24 07:05 jadechip

About the Thai symbols, the characters from line 266 to 339 are the characters of the Thai alphabet, including numbers. The remaining lines (340 - 406) were characters that I copied from the Wiktionary file (which I got from here https://github.com/PyThaiNLP/thai-g2p-wiktionary-corpus/tree/main), I am not sure if I should include them in this file (symbols.py) but if I remember correctly I was getting an error if I didn't include them.

May 16 '24 07:05 jadechip

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔 Maybe I should try looking for a different Grapheme to Phoneme dictionary...

May 16 '24 07:05 jadechip

I appreciate your hard work. 🥇

One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., Haas, IPA, etc.):

https://github.com/wannaphong/thai-grapheme-to-phoneme
https://github.com/nozomiyamada/thaig2p
https://github.com/wannaphong/thai-g2p
https://www.thaicorpus.net/g2p

In case you missed some of them. 😄

While rule-based tools offer more precise conversions, they may not always provide results for some graphemes. Seq2seq tools, on the other hand, offer more flexible conversions, but their CER or PER is still considered high, IMO. Of course, these factors can reduce the smoothness in TTS.

I am concerned about the current state of Thai G2P and am trying to survey how we can address the challenges with Thai G2P.

May 16 '24 07:05 tchayintr

text: 禮          數
ipa: l e ˥ ˧      s ɔ ˨˩
phones: ['_', 'l', 'e',     's', 'ɔ', '_']
tones: [0, 2, 2,       3, 3, 0]
word2ph: [1, 2,      2, 1]

Perhaps I misled you a bit. Let me clarify using an example. In my case, '˥ ˧' corresponds to tone 2. The syllable has two phones, 'l' and 'e', so the tones list gets two 2s. Likewise, '˨˩' corresponds to tone 3 in my case. The syllable has two phones, 's' and 'ɔ', so the tones list gets two 3s.
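The rule in this example can be sketched as a small function: collect the phones of a syllable, read the tone contour that follows them, and repeat that contour's id once per phone. The names below are hypothetical, the numeric ids in `TONE_IDS` are arbitrary placeholders, and the input is assumed to be a flat list of IPA tokens in which each syllable's tone letters come after its phones.

```python
TONE_LETTERS = set("˥˦˧˨˩")
PUNCTUATION = {",", ".", "!", "?"}
# Placeholder ids for the five contours mentioned in this thread; 0 = no tone.
TONE_IDS = {"˧": 1, "˨˩": 2, "˥˩": 3, "˦˥": 4, "˩˦": 5}


def phones_and_tones(ipa_tokens, tone_ids=TONE_IDS):
    """Split IPA tokens into parallel phone and tone lists with '_' padding."""
    phones, tones = ["_"], [0]
    syllable, contour = [], ""

    def flush():
        # Emit the pending syllable, one tone id per phone.
        nonlocal syllable, contour
        if syllable:
            tone = tone_ids[contour] if contour else 0
            phones.extend(syllable)
            tones.extend([tone] * len(syllable))
        syllable, contour = [], ""

    for token in ipa_tokens:
        if token in PUNCTUATION:
            flush()
            phones.append(token)
            tones.append(0)
        elif all(ch in TONE_LETTERS for ch in token):
            contour += token  # tone letters may arrive split, e.g. '˥', '˧'
        else:
            if contour:
                flush()  # a new syllable starts after a tone contour
            syllable.append(token)
    flush()
    phones.append("_")
    tones.append(0)
    return phones, tones


tokens = "l e ˥ ˧ s ɔ ˨˩".split()
print(phones_and_tones(tokens, {"˥˧": 2, "˨˩": 3}))
```

An unknown contour raises a `KeyError` here on purpose, so that unexpected tone sequences from the G2P dictionary surface early instead of silently becoming tone 0.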

May 16 '24 08:05 jeremy110

Because I don't know Thai at all, I can't help with the G2P part. Sorry.

May 16 '24 08:05 jeremy110

@jeremy110 Don't worry, this is not your fault at all!

We are here to discuss and find a solution.

I will keep you updated if I find something. @jadechip @jeremy110

May 16 '24 09:05 tchayintr
