To train a multilingual model for multiple Indian languages

Open rose-jinyang opened this issue 7 months ago • 16 comments

Hello, how are you? Thanks for contributing to this project. Is it possible to train a multilingual model for multiple Indian languages?

rose-jinyang avatar May 22 '25 12:05 rose-jinyang

@rose-jinyang Hi, thank you for your interest.

You can train a multilingual model as long as you have the right dataset; you only need to modify the following two functions in the training script to make it work with your data:

https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L124 https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L171

One potential issue is that the current code uses EnCodec to extract continuous latent representations. I’m not sure how well EnCodec supports Indian languages. If you want to use a different model to obtain the latent representations, you’ll also need to adjust the feature-extraction section in the model file:

https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/sled/sled.py#L127

Paulmzr avatar May 22 '25 16:05 Paulmzr

Hi @Paulmzr, could you implement training of a multilingual TTS model, that is, a single model for multiple languages?

rose-jinyang avatar May 22 '25 16:05 rose-jinyang

Hi @Paulmzr, I think the tokenizer should be extended for Indic languages. How can I do that?

rose-jinyang avatar May 22 '25 18:05 rose-jinyang

Hi @rose-jinyang, the current code already supports training a multilingual model. You only need to organize your multilingual dataset and modify data_path and manifest_path in the training script:

https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L93 https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L94

Yes, the tokenizer should also be modified to support Indic languages. You can change the following line: https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L288 to

tokenizer = AutoTokenizer.from_pretrained("...", padding_side="left", add_eos_token=True)

to use a tokenizer supporting your language.
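
For readers unsure what the two extra arguments change, here is a toy illustration of the effect they are meant to have on a batch: each sequence gets an end-of-sequence ID appended, and shorter sequences are padded on the left. This is a sketch in plain Python, not the transformers implementation; the function name and the ID values are hypothetical, and it assumes the chosen tokenizer honours both arguments (not every tokenizer class implements add_eos_token).

```python
from typing import List


def add_eos_and_left_pad(batch: List[List[int]], eos_id: int, pad_id: int) -> List[List[int]]:
    """Append EOS to every sequence, then pad on the left to a common length."""
    with_eos = [ids + [eos_id] for ids in batch]
    width = max(len(ids) for ids in with_eos)
    return [[pad_id] * (width - len(ids)) + ids for ids in with_eos]


# Hypothetical token IDs, chosen only for illustration.
batch = [[5, 6, 7], [8]]
padded = add_eos_and_left_pad(batch, eos_id=2, pad_id=0)
print(padded)  # [[5, 6, 7, 2], [0, 0, 8, 2]]
```

With left padding, the last position of every row holds real text (here the EOS ID) rather than padding.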

Paulmzr avatar May 23 '25 01:05 Paulmzr

Thanks for your quick reply. I am going to use the SUTRA tokenizer (https://huggingface.co/TWO/sutra-mlt256-v2), as in the paper (https://arxiv.org/pdf/2411.12240v1). One more question: should the "language" key exist for each voice in the dataset's manifest file, as in the original Libriheavy dataset?

rose-jinyang avatar May 23 '25 03:05 rose-jinyang

@rose-jinyang This key is optional; the model doesn’t need to receive a language tag explicitly.

Paulmzr avatar May 23 '25 03:05 Paulmzr

Thank you @Paulmzr

rose-jinyang avatar May 23 '25 03:05 rose-jinyang

Hi @Paulmzr What about using the following codec rather than EnCodec? https://github.com/jishengpeng/WavTokenizer

rose-jinyang avatar May 23 '25 05:05 rose-jinyang

@rose-jinyang I haven't tried it. It might work. However, I noticed that this paper reports that WavTokenizer performs worse than EnCodec (please refer to Table 1).

Paulmzr avatar May 23 '25 08:05 Paulmzr

Hi @Paulmzr, if so, what about using XCodec 2 from Llasa? EnCodec is used to extract continuous latent representations, but I am not sure whether XCodec 2 produces discrete representations like WavTokenizer.

rose-jinyang avatar May 23 '25 08:05 rose-jinyang

@rose-jinyang You can use xcodec2. In practice, any codec can serve as an extractor of continuous latent representations. You could use the representations produced before the VQ/RVQ layer as latents, or alternatively sum the embeddings selected by the quantizer layers to build your latents. I’m not sure how well xcodec2 supports the Indian languages you need. In their paper (Section 3.1.1) they mention that the languages used to train xcodec2 do not include Indian languages.
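
The two options for getting latents out of a codec can be sketched with a toy residual vector quantizer. This is only an illustration in plain Python under stated assumptions: the codebooks are random rather than learned, the helper names (nearest, rvq_latent) are hypothetical, and nothing here is the xcodec2 or EnCodec API.

```python
import random
from typing import List, Tuple

Vector = List[float]


def nearest(codebook: List[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))


def rvq_latent(pre_vq: Vector, codebooks: List[List[Vector]]) -> Tuple[List[int], Vector]:
    """Run residual VQ on pre_vq; return the codes and the sum of the chosen embeddings."""
    residual = list(pre_vq)
    summed = [0.0] * len(pre_vq)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Accumulate the selected embedding and quantize what is left at the next layer.
        summed = [s + e for s, e in zip(summed, codebook[idx])]
        residual = [r - e for r, e in zip(residual, codebook[idx])]
    return codes, summed


random.seed(0)
dim, layers, entries = 4, 3, 16
# Toy codebooks: random values with a smaller scale per layer (real codebooks are learned).
codebooks = [[[random.gauss(0, 0.5 ** layer) for _ in range(dim)] for _ in range(entries)]
             for layer in range(layers)]

pre_vq = [random.gauss(0, 1) for _ in range(dim)]  # option 1: the encoder output before VQ/RVQ
codes, post_vq = rvq_latent(pre_vq, codebooks)     # option 2: summed embeddings after quantization
print(codes, post_vq)
```

Either pre_vq or post_vq is a continuous vector per frame; the second is what the codec's decoder would normally receive, so it may be the safer choice if the decoder is reused unchanged.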

Paulmzr avatar May 23 '25 12:05 Paulmzr

Hi @Paulmzr, got it, thanks. For now, I will use EnCodec. If it does not work well for Indian languages, I will train XCodec 2 on Indian languages.

rose-jinyang avatar May 23 '25 14:05 rose-jinyang

Codecs and, to a great extent, vocoders are usually language-agnostic, so you should be fine either way.

Alternatively, the NVIDIA audio codec that was released a while ago, especially the 44.1 kHz version, also sounds very good and is trained on a large set of languages, so you have that option as well.

Respaired avatar May 23 '25 22:05 Respaired

Hi @Paulmzr, I made a script for extending the tokenizer to multiple Indian languages.

import argparse
import os
import pandas as pd
import json
import shutil

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def combine_tokenizers(old_tokenizer, new_tokenizer, save_dir):
    # Load both vocab files
    vocab1_path = os.path.join(old_tokenizer, 'vocab.json')
    vocab2_path = os.path.join(new_tokenizer, 'vocab.json')
    merges1_path = os.path.join(old_tokenizer, 'merges.txt')
    merges2_path = os.path.join(new_tokenizer, 'merges.txt')

    with open(vocab1_path, 'r', encoding='utf-8') as f1, open(vocab2_path, 'r', encoding='utf-8') as f2:
        vocab1 = json.load(f1)
        vocab2 = json.load(f2)

    # Combine vocabularies without duplication.
    # Existing tokens keep their original IDs; unseen tokens are appended after the highest old ID.
    merged_vocab = dict(vocab1)
    idx = max(vocab1.values()) + 1 if vocab1 else 0
    for word in vocab2:
        if word not in merged_vocab:
            merged_vocab[word] = idx
            idx += 1

    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, 'vocab.json'), 'w', encoding='utf-8') as fp:
        json.dump(merged_vocab, fp, ensure_ascii=False)

    # Merge merges.txt (handle duplicates)
    with open(merges1_path, 'r', encoding='utf-8') as f1, open(merges2_path, 'r', encoding='utf-8') as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()[1:]  # skip header "#version: 0.2"

    all_merges = list(dict.fromkeys(lines1 + lines2))  # deduplicate

    with open(os.path.join(save_dir, 'merges.txt'), 'w', encoding='utf-8') as f_out:
        f_out.writelines(all_merges)


def extend_tokenizer(args):
    root = args.output_path

    # Load original tokenizer
    tokenizer_model = BPE.from_file(
        vocab=os.path.join(root, 'vocab.json'),
        merges=os.path.join(root, 'merges.txt')
    )
    existing_tokenizer = Tokenizer(tokenizer_model)
    existing_tokenizer.pre_tokenizer = Whitespace()

    # Save old tokenizer model
    old_tokenizer_path = os.path.join(root, "old_tokenizer/")
    os.makedirs(old_tokenizer_path, exist_ok=True)
    existing_tokenizer.model.save(old_tokenizer_path)

    # Load training data
    traindf = pd.read_csv(args.metadata_path, sep="|")
    texts = traindf['text'].astype(str).tolist()

    # Train new tokenizer
    new_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    new_tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=[f"[{args.language}]"], vocab_size=args.extended_vocab_size)
    new_tokenizer.train_from_iterator(texts, trainer=trainer)

    new_tokenizer_path = os.path.join(root, "new_tokenizer/")
    os.makedirs(new_tokenizer_path, exist_ok=True)
    new_tokenizer.model.save(new_tokenizer_path)

    # Merge old and new tokenizers
    merged_tokenizer_path = os.path.join(root, "merged_tokenizer/")
    combine_tokenizers(
        old_tokenizer_path,
        new_tokenizer_path,
        merged_tokenizer_path
    )

    # Load merged tokenizer
    tokenizer_model = BPE.from_file(
        vocab=os.path.join(merged_tokenizer_path, 'vocab.json'),
        merges=os.path.join(merged_tokenizer_path, 'merges.txt')
    )
    tokenizer = Tokenizer(tokenizer_model)
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.add_special_tokens([f"[{args.language}]"])

    # Save final tokenizer
    tokenizer.model.save(root)

    # Clean up
    shutil.rmtree(old_tokenizer_path)
    shutil.rmtree(new_tokenizer_path)
    shutil.rmtree(merged_tokenizer_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--output_path", type=str, default='tokenizer_bpe_libriheavy/', help="")
    parser.add_argument("--metadata_path", type=str, default='datasets/metadata_train.csv', help="")
    parser.add_argument("--language", type=str, default='hi', help="")
    parser.add_argument("--extended_vocab_size", default=10000, type=int, help="")

    args = parser.parse_args()

    indian_languages = [
        ('Telugu', 'te'),       # ✅ Whisper supported
        ('Hindi', 'hi'),        # ✅ Whisper supported
        ('Assamese', 'as'),     # ✅ Whisper supported
        ('Bodo', 'brx'),        # ❌ Not supported in Whisper
        ('Gujarati', 'gu'),     # ✅ Whisper supported
        ('Kannada', 'kn'),      # ✅ Whisper supported
        ('Konkani', 'kok'),     # ❌ Not supported in Whisper
        ('Malayalam', 'ml'),    # ✅ Whisper supported
        ('Marathi', 'mr'),      # ✅ Whisper supported
        ('Odia', 'or'),         # ❌ Not officially supported (sometimes included in community models)
        ('Santali', 'sat'),     # ❌ Not supported in Whisper
        ('Sanskrit', 'sa'),     # ✅ Whisper supported ('sa' is in Whisper's language list)
        ('Tamil', 'ta'),        # ✅ Whisper supported
        ('Urdu', 'ur'),         # ✅ Whisper supported
        ('Bengali', 'bn'),      # ✅ Whisper supported
        ('Dogri', 'doi'),       # ❌ Not supported in Whisper
        ('Kashmiri', 'ks'),     # ❌ Not supported in Whisper
        ('Maithili', 'mai'),    # ❌ Not supported in Whisper
        ('Manipuri', 'mni'),    # ❌ Not supported in Whisper
        ('Nepali', 'ne'),       # ✅ Whisper supported
        ('Punjabi', 'pa'),      # ✅ Whisper supported
        ('Sindhi', 'sd'),       # ✅ Whisper supported ('sd' is in Whisper's language list)
    ]

    for full_name, sim_name in indian_languages:
        train_metadata = f'/home/jupyter/Jin/Data/IndicVoices-R/train/{full_name}/metadata.csv'
        if not os.path.exists(train_metadata):
            print(f"[Warning] Skipping {full_name}: metadata not found at {train_metadata}")
            continue

        args.metadata_path = train_metadata
        args.language = sim_name

        print(f"🔧 Extending tokenizer for {full_name} ({sim_name})...")
        extend_tokenizer(args)
        print(f"✅ Done for {full_name} ({sim_name})\n")

Could you check?
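
The vocabulary-merging step of the script can be exercised in isolation. Below is a minimal sketch, assuming the vocabularies are plain token-to-ID dicts like those stored in vocab.json; the helper name merge_vocabs and the sample tokens are hypothetical and not part of SLED-TTS. It keeps existing tokens at their original IDs, which matters if a checkpoint trained with the old tokenizer is reused.

```python
from typing import Dict


def merge_vocabs(old: Dict[str, int], new: Dict[str, int]) -> Dict[str, int]:
    """Keep every old token at its original ID; append unseen new tokens after the highest old ID."""
    merged = dict(old)
    next_id = max(old.values(), default=-1) + 1
    for token in new:
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


# Hypothetical toy vocabularies, for illustration only.
old_vocab = {"a": 0, "b": 1, "ab": 2}
new_vocab = {"[hi]": 0, "a": 1, "क": 2}
merged = merge_vocabs(old_vocab, new_vocab)
print(merged)  # {'a': 0, 'b': 1, 'ab': 2, '[hi]': 3, 'क': 4}
```

A quick check after running the full script would be to confirm that every token of the old vocab.json still maps to the same ID in the merged file.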

rose-jinyang avatar May 27 '25 13:05 rose-jinyang

Hi @Paulmzr How are you? Could you check if the following audio codec (FlowDec) can be used instead of EnCodec? https://openreview.net/pdf?id=uxDFlPGRLX demo: https://sp-uhh.github.io/FlowDec/

rose-jinyang avatar Jun 04 '25 03:06 rose-jinyang

Hi, thank you for this very interesting discussion. Do you recommend any tokenizer for French?

Alekksander66 avatar Jun 09 '25 10:06 Alekksander66