To train a multilingual model for multiple Indian languages
Hello, how are you? Thanks for contributing to this project. Is it possible to train a multilingual model for multiple Indian languages?
@rose-jinyang Hi, thank you for your interest.
You can train a multilingual model as long as you have the right dataset; you only need to modify the following two functions in the training script to make it work with your data:
https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L124 https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L171
One potential issue is that the current code uses EnCodec to extract continuous latent representations. I’m not sure how well EnCodec supports Indian languages. If you want to use a different model to obtain the latent representations, you’ll also need to adjust the feature-extraction section in the model file:
https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/sled/sled.py#L127
Hi @Paulmzr Could you implement training of a multilingual TTS model, i.e. a single model for multiple languages?
Hi @Paulmzr I think that the tokenizer should be extended for Indic languages. How can I do it?
Hi @rose-jinyang The current code already supports training a multilingual model. You only need to organize your multilingual dataset and modify data_path and manifest_path in the training script. https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L93 https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L94
Yes, the tokenizer should also be modified to support Indic languages. You can change the following code: https://github.com/ictnlp/SLED-TTS/blob/b8ed10d9953160efd8a0538b4ea5af80a57c9e96/scripts/train_libriheavy.py#L288 to
tokenizer = AutoTokenizer.from_pretrained("...", padding_side="left", add_eos_token=True)
to use a tokenizer that supports your languages.
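For readers unfamiliar with the two keyword arguments in that call, here is a minimal pure-Python sketch of the effect they are meant to have on a batch: an EOS id is appended to every sequence, and shorter sequences are padded on the left. This is only an illustration, not the transformers implementation; the function name and the token ids (pad_id=0, eos_id=2) are made up.

```python
from typing import List


def pad_left_with_eos(batch: List[List[int]], pad_id: int, eos_id: int) -> List[List[int]]:
    """Append EOS to every sequence, then left-pad to the longest length."""
    with_eos = [ids + [eos_id] for ids in batch]
    max_len = max(len(ids) for ids in with_eos)
    # Padding goes in front, so the real tokens stay right-aligned.
    return [[pad_id] * (max_len - len(ids)) + ids for ids in with_eos]


batch = [[5, 6, 7], [8]]          # hypothetical token ids
padded = pad_left_with_eos(batch, pad_id=0, eos_id=2)
print(padded)  # [[5, 6, 7, 2], [0, 0, 8, 2]]
```

Whether a particular tokenizer honors add_eos_token depends on its class, so after swapping tokenizers it is probably worth encoding one sentence and checking that the ids end with the tokenizer's eos_token_id.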
Thanks for your quick reply. I am going to use the SUTRA tokenizer (https://huggingface.co/TWO/sutra-mlt256-v2), as in the paper (https://arxiv.org/pdf/2411.12240v1). One more question: should each entry in the dataset's manifest file have a "language" key, as in the original Libriheavy dataset?
@rose-jinyang This key is optional; the model doesn’t need to receive a language tag explicitly.
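As a rough sketch of what "organize your multilingual dataset" could look like, the snippet below concatenates per-language manifests into one list and writes it as JSON lines, with the "language" key kept only as optional metadata. The entry layout ("audio" and "text" keys), the file paths, and both function names are assumptions for illustration; the actual manifest format expected by the training script should be checked against the two functions linked earlier in the thread.

```python
import json
from typing import Dict, Iterable, List


def merge_manifests(per_language: Dict[str, Iterable[dict]],
                    keep_language: bool = True) -> List[dict]:
    """Concatenate per-language entries into a single manifest list."""
    merged = []
    for lang, entries in per_language.items():
        for entry in entries:
            item = dict(entry)  # copy so the input entries are not mutated
            if keep_language:
                item["language"] = lang  # optional metadata, not a model input
            merged.append(item)
    return merged


def to_jsonl(entries: Iterable[dict]) -> str:
    """Serialize entries as JSON lines, keeping Indic scripts readable."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)


# Hypothetical entries; real manifests will have more fields.
manifests = {
    "hi": [{"audio": "hi/0001.wav", "text": "नमस्ते"}],
    "ta": [{"audio": "ta/0001.wav", "text": "வணக்கம்"}],
}
merged = merge_manifests(manifests)
print(to_jsonl(merged))
```

Keeping the language tag in the manifest costs nothing and can help later, for example for per-language evaluation or balanced sampling.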
Thank you @Paulmzr
Hi @Paulmzr What about using the following codec rather than EnCodec? https://github.com/jishengpeng/WavTokenizer
@rose-jinyang I haven't tried it. It might work. However, this paper reports that WavTokenizer performs worse than EnCodec (see Table 1).
Hi @Paulmzr If so, what about using XCodec 2 (the codec used in Llasa)? EnCodec is used here to extract continuous latent representations, but I am not sure whether XCodec 2 produces discrete representations like WavTokenizer.
@rose-jinyang You can use xcodec2. In practice, any codec can serve as an extractor of continuous latent representations. You could use the representations produced before the VQ/RVQ layer as latents, or alternatively sum the embeddings after quantization to build your latents. I’m not sure how well xcodec2 supports the Indian languages you need. In their paper (Section 3.1.1) they mention that the languages used to train xcodec2 do not include Indian languages.
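To make the second option concrete, here is a toy sketch of building a continuous latent per frame by summing the codebook embeddings selected by each quantizer of a residual VQ. It uses plain Python lists and made-up two-dimensional codebooks; it does not call EnCodec, xcodec2, or any other real codec API, and the function name is hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def latents_from_codes(codes: Sequence[Sequence[int]],
                       codebooks: Sequence[Sequence[Vector]]) -> List[Vector]:
    """Sum the selected codebook vectors over all quantizers, frame by frame.

    codes[t][q] is the index chosen by quantizer q at frame t.
    codebooks[q][i] is embedding i of quantizer q.
    """
    dim = len(codebooks[0][0])
    latents = []
    for frame in codes:
        total = [0.0] * dim
        for q, index in enumerate(frame):
            for d, value in enumerate(codebooks[q][index]):
                total[d] += value
        latents.append(total)
    return latents


# Toy codebooks: two quantizers, two entries each, dimension 2.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],      # first (coarse) quantizer
    [[0.5, 0.5], [-0.5, 0.25]],    # second (residual) quantizer
]
codes = [[0, 1], [1, 0]]           # two frames
print(latents_from_codes(codes, codebooks))  # [[0.5, 0.25], [0.5, 1.5]]
```

With a real codec the same sum would be taken over its learned codebooks, and the resulting vectors would play the role of the latents that the feature-extraction section in sled.py currently obtains from EnCodec.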
Hi @Paulmzr Got it, thanks. For now, I will use EnCodec. If it does not work well for Indian languages, I will train XCodec 2 on Indian languages.
Codecs, and to a great extent vocoders, are usually language-agnostic, so you should be fine either way.
Alternatively, the NVIDIA audio codec that was released a while ago, especially the 44.1 kHz version, also sounds very good and was trained on a large set of languages, so you have that option as well.
Hi @Paulmzr I made a script for extending tokenizer for multiple Indian languages.
import argparse
import os
import pandas as pd
import json
import shutil
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def combine_tokenizers(old_tokenizer, new_tokenizer, save_dir):
    # Load both vocab files
    vocab1_path = os.path.join(old_tokenizer, 'vocab.json')
    vocab2_path = os.path.join(new_tokenizer, 'vocab.json')
    merges1_path = os.path.join(old_tokenizer, 'merges.txt')
    merges2_path = os.path.join(new_tokenizer, 'merges.txt')

    with open(vocab1_path, 'r', encoding='utf-8') as f1, open(vocab2_path, 'r', encoding='utf-8') as f2:
        vocab1 = json.load(f1)
        vocab2 = json.load(f2)

    # Combine vocabularies without duplication
    merged_vocab = {}
    idx = 0
    for word in vocab1:
        if word not in merged_vocab:
            merged_vocab[word] = idx
            idx += 1
    for word in vocab2:
        if word not in merged_vocab:
            merged_vocab[word] = idx
            idx += 1

    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, 'vocab.json'), 'w', encoding='utf-8') as fp:
        json.dump(merged_vocab, fp, ensure_ascii=False)

    # Merge merges.txt (handle duplicates)
    with open(merges1_path, 'r', encoding='utf-8') as f1, open(merges2_path, 'r', encoding='utf-8') as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()[1:]  # skip header "#version: 0.2"
    all_merges = list(dict.fromkeys(lines1 + lines2))  # deduplicate
    with open(os.path.join(save_dir, 'merges.txt'), 'w', encoding='utf-8') as f_out:
        f_out.writelines(all_merges)


def extend_tokenizer(args):
    root = args.output_path

    # Load original tokenizer
    tokenizer_model = BPE.from_file(
        vocab=os.path.join(root, 'vocab.json'),
        merges=os.path.join(root, 'merges.txt')
    )
    existing_tokenizer = Tokenizer(tokenizer_model)
    existing_tokenizer.pre_tokenizer = Whitespace()

    # Save old tokenizer model
    old_tokenizer_path = os.path.join(root, "old_tokenizer/")
    os.makedirs(old_tokenizer_path, exist_ok=True)
    existing_tokenizer.model.save(old_tokenizer_path)

    # Load training data
    traindf = pd.read_csv(args.metadata_path, sep="|")
    texts = traindf['text'].astype(str).tolist()

    # Train new tokenizer
    new_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    new_tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=[f"[{args.language}]"], vocab_size=args.extended_vocab_size)
    new_tokenizer.train_from_iterator(texts, trainer=trainer)
    new_tokenizer_path = os.path.join(root, "new_tokenizer/")
    os.makedirs(new_tokenizer_path, exist_ok=True)
    new_tokenizer.model.save(new_tokenizer_path)

    # Merge old and new tokenizers
    merged_tokenizer_path = os.path.join(root, "merged_tokenizer/")
    combine_tokenizers(
        old_tokenizer_path,
        new_tokenizer_path,
        merged_tokenizer_path
    )

    # Load merged tokenizer
    tokenizer_model = BPE.from_file(
        vocab=os.path.join(merged_tokenizer_path, 'vocab.json'),
        merges=os.path.join(merged_tokenizer_path, 'merges.txt')
    )
    tokenizer = Tokenizer(tokenizer_model)
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.add_special_tokens([f"[{args.language}]"])

    # Save final tokenizer
    tokenizer.model.save(root)

    # Clean up
    shutil.rmtree(old_tokenizer_path)
    shutil.rmtree(new_tokenizer_path)
    shutil.rmtree(merged_tokenizer_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_path", type=str, default='tokenizer_bpe_libriheavy/', help="")
    parser.add_argument("--metadata_path", type=str, default='datasets/metadata_train.csv', help="")
    parser.add_argument("--language", type=str, default='hi', help="")
    parser.add_argument("--extended_vocab_size", default=10000, type=int, help="")
    args = parser.parse_args()

    indian_languages = [
        ('Telugu', 'te'),  # ✅ Whisper supported
        ('Hindi', 'hi'),  # ✅ Whisper supported
        ('Assamese', 'as'),  # ✅ Whisper supported
        ('Bodo', 'brx'),  # ❌ Not supported in Whisper
        ('Gujarati', 'gu'),  # ✅ Whisper supported
        ('Kannada', 'kn'),  # ✅ Whisper supported
        ('Konkani', 'kok'),  # ❌ Not supported in Whisper
        ('Malayalam', 'ml'),  # ✅ Whisper supported
        ('Marathi', 'mr'),  # ✅ Whisper supported
        ('Odia', 'or'),  # ❌ Not officially supported (sometimes included in community models)
        ('Santali', 'sat'),  # ❌ Not supported in Whisper
        ('Sanskrit', 'sa'),  # ❌ Not officially supported (may work in multilingual finetunes)
        ('Tamil', 'ta'),  # ✅ Whisper supported
        ('Urdu', 'ur'),  # ✅ Whisper supported
        ('Bengali', 'bn'),  # ✅ Whisper supported
        ('Dogri', 'doi'),  # ❌ Not supported in Whisper
        ('Kashmiri', 'ks'),  # ❌ Not supported in Whisper
        ('Maithili', 'mai'),  # ❌ Not supported in Whisper
        ('Manipuri', 'mni'),  # ❌ Not supported in Whisper
        ('Nepali', 'ne'),  # ✅ Whisper supported
        ('Punjabi', 'pa'),  # ✅ Whisper supported
        ('Sindhi', 'sd'),  # ❌ Not supported in Whisper
    ]

    for full_name, sim_name in indian_languages:
        train_metadata = f'/home/jupyter/Jin/Data/IndicVoices-R/train/{full_name}/metadata.csv'
        if not os.path.exists(train_metadata):
            print(f"[Warning] Skipping {full_name}: metadata not found at {train_metadata}")
            continue
        args.metadata_path = train_metadata
        args.language = sim_name
        print(f"🔧 Extending tokenizer for {full_name} ({sim_name})...")
        extend_tokenizer(args)
        print(f"✅ Done for {full_name} ({sim_name})\n")
Could you check?
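One detail in the script that may be worth double-checking: the merge loop assigns ids by the key order of vocab.json, which reproduces the original ids only if that file is written in id order. A minimal sketch of a merge that keeps existing ids explicitly and appends new tokens after the largest old id is shown below. The function name and the toy vocabularies are hypothetical, and this covers only vocab.json, not merges.txt.

```python
from typing import Dict


def merge_vocabs(old_vocab: Dict[str, int], new_vocab: Dict[str, int]) -> Dict[str, int]:
    """Keep every old token id unchanged; append unseen tokens with fresh ids."""
    merged = dict(old_vocab)
    next_id = max(old_vocab.values(), default=-1) + 1
    # Walk the new vocabulary in id order so the result is deterministic.
    for token, _ in sorted(new_vocab.items(), key=lambda kv: kv[1]):
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


# Toy vocabularies; note the old one is deliberately not in id order.
old = {"b": 1, "a": 0, "ab": 2}
new = {"[hi]": 0, "a": 1, "क": 2}
merged = merge_vocabs(old, new)
print(merged)

# Sanity checks: old ids are untouched and ids stay contiguous.
assert all(merged[token] == idx for token, idx in old.items())
assert sorted(merged.values()) == list(range(len(merged)))
```

The same two assertions can be run on the real vocab.json files before and after each language is added, which should catch accidental id shifts early.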
Hi @Paulmzr How are you? Could you check if the following audio codec (FlowDec) can be used instead of EnCodec? https://openreview.net/pdf?id=uxDFlPGRLX demo: https://sp-uhh.github.io/FlowDec/
Hi, thank you for this very interesting discussion. Do you recommend any tokenizer for French?