
Poor WER when trying to fine-tune Parakeet v2 TDT on a dataset other than English

Open deepanshu-yadav opened this issue 6 months ago • 50 comments

Hi Everyone,

I am trying to fine-tune Parakeet v2 TDT on the GramVani dataset (link). Here is the configuration I am using: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/hindi_config.yaml

The training script is available here: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/finetune.py It is the usual fine-tuning script.

The full code is available here.

Here are some of the logs from training; I trained for only 3 epochs.

  | Name              | Type                              | Params | Mode 
--------------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0      | train
1 | encoder           | ConformerEncoder                  | 608 M  | eval 
2 | spec_augmentation | SpectrogramAugmentation           | 0      | train
3 | wer               | WER                               | 0      | train
4 | joint             | RNNTJoint                         | 1.7 M  | train
5 | decoder           | RNNTDecoder                       | 7.2 M  | train
6 | loss              | RNNTLoss                          | 0      | train
7 | spec_augment      | SpectrogramAugmentation           | 0      | train
--------------------------------------------------------------------------------
9.0 M     Trainable params
608 M     Non-trainable params
617 M     Total params
2,471.304 Total estimated model params size (MB)
46        Modules in train mode
662       Modules in eval mode
Epoch 0:   0%|                                         | 0/9200 [00:00<?, ?it/s][NeMo I 2025-06-03 13:03:28 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.models.rnnt_bpe_models.EncDecRNNTBPEModel'>.decoding.decoding
[NeMo I 2025-06-03 13:03:28 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>wer.decoding.decoding
[NeMo W 2025-06-03 13:03:30 nemo_logging:405] Provided RNNT Joint tensor is of dtype torch.float16, but RNNT loss could not be calculated in fp16 due to following reason stated below. Loss will be calculated in fp32. 

NeMo I 2025-06-03 13:15:33 nemo_logging:393] reference:के चक्कर में सूती सारी ले होती है इसलिए कोयला की सुविधा हम झारखण्ड सरकार ऐसी कहेंगे की कोयला की सुविधा बढ़ाने लिए
[NeMo I 2025-06-03 13:15:33 nemo_logging:393] predicted:बारिश पू ग्राम के ब्यंग हुई ने खास आदाब आपकेेशनहचनालहह्ग मेंबी ख़ ख़ालतहगन निकाल निकालग निकाल निकाल्सलस है का का का का का कासcompधसletगगग माम सं कि ऐसी M का ख़ंह सकती होते्य सं हैं है के है का है
Epoch 0:  43%|▍| 3999/9200 [24:06<31:21,  2.76it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:27:35 nemo_logging:393] 
    
[NeMo I 2025-06-03 13:27:35 nemo_logging:393] reference:गहरे पानी के अलावा ब्लीचिंग पाउडर का छिडकाव करना सफाई करना गहरे पानी पे कोई नहीं जाए इसलिए नागरिकों की रक्षा करना
[NeMo I 2025-06-03 13:27:35 nemo_logging:393] predicted:है वाणीते है को है की के का में है की में दो के को है को है की को वाणी के है न हहम है को है
Epoch 0:  65%|▋| 5999/9200 [36:05<19:15,  2.77it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:39:34 nemo_logging:393] 
    
[NeMo I 2025-06-03 13:39:34 nemo_logging:393] reference:तो चलिए सुनते है नया कार्यक्रम
[NeMo I 2025-06-03 13:39:34 nemo_logging:393] predicted:नमस्कार के लिए के लिए रही केेे के में कोजस के लिए को को की को की को को की को और को
Epoch 0:  87%|▊| 7999/9200 [48:07<07:13,  2.77it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:51:36 nemo_logging:393] 
    
[NeMo I 2025-06-03 13:51:36 nemo_logging:393] reference:ज़बरन शादी करा दी जा रही है बच्चों के अधिसूचित अधिकारों पे काम करने वाली अंतराष्ट्रीय
[NeMo I 2025-06-03 13:51:36 nemo_logging:393] predicted:नमस्कार मैं को और को की को की को है को को को को को और के लिए
Epoch 0: 100%|█| 9200/9200 [55:21<00:00,  2.77it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Enabled CUDA graphs for module <class 'nemo.collections.asr.models.rnnt_bpe_models.EncDecRNNTBPEModel'>.decoding.decoding
[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Enabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>wer.decoding.decoding
Epoch 1:   0%| | 0/9200 [00:00<?, ?it/s, v_num=2-55, train_step_timing in s=0.43[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.models.rnnt_bpe_models.EncDecRNNTBPEModel'>.decoding.decoding
[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>wer.decoding.decoding
Epoch 1:   9%| | 799/9200 [04:51<51:07,  2.74it/s, v_num=2-55, train_step_timing[NeMo I 2025-06-03 14:03:42 nemo_logging:393] 
    
[NeMo I 2025-06-03 14:03:42 nemo_logging:393] reference:आप व अपनी राय या प्रतिक्रिया दे सकते हैं नों तीन दबा का हमें आपकी प्रतिक्रिया का इंतेज़ार रहेगा
[NeMo I 2025-06-03 14:03:42 nemo_logging:393] predicted:नमस्कार आदाब को और को और को और को को को
Epoch 1:  30%|▎| 2799/9200 [16:57<38:47,  2.75it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 14:15:47 nemo_logging:393] 
    

Here is my WER plot for the training batches only.

[Image: training WER plot]

Here is my training loss

[Image: training loss plot]

As we can see, there are three problems:

  1. WER is very poor.
  2. One epoch takes about 56 minutes on a P100 GPU with 16 GB VRAM, even though only about 9 million parameters are being trained with the encoder frozen.
  3. Memory usage is around 11 GB with just a batch size of 4.

Problem 1: High WER during training itself

We haven't even evaluated validation WER yet, and WER is already high on the training data.

I suspected that the BPE encoding scheme might not be applied correctly, so I tested a sample sentence.

import sentencepiece as spm
vocab_file = 'tokenizer_output/vocab.txt'
model_prefix = 'tokenizer_output/tokenizer'
sp = spm.SentencePieceProcessor()
sp.load(f'{model_prefix}.model')

test_text = "नमस्कार मैं दीपक कुमार सिंह"
encoded = sp.encode_as_pieces(test_text)
print(f"\nTest encoding:")
print(f"Original: {test_text}...")
print(f"Encoded: {encoded}...")

I got

Test encoding:
Original: नमस्कार मैं दीपक कुमार सिंह...
Encoded: ['▁नमस्कार', '▁मैं', '▁दी', 'प', 'क', '▁कुमार', '▁सिंह']...

So I think it is working.

The BPE tokenizer training code I am using is this:

import sentencepiece as spm
import json
import os
from glob import glob

# Create output directory
os.makedirs('tokenizer_output', exist_ok=True)

# Extract texts from manifest
texts = []
with open('train_manifest.json', 'r', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line.strip())
        if 'text' in data and data['text'].strip():
            texts.append(data['text'])

print(f"Found {len(texts)} texts for training")

# Save texts to document.txt (raw corpus)
document_file = 'tokenizer_output/document.txt'
with open(document_file, 'w', encoding='utf-8') as f:
    for text in texts:
        f.write(text + '\n')
print(f"Saved raw text corpus to {document_file}")

# Train SentencePiece model
model_prefix = 'tokenizer_output/tokenizer'
spm.SentencePieceTrainer.train(
    input=document_file,  # Now using document.txt directly
    model_prefix=model_prefix,
    vocab_size=1024,
    model_type='bpe',
    character_coverage=0.9995,
    normalization_rule_name='identity',
    remove_extra_whitespaces=False,
    max_sentence_length=4192,
    shuffle_input_sentence=True
)

print(f"Tokenizer saved as {model_prefix}.model and {model_prefix}.vocab")

# Create human-readable vocab.txt
vocab_file = 'tokenizer_output/vocab.txt'
sp = spm.SentencePieceProcessor()
sp.load(f'{model_prefix}.model')

with open(vocab_file, 'w', encoding='utf-8') as f:
    for i in range(sp.get_piece_size()):
        piece = sp.id_to_piece(i)
        f.write(f"{piece}\n")
print(f"Saved human-readable vocabulary to {vocab_file}")

It uses the entire corpus from the training set, so out-of-vocabulary words should not be an issue.
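
A quick way to sanity-check that claim is to measure the <unk> rate over the same corpus. A minimal sketch, reusing sp and texts from the scripts above:

# Rough <unk>-rate check: even with character_coverage=0.9995, a few rare
# characters can still map to the unknown piece.
unk_id = sp.unk_id()
total_pieces, unk_pieces = 0, 0
for text in texts:
    ids = sp.encode_as_ids(text)
    total_pieces += len(ids)
    unk_pieces += sum(1 for i in ids if i == unk_id)
print(f"<unk> rate: {unk_pieces / max(total_pieces, 1):.4%}")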

  1. The next thing that comes to mind is unfreezing the encoder; maybe that could improve WER (see the sketch after this list).
  2. Increase the number of epochs, say to at least 100.
  3. Increase the batch size from 4 to 16 as in the original config (if memory allows).
  4. Change the augmentation parameters during training.
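
For point 1, a minimal sketch of unfreezing the encoder, assuming asr_model is the loaded EncDecRNNTBPEModel from the fine-tuning script:

# Make the encoder trainable again (the current config keeps it frozen).
for param in asr_model.encoder.parameters():
    param.requires_grad = True
asr_model.encoder.train()  # so dropout/normalization layers behave as in training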

But the more important question is whether the model will work for languages other than English when only around 100 hours of data are available.

Problem 2: Slow training speed

One epoch takes around 56 minutes on a P100 GPU with 16 GB VRAM. Considering there are only about 9 million trainable parameters, this seems slow to me.
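
One thing I could try, assuming the standard Lightning trainer that NeMo sets up, is a short run with Trainer(profiler="simple") to see how much of each step is spent in the dataloader versus the forward/backward pass.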

Problem 3: High memory usage

By my estimate, a 16 kHz sampling rate, an average duration of 15 seconds, a batch size of 16, and 4 bytes per amplitude value give around 15 MB of raw audio per batch. But memory usage goes beyond 16 GB, which forced me to use a batch size of 4. Any clue why this happens? Also, is there any tool that gives me GPU memory profiles alongside the training logs?

deepanshu-yadav avatar Jun 04 '25 14:06 deepanshu-yadav

@deepanshu-yadav Hi~ Based on my previous fine-tuning experience, here are my responses to your points:

  1. Training a 0.6B model on a 16GB GPU is quite challenging. NeMo typically uses 80GB GPUs for training. Additionally, the computational power of a P100 might not be sufficient, so the training time can be quite long.

  2. When training on a new language with a modified vocabulary, it's necessary to retrain the decoder. Ideally, the encoder should also be unfrozen. In the early epochs, it's common for the model to not output any characters. It usually starts producing output after around 3 epochs, depending on how much data you have per epoch. You’ll likely need at least 50 epochs of training.

  3. 100 hours of data is generally insufficient. You typically need at least 1000 hours of data to reach around 20% WER.

  4. Memory occupancy mainly depends on the forward pass, backward pass, and the optimizer states. It’s not determined by the storage size of the audio files themselves.
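
As a rough way to track this, here is a minimal sketch of a Lightning callback that logs peak CUDA memory per training step (assuming the Lightning-based trainer NeMo sets up; Lightning's built-in DeviceStatsMonitor callback is another option):

import torch
import pytorch_lightning as pl

class PeakGPUMemoryLogger(pl.Callback):
    """Log the peak CUDA memory observed during each training step."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if torch.cuda.is_available():
            peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
            pl_module.log("peak_gpu_mem_gb", peak_gb, prog_bar=True)
            torch.cuda.reset_peak_memory_stats()

# e.g. trainer = pl.Trainer(..., callbacks=[PeakGPUMemoryLogger()])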

jeremy110 avatar Jun 05 '25 01:06 jeremy110

Thanks, this was very insightful. I am arranging a more stable way to use a GPU and will let you know what I find out.

deepanshu-yadav avatar Jun 09 '25 13:06 deepanshu-yadav

@deepanshu-yadav I would recommend at least a 3090 or 4090, which can train a 110M model, but you'll need at least 1,000 hours of data.

jeremy110 avatar Jun 09 '25 14:06 jeremy110

@deepanshu-yadav any progress?

I have followed the same training script and unfroze the encoder.

After 350 epochs, the results are still not good. I think, as jeremy mentioned, it needs more data.

reference:प्रखंड शिक्षा पद अधिकारी मुकलेश्वर शर्मा ने भी स्वच्छता अभियान मई भागीदारी सुनिश्चित करने की बात किया प्रखंड दर्जनों गुरूजी स्वच्छ   ता के लिए मिसाल कायम कर रहे
predicted:प्रखंड शिक्षा पदाधिकारी ने शादी का अभिकों की सुनवाई करने की हैं प्रखंड केजनुर स्वच्छ मालायें 


reference:समधी समध दो हजार एक ऐसी जानकारी उपनिदेशक लोग उठाते नहीं हैं वास्तव में विश्वास के सामने अपनी संख्या पैतालीस अठारह दिनांक छबी स                                                                                                                                       
predicted:फण्ड दो हजार एक ही जानकारी का उपाय और लोग सूखे हैं यह में विश्वास अपने संख्या पैतालीस के अठारह उन्नीस


reference:योग्य उमीदवारों ऐसी आवेदन पात्र मांगे है या संविधान के आधार आरोप होगी इसके लिए विभाग ने नोटिस जारी कर दिया है            
predicted:्यिवार ऐसी पत्ता मांगेDया संविदा आधार आरोप होगी इसके लिए विभाग निसरी जारी कर दिया हैoहसो


reference:को मेडिकल कचड़े के डिब्बे में झोक दिया जा रहा है जो न सिर्फ मरीज़ को संक्रमण का शिकार बना सकता है बल्कि ये व्यवस्था             
predicted:की मेडिकलों झो जा रहा जो जो सिर्फ संक्रमण के बना सकता है बनी व्यवस्था है गरीबी की

BakingBrains avatar Jun 16 '25 05:06 BakingBrains

@BakingBrains Hi~ Did you also train with a small amount of data?

Here's a method I personally find quite effective: using AdamW8bit. If your machine supports it, it can reduce GPU memory usage, which in turn allows you to increase the batch size.

You’ll need to replace torch.nn.Embedding with bnb.nn.StableEmbedding in rnnt.py, and register adamw8bit in your training script.

[Screenshot: replacing torch.nn.Embedding with bnb.nn.StableEmbedding in rnnt.py]
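
The swap works because bnb.nn.StableEmbedding is a drop-in replacement for torch.nn.Embedding. A small sketch with hypothetical sizes:

import torch
import bitsandbytes as bnb

vocab_size, pred_hidden = 1030, 640  # hypothetical sizes, just for illustration
ref = torch.nn.Embedding(vocab_size, pred_hidden, padding_idx=vocab_size - 1)
emb = bnb.nn.StableEmbedding(vocab_size, pred_hidden, padding_idx=vocab_size - 1)

tokens = torch.randint(0, vocab_size, (4, 10))
assert emb(tokens).shape == ref(tokens).shape  # same call signature and output shape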

import bitsandbytes as bnb
from dataclasses import dataclass
from typing import Optional, Tuple

from omegaconf import MISSING

# the import path below may differ slightly depending on your NeMo version
from nemo.core.optim.optimizers import register_optimizer

@dataclass
class OptimizerParams:
    """
    Base Optimizer params with no values. User can chose it to explicitly override via
    command line arguments
    """

    lr: Optional[float] = MISSING

@dataclass
class AdamW8bitParams(OptimizerParams):
    """
    Default configuration for AdamW optimizer.
    It is not derived from Config as it is not a NeMo object (and in particular it doesn't need a name).

    ..note:
        For the details on the function/meanings of the arguments, please refer to:
        https://pytorch.org/docs/stable/optim.html#torch.optim.AdamW
    """

    betas: Tuple[float, float] = (0.9, 0.999)
    eps: float = 1e-08
    weight_decay: float = 0
    amsgrad: bool = False

register_optimizer('adamw8bit', bnb.optim.AdamW8bit, AdamW8bitParams())
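
After registering it, point the optimizer name in your training config (the model.optim.name field in a typical NeMo config) to adamw8bit so the registered class gets picked up; the other optim fields (lr, betas, weight_decay) stay as for regular AdamW.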

jeremy110 avatar Jun 16 '25 05:06 jeremy110

Hello @jeremy110

Yeah, it is about 120 hours of data.

BakingBrains avatar Jun 16 '25 06:06 BakingBrains

@BakingBrains You could consider increasing the dataset to around 1,000 hours. In my own tests, the WER drops to around 18–20% at that point.

jeremy110 avatar Jun 16 '25 06:06 jeremy110

Thank you for the suggestion @jeremy110. I will try that.

BakingBrains avatar Jun 16 '25 06:06 BakingBrains

@deepanshu-yadav, @BakingBrains, hi. In my case, the problem occurred due to the dimension handling of the RNNT loss function, as described in the issue below. I modified the RNNT loss and confirmed that the WER decreased with a 300-hour subset of the data.

https://github.com/NVIDIA/NeMo/issues/14140

leehyun22 avatar Jul 08 '25 00:07 leehyun22

Hello @jeremy110, I am trying to train the 0.6B (en) model with 900 hours of training data (English + Hindi). Is it recommended to train the encoder as well, or is the decoder enough?

Amarnath1906 avatar Jul 16 '25 14:07 Amarnath1906

@Amarnath1906 Hi~ If you're like me and using a 4090 with only 24GB of memory, you can try my approach. If you have an 80GB GPU, I recommend training the encoder as well.

The following are my experiments with parakeet-rnnt-0.6b and parakeet-tdt_ctc-110m. I haven't tested them on parakeet-tdt-0.6b-v2, but the approach should be basically the same.

  1. Use adamw8bit to reduce memory usage (details omitted here).

  2. Merge the original model's tokenizer.model with the new language's .model (see the code below for reference). This is mainly to retain the model's original English capabilities.

from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm

def merge_tokenizer_models_with_vocab(model_path1, model_path2, output_model_path, vocab_output_path):
    # Load the two tokenizer models
    sp1 = spm.SentencePieceProcessor(model_file=model_path1)
    sp2 = spm.SentencePieceProcessor(model_file=model_path2)

    # Parse the model protos
    model_proto1 = sp_pb2_model.ModelProto()
    model_proto2 = sp_pb2_model.ModelProto()
    model_proto1.ParseFromString(sp1.serialized_model_proto())
    model_proto2.ParseFromString(sp2.serialized_model_proto())

    # Merge the vocabularies
    new_model_proto = sp_pb2_model.ModelProto()
    vocab_set = set()
    vocab_list = []

    def add_pieces_to_model_proto(source_proto, target_proto, vocab_set, vocab_list, zh=False):

        for idx, piece in enumerate(source_proto.pieces):
            if zh and idx == 0:
                continue
            if zh == False and idx !=0:
                p1 = piece.piece#.upper()
            else:
                p1 = piece.piece
            
            if zh:
                p_score = piece.score - 1023.0
            else:
                p_score = piece.score
            if p1 not in vocab_set:
                vocab_set.add(p1)
                vocab_list.append((p1, p_score))
                new_piece = target_proto.pieces.add()
                new_piece.piece = p1
                new_piece.score = p_score
                if idx == 0:
                    new_piece.type = sp_pb2_model.ModelProto.SentencePiece.Type.UNKNOWN

    # Add the first model's vocabulary
    add_pieces_to_model_proto(model_proto1, new_model_proto, vocab_set, vocab_list)
    # Add the second model's vocabulary
    add_pieces_to_model_proto(model_proto2, new_model_proto, vocab_set, vocab_list, True)

    # Set the remaining parameters, e.g. how unigram/BPE is handled (here merged from the second model's specs)
    print(model_proto1.trainer_spec, model_proto2.trainer_spec)
    new_model_proto.trainer_spec.MergeFrom(model_proto2.trainer_spec)
    new_model_proto.normalizer_spec.MergeFrom(model_proto2.normalizer_spec)

    # Save the merged model
    with open(output_model_path, 'wb') as f:
        f.write(new_model_proto.SerializeToString())
    print(f"New merged tokenizer saved to {output_model_path}")

    # Save the vocabulary file
    with open(vocab_output_path, 'w', encoding='utf-8') as f:
        for piece, score in vocab_list:
            f.write(f"{piece}\t{score}\n")
    print(f"Vocabulary saved to {vocab_output_path}")

# Example usage
en_path = 'f644e5ef786442deb7c1726c7db0d44f_tokenizer.model'
zh_path = 'tokenizer.model'
merge_tokenizer_models_with_vocab(
    en_path, 
    zh_path, 
    r".\tokenizer2.model", 
    r".\tokenizer2.vocab"
)

sp = spm.SentencePieceProcessor()
sp.load(r".\tokenizer2.model")
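
A quick sanity check on the merged tokenizer (with hypothetical test strings) is to confirm that pieces from both source models are still reachable:

# The merged size should equal the old vocab plus the newly added pieces,
# and both languages should tokenize without falling back to <unk>.
print(sp.get_piece_size())
print(sp.encode_as_pieces("hello world"))  # should use the original English pieces
print(sp.encode_as_pieces("你好 世界"))  # should use the newly merged pieces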

Then initialize the weights from the original model in your training script:

# 1. Keep the original weights (this example uses the hybrid RNNT+CTC architecture)
ori_decoder_prediction_embed = asr_model.decoder.prediction.embed
ori_decoder_prediction_dec_rnn = asr_model.decoder.prediction.dec_rnn
ori_joint_pred = asr_model.joint.pred
ori_joint_enc = asr_model.joint.enc
ori_joint_joint_net_Linear = asr_model.joint.joint_net[2] # Linear
ori_ctc_decoder_decoder_layers_Conv1d = asr_model.ctc_decoder.decoder_layers[0] # Conv1d
prev_vocab_size = asr_model.tokenizer.vocab_size

# 2. Change the vocabulary
asr_model.change_vocabulary(
    new_tokenizer_dir = "merged_nemo1024_zh1000_110M", 
    new_tokenizer_type = "bbpe",
    new_decoder_config = cfg.model.decoder,
)
print(asr_model)
# Print model parameters
for name, param in asr_model.named_parameters():
    print(f"Layer: {name} | Shape: {param.shape}")

cur_vocab_size = asr_model.tokenizer.vocab_size
# 3. Point the weights back to the original ones
if asr_model.tokenizer.vocab_size != prev_vocab_size:

    # Initialize the newly added weights
    with torch.no_grad():
        # 3.1 Decoder
        # torch.nn.init.xavier_uniform_(asr_model.decoder.prediction.embed.weight[1025: ])
        asr_model.decoder.prediction.embed.weight[: 1024] = ori_decoder_prediction_embed.weight[: 1024]
        asr_model.decoder.prediction.embed.weight[-1] = ori_decoder_prediction_embed.weight[-1]
        
        asr_model.decoder.prediction.dec_rnn = ori_decoder_prediction_dec_rnn

        # 3.2 Joint
        asr_model.joint.pred = ori_joint_pred 
        asr_model.joint.enc = ori_joint_enc 

        # the last 5 duration tokens + 1 padding at the end
        asr_model.joint.joint_net[2].weight[: 1024] = ori_joint_joint_net_Linear.weight[: 1024]
        asr_model.joint.joint_net[2].bias[: 1024] = ori_joint_joint_net_Linear.bias[: 1024]
        asr_model.joint.joint_net[2].weight[-6:] = ori_joint_joint_net_Linear.weight[-6:]
        asr_model.joint.joint_net[2].bias[-6:] = ori_joint_joint_net_Linear.bias[-6:]

        # 3.3 CTC decoder
        asr_model.ctc_decoder.decoder_layers[0].weight[:1024] = ori_ctc_decoder_decoder_layers_Conv1d.weight[:1024]
        asr_model.ctc_decoder.decoder_layers[0].weight[-1] = ori_ctc_decoder_decoder_layers_Conv1d.weight[-1]
        asr_model.ctc_decoder.decoder_layers[0].bias[:1024] = ori_ctc_decoder_decoder_layers_Conv1d.bias[:1024]
        asr_model.ctc_decoder.decoder_layers[0].bias[-1] = ori_ctc_decoder_decoder_layers_Conv1d.bias[-1]

del ori_decoder_prediction_embed, ori_decoder_prediction_dec_rnn, ori_joint_pred, ori_joint_enc, ori_joint_joint_net_Linear, ori_ctc_decoder_decoder_layers_Conv1d
  3. For the 0.6B model, you can freeze half of the encoder parameters. Training only half yields results comparable to full training.
for i, layer in enumerate(asr_model.encoder.layers):
    if i >= 18 or ( i >= 6 and i < 12) :  
        for param in layer.parameters():
            param.requires_grad = True
    else:  # Freeze the remaining layers
        for param in layer.parameters():
            param.requires_grad = False
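
A quick way to confirm the freezing took effect is to count the trainable parameters afterwards:

trainable = sum(p.numel() for p in asr_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in asr_model.parameters())
print(f"Trainable params: {trainable / 1e6:.1f}M / {total / 1e6:.1f}M")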

If you find it troublesome, you can simply skip the first two points and just do the third one—it's less likely to go wrong. Hope this helps!

jeremy110 avatar Jul 17 '25 01:07 jeremy110

Thank you. I will use the third approach, as I am using code-switching data and can use an aggregate tokenizer. The third point is really helpful.

Amarnath1906 avatar Jul 17 '25 07:07 Amarnath1906

Hello @jeremy110 , can you please share the training script if possible?

BakingBrains avatar Jul 17 '25 14:07 BakingBrains

@BakingBrains Hi~ Basically, you can also use the scripts provided by NeMo, but at the time I wanted to keep things simple, so I wrote my own version. If you're using version 2.4, you might need to make some adjustments when loading the model. Also, here's my script for the 110M model. The parameters in the YAML file are for reference. At the time, I trained on roughly 500 hours of data mainly for testing purposes. You can adjust them as needed.

ft_110M_en_zh.zip

jeremy110 avatar Jul 17 '25 15:07 jeremy110

@jeremy110 Thanks a lot.

BakingBrains avatar Jul 17 '25 15:07 BakingBrains

Hi @leehyun22, I have a similar problem and I also think it is because of the RNNT loss function. Could you please explain how you modified the RNNT loss? Thanks a lot!

siyingchenclaire avatar Aug 04 '25 16:08 siyingchenclaire

@siyingchenclaire, as you can see in the issue below, there's a part where the dimension handling of the RNNT loss function is modified. https://github.com/NVIDIA-NeMo/NeMo/issues/14140

leehyun22 avatar Sep 03 '25 01:09 leehyun22

Hi @jeremy110, regarding the weight-initialization step from the snippet you shared above:

How do we know which layers should keep the original weights, in the case where the model structure is different from yours?

Chonlasitsk avatar Sep 19 '25 11:09 Chonlasitsk

@Chonlasitsk Hi~ Which model are you using? This method only works if you use the same model and only change the vocabulary. So, if it’s 0.6B, you need to initialize with the original 0.6B model; if it’s 110M, then use the 110M initialization. What I did at the beginning was print out the original model along with the name and dimensions of each layer. Changing the vocabulary will modify the dimensions of nn.Embedding and its output, which is a bit more complicated to handle, but for the rest of the parameters, you can just point them back directly.

jeremy110 avatar Sep 19 '25 12:09 jeremy110

@jeremy110 Thanks for the response 🙏🏻 I am using parakeet-tdt-0.6b-v3. My goal is to fine-tune this model on Thai, a language it has never been trained on before, and right now I'm using the tokenizer-merging method you suggested above to merge the parakeet-v3 tokenizer with a Thai tokenizer. However, I'm not sure about initializing the weights in each layer, specifically which layers should keep the original model's weights. If I had to guess, it would probably be the layers related to the vocabulary size, right?

Chonlasitsk avatar Sep 19 '25 13:09 Chonlasitsk

@Chonlasitsk hi~ If you only want to train Thai, then you just need to change to the Thai vocabulary, and you can train directly. If you want to keep English or other languages as well, then you’ll need to use the following approach, and during training you’ll need around 500–1000 hours of data to prevent the model from forgetting what it has already learned.

I’m not sure about the vocabulary size in v3; let’s assume it’s 4096. In that case, you need to change the initialization code from 1024 to 4096. If it doesn’t include the CTC part, you can comment it out, and just keep the other joint parts.

Looking forward to your training results.

jeremy110 avatar Sep 19 '25 13:09 jeremy110

@jeremy110 I tried following your approach, and before fine-tuning the model I tested it on English transcription, which should normally work correctly, but the results turned out to be completely random every single time.

Here is the code:

asr_model = nemo_asr.models.ASRModel.from_pretrained(args.model_name)

prev_vocab_size = asr_model.tokenizer.vocab_size

ori_decoder_prediction_embed = asr_model.decoder.prediction.embed
ori_decoder_prediction_dec_rnn = asr_model.decoder.prediction.dec_rnn
ori_joint_pred = asr_model.joint.pred
ori_joint_enc = asr_model.joint.enc
ori_joint_joint_net_Linear = asr_model.joint.joint_net[2] # Linear
# ori_ctc_decoder_decoder_layers_Conv1d = asr_model.ctc_decoder.decoder_layers[0] # Conv1d
asr_model.change_vocabulary(
      new_tokenizer_dir = "merged_nemo_tdt_v3", 
      new_tokenizer_type = "bpe",
  )

cur_vocab_size = asr_model.tokenizer.vocab_size

if asr_model.tokenizer.vocab_size != prev_vocab_size:

      with torch.no_grad():
          # 3.1 Decoder 
          asr_model.decoder.prediction.embed.weight[:8192] = ori_decoder_prediction_embed.weight[:8192]
          asr_model.decoder.prediction.embed.weight[-1] = ori_decoder_prediction_embed.weight[-1]
          asr_model.decoder.prediction.dec_rnn = ori_decoder_prediction_dec_rnn

          # # 3.2 Joint 
          asr_model.joint.pred = ori_joint_pred 
          asr_model.joint.enc = ori_joint_enc 

          # #  5 duration token + 1 padding
          asr_model.joint.joint_net[2].weight[:8192] = ori_joint_joint_net_Linear.weight[:8192]
          asr_model.joint.joint_net[2].bias[:8192] = ori_joint_joint_net_Linear.bias[:8192]
          asr_model.joint.joint_net[2].weight[-6:] = ori_joint_joint_net_Linear.weight[-6:]
          asr_model.joint.joint_net[2].bias[-6:] = ori_joint_joint_net_Linear.bias[-6:]

      del ori_decoder_prediction_embed, ori_decoder_prediction_dec_rnn, ori_joint_pred, ori_joint_enc, ori_joint_joint_net_Linear
# inference
output = asr_model.transcribe(["eng-songed.mp3"])
print(output[0].text)

Here is the output:

ไชยศิ เข้ม? เข้มธีคะ เข เขหลาย้ม เข เข้มหลายกลาง เข้ม เข เข้ม เขหลายสุรินทร์ไชยศิริ เข้มอาจอีก้มวรรอีกฟักข้าวจีพอเรีย้อนหน่อยหน่อยิ โดยหน่อยเจเท่าื้อหน้า เขตามขธีี่กลางกลางพื้นอาจเรียฉันปลทรีอินรร เขผึ้งจํากลางเรีย?หน่อยซื้ออีกตากไชยศิแบกันทรีอินญคงพบเพียงท เพฌข้าวจีเอฉันเรียธีตากตากธอเซหน่อยเรียวงตากปลตากตร์ตุ แต่ข้าวจี เขหน่อยซื้ออีกมิธีทรีอินญญเบกําที่จะโลญปลพิเศษนานรัฐทํางาน้งู"ครั้งญญเบอาจญเอยวรรกันกันหลายบอกหลายไหบอกญเอไห้ง เขญวรรกัน่าญอาจอาจชนฉันฉันโลญปลฯฉันตุญรินทร์ห่อละวนบอก เขนักวน?้ง้อน่าวตากียวโลไชยศิ?ตากธีวน้งเปิดเอ้อน เขญไชยศิแบธีเอตากปลค์ เข่าว<eos>ไหกลาง?ภูหมอกอาจกันกันคํา้งกลาง?เอกลางกลางชีวิตเรีย?ธีไชยศิทรีอินทรีอิน เขธีกลางกลางกลางอาจึ่งหน่อยฟักฟัก คุณวิตากุ่มญเบปลเอญไหฟาร์มตุดิหน่อยหน่อยหน่อยทรีอินเบหลาย?โลญ แต่เกิน?เกิน เข เข เขกลางโลกลางร้อยกลางกลางอบบาทครับี่กัน เขเซพอโลญญมิ เขบ้างแล้วครั้ง?กลางกลางข้าวจีไชยศิเอน้องข้าวจีเอกลางกลาง?หน่อยเฉ่า เขหน่อยแล้วหมสามารถตากอีกพื้นขายข้าวหอมมะลิหไว้ฉันหลาย เขกันโลตากชน?เกิน?ตาม่าวฉันกลางตามพอเรียเบอาจกันคําปลหน่อย็มหมมาก้งกลางตากไชยศิข้าวจีกลางกลางยุ<s>กลางรถพอพอกลางกลาง้ม?ูด้มอาจรถรถพอกลางกลาง้มกลางมะโลกลางหลายโลญียงห้าม่พื้นกลางกลางข้าวจี?ซื้อกัน เขส เขเอดาไหรัฐพอ?มะหน้า้างฉันฟาร์มขวดละกลางกลางตากไชยศิญญทรีอินค่าคําปล้มซื้อกันคําบนอดญญสเอเซ เข เขโลเอเติที่เรีย?เอผลพออยพื้นดาไถามบอกห่อละเรีย แต่กลางข้าวจี เขฏอีกคําแดง เขเอฉันเซเดินชนรวมเกินาสแคหน่อยชารัฐปล้งตากเวหน่อย เข่วมตากเกินโลโลกลางออกไป?ียนครั้งรัฐพอ?ทาง?ตากตากียน เข?เกินเอ เขหลายญญเบ เขญไงรรรมมิหน่อยหน่อย"บอกญเอ เขเรีย เข เขข้าวจีฟักข้าวจี 

Did I do something wrong somewhere?

Chonlasitsk avatar Sep 21 '25 09:09 Chonlasitsk

@Chonlasitsk Hi~ I remember that after I changed the vocabulary, it was still able to generate English results normally. I might give you a script later today or tomorrow, and let me take some time to double-check the function for changing the vocabulary—I recall that if some parameters weren’t specified, it would default to something else.

Also, could you provide me with your audio file and tokens? That would make it easier to verify the results.

jeremy110 avatar Sep 21 '25 09:09 jeremy110

@jeremy110 Sure: https://drive.google.com/drive/folders/1Iy0cAudTPUzacdgvMsYReFZDkfwYGY4T?usp=sharing Here is the tokenizer-merging code; I made a slight adjustment.

from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm
import nemo.collections.asr as nemo_asr
import os

def merge_tokenizer_models_with_vocab(original_model_name, model_path2, output_model_path, vocab_output_path):
    asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(original_model_name)
    sp1 = asr_model.tokenizer.tokenizer
    sp2 = spm.SentencePieceProcessor(model_file=model_path2)

    model_proto1 = sp_pb2_model.ModelProto()
    model_proto2 = sp_pb2_model.ModelProto()
    model_proto1.ParseFromString(sp1.serialized_model_proto())
    model_proto2.ParseFromString(sp2.serialized_model_proto())

    new_model_proto = sp_pb2_model.ModelProto()
    vocab_set = set()
    vocab_list = []

    def add_pieces_to_model_proto(source_proto, target_proto, vocab_set, vocab_list, th=False):

        for idx, piece in enumerate(source_proto.pieces):
            if th and idx == 0:
                continue
            if th == False and idx !=0:
                p1 = piece.piece#.upper()
            else:
                p1 = piece.piece
            
            if th:
                p_score = piece.score - 8191.0
            else:
                p_score = piece.score
            if p1 not in vocab_set:
                vocab_set.add(p1)
                vocab_list.append((p1, p_score))
                new_piece = target_proto.pieces.add()
                new_piece.piece = p1
                new_piece.score = p_score
                if idx == 0:
                    new_piece.type = sp_pb2_model.ModelProto.SentencePiece.Type.UNKNOWN

    add_pieces_to_model_proto(model_proto1, new_model_proto, vocab_set, vocab_list)
    add_pieces_to_model_proto(model_proto2, new_model_proto, vocab_set, vocab_list, True)

    print(model_proto1.trainer_spec, model_proto2.trainer_spec)
    new_model_proto.trainer_spec.MergeFrom(model_proto2.trainer_spec)
    new_model_proto.normalizer_spec.MergeFrom(model_proto2.normalizer_spec)

    with open(output_model_path, 'wb') as f:
        f.write(new_model_proto.SerializeToString())
    print(f"New merged tokenizer saved to {output_model_path}")

    with open(vocab_output_path, 'w', encoding='utf-8') as f:
        for piece, score in vocab_list:
            f.write(f"{piece}\t{score}\n")
    print(f"Vocabulary saved to {vocab_output_path}")

if __name__ == "__main__":
    th_path = 'full_tokenizer_th_nemo/tokenizer.model'
    merge_tokenizer_models_with_vocab(
        "nvidia/parakeet-tdt-0.6b-v3", 
        th_path, 
        "merged_nemo_tdt_v3/tokenizer.model", 
        "merged_nemo_tdt_v3/vocab.txt"
    )

FYI: I tested with the parakeet-tdt-0.6b-v2 model and it worked fine, and the only difference between the two models is the tokenizer size.

Chonlasitsk avatar Sep 21 '25 10:09 Chonlasitsk

@Chonlasitsk hi~ I just tried the v3 model, and it really doesn’t work—I’m not too sure why. Also, I noticed that every time I change the dictionary, the transcription output is different, which is quite strange. If you’ve tested that the v2 model works, then initializing with v2 is fine as well. One more thing to be careful about: make sure to use TDTLossNumba. In rnnt_bpe_models.py, you need to modify change_vocabulary like this (https://github.com/NVIDIA-NeMo/NeMo/pull/14155):

        # del self.loss
        # self.loss = RNNTLoss(num_classes=self.joint.num_classes_with_blank - 1)
        loss_kwargs = {
            "fastemit_lambda": 0.0,
            "clamp": -1.0,
            "durations": [0, 1, 2, 3, 4],
            "sigma": 0.02,
            "omega": 0.1,
        }
        self.loss = RNNTLoss(num_classes=self.joint.num_classes_with_blank - 1 - self.joint.num_extra_outputs, loss_name = 'tdt', loss_kwargs = loss_kwargs)

jeremy110 avatar Sep 21 '25 12:09 jeremy110

@jeremy110 Thank you. So in conclusion, does that mean the v3 model cannot be used with this approach?

Chonlasitsk avatar Sep 21 '25 13:09 Chonlasitsk

@Chonlasitsk Yes, I guess it’s probably related to the special tokens. But you can still initialize it this way, and after a bit of training, the English part should recover. However, I would recommend initializing with v2 first.

jeremy110 avatar Sep 21 '25 13:09 jeremy110

Hi @jeremy110

I was trying to fine-tune Parakeet TDT v2 0.6B on the exact same dataset that @deepanshu-yadav mentioned, with more data added to bring it to more than 1000 hours in total. I followed the exact same set of steps you described in the various discussions above,

which included:

  • Merging the existing Parakeet tokenizer with the new tokenizer
  • Freezing half of the encoder with its original weights throughout
  • Randomly initializing the decoder weights

I am using a p5.4x large instance with 80 GB of VRAM available; the current batch size I am using is 32 with grad_acc of 4.

I am running into a weird issue where my val_wer decreased drastically in the initial epochs (the first 2), but it kept increasing at later stages. Although it has only been 11 epochs, is this behaviour expected, or should I try changing some settings for better results?

The results below are for just 11 epochs, by the way.

[Image: validation WER over the first 11 epochs]

mleharsh2ai avatar Sep 25 '25 07:09 mleharsh2ai

@mleharsh2ai Hi~~

  1. If you’re using 80GB of memory, you can unfreeze all parameters. With only 24GB, you’ll need to freeze half of them.

  2. I'd recommend using Lhotse; compared to the original dataloader, it's more efficient, and the batch size adjusts dynamically. You can experiment with the duration; with 80GB of memory, you should be able to set it to around 600–800. Try to keep GPU memory usage around 90%. For gradient accumulation, you can set it to 2 or 4; both should work. (A rough config sketch is shown below, after the plot.)

  3. Generally, you'll see a sharp drop around 10k–20k steps (below is what I observed when training a 110M model). You can use that as a reference; I remember it was roughly similar. Training to around 100k steps should be about enough, and if the loss curve is still going down, you can keep going.

[Image: training loss curve from the 110M run]
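
A rough sketch of the Lhotse-style overrides I mean, in Python; the key names follow NeMo's Lhotse dataloader options, so double-check them against your NeMo version:

from omegaconf import OmegaConf

# Dynamic batching by total audio duration instead of a fixed batch size.
# If your config sets a fixed batch_size, remove it (or set it to null) when
# using batch_duration.
lhotse_overrides = OmegaConf.create({
    "use_lhotse": True,
    "batch_duration": 600,  # total seconds of audio per batch; raise until ~90% GPU memory
    "num_buckets": 30,      # duration bucketing reduces padding waste
    "shuffle": True,
})
cfg.model.train_ds = OmegaConf.merge(cfg.model.train_ds, lhotse_overrides)
asr_model.setup_training_data(cfg.model.train_ds)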

jeremy110 avatar Sep 25 '25 08:09 jeremy110

@jeremy110 Thanks for the quick reply. From what I can observe, the model you are pointing to is 110M, while I am using 0.6B v2; will that performance replicate for it? Also, could you help me understand how Lhotse and unfreezing the entire encoder lead to much better performance (in terms of convergence)? Currently I see random fluctuations even at 15k steps with a half-frozen encoder and a fully unfrozen decoder.

mleharsh2ai avatar Sep 25 '25 08:09 mleharsh2ai