FunASR
Paraformer finetune: problem when adding new words for training?
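Hi experts, I have a question:
We added a few OOV words to the vocabulary for training. The steps were as follows:
1) Added the words to the vocabulary (tokens.txt):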
眀
--task_name asr
--gpu_id $gpu_id
--use_preprocessor true
--token_type $token_type
--token_list $token_list
--dataset_type large
--data_dir ${feats_dir}/data
--train_set ${train_set}
--valid_set ${valid_set}
--data_file_names "wav.scp,text"
--cmvn_file model/am.mvn
--speed_perturb ${speed_perturb}
--resume true
--init_param model/model.pb
--ignore_init_mismatch true
...
The final training loss is as follows:
2023-11-07 12:51:26,261 (build_trainer:733) INFO: 3epoch:train:1-50batch:124num_updates: iter_time=0.732, forward_time=6.521, loss_att=7.832, acc=2.894e-05, loss_pre=0.009, loss=7.841, backward_time=0.451, optim_step_time=0.052, optim0_lr0=1.992e-06, train_time=32.939
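For reference, the metrics in this log line can be pulled out with a small script. This is only a minimal sketch: `parse_metrics` is a hypothetical helper, not part of FunASR, and it assumes the `key=value` layout seen in the line above.

```python
import re

# The trainer log line quoted above, copied verbatim.
LOG_LINE = (
    "2023-11-07 12:51:26,261 (build_trainer:733) INFO: "
    "3epoch:train:1-50batch:124num_updates: iter_time=0.732, "
    "forward_time=6.521, loss_att=7.832, acc=2.894e-05, loss_pre=0.009, "
    "loss=7.841, backward_time=0.451, optim_step_time=0.052, "
    "optim0_lr0=1.992e-06, train_time=32.939"
)

# Matches name=number pairs, including scientific notation such as 2.894e-05.
_METRIC = re.compile(r"(\w+)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_metrics(log_line):
    """Hypothetical helper: return every key=value number in a log line."""
    return {key: float(value) for key, value in _METRIC.findall(log_line)}


if __name__ == "__main__":
    metrics = parse_metrics(LOG_LINE)
    for name in ("loss_att", "loss_pre", "loss", "acc"):
        print(name, metrics[name])
```

For this line, `acc` comes out at about 2.9e-05 while `loss_att` is 7.832, which is the symptom being asked about.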
The problem shown above occurred. What could be the cause?
Looking forward to your replies.
The current finetuning pipeline of FunASR does not support directly modifying the subword vocabulary to add OOV words for finetuning. If this is needed, the following modifications are required:
1) Modify tokens.txt. It is recommended to expand it further.
2) Modify model.pb. After adding modeling units, the output layer of the model needs to be expanded accordingly, with the connections for the newly added modeling units initialized randomly.
3) Finetune the model. Since OOV words have been added, the training data needs sufficient coverage of them so that they can be recognized after training.
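The three steps above can be sketched roughly as follows. This is a framework-agnostic illustration and not FunASR code: all function names are hypothetical, and plain Python lists stand in for the weight matrix of the output layer (with a real checkpoint the same row-append would be done on the corresponding tensors, whose names depend on the model). Appending new tokens at the end of the list is an assumption of this sketch. Whether special symbols must keep a particular position depends on the model configuration.

```python
import random


def append_tokens(token_list, new_tokens):
    """Step 1 (sketch): append unseen tokens at the end of the token list.

    Existing tokens keep their indices, so pretrained rows still line up.
    Returns the extended list and the tokens that were actually added.
    """
    existing = set(token_list)
    # dict.fromkeys removes duplicates in new_tokens while keeping order.
    added = [t for t in dict.fromkeys(new_tokens) if t not in existing]
    return token_list + added, added


def expand_rows(weight, n_new, std=0.02, seed=0):
    """Step 2 (sketch): add n_new randomly initialized rows to a
    [vocab_size, dim] matrix; the pretrained rows are copied unchanged.
    The std value is an arbitrary choice for illustration."""
    rng = random.Random(seed)
    dim = len(weight[0])
    new_rows = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_new)]
    return [row[:] for row in weight] + new_rows


def count_token_occurrences(transcripts, tokens):
    """Step 3 (sketch): count how often each new token occurs in the
    training transcripts, as a rough coverage check before finetuning."""
    return {t: sum(line.count(t) for line in transcripts) for t in tokens}


if __name__ == "__main__":
    # Toy vocabulary and weights, made up for the example.
    tokens = ["<blank>", "<s>", "</s>", "a", "b", "<unk>"]
    weight = [[0.1 * i, -0.1 * i] for i in range(len(tokens))]

    tokens, added = append_tokens(tokens, ["眀"])
    weight = expand_rows(weight, len(added))
    print(len(tokens), len(weight))
    print(count_token_occurrences(["眀 a b", "a b"], added))
```

The point of the sketch is that the vocabulary size and the number of rows in the output layer must grow together, and that a new token seen rarely or never in the training text has little chance of being learned.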
Hope it will be helpful!
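Many thanks to 良博 for the prompt answer. Understood! Looking forward to FunASR getting better and better and building a complete ASR ecosystem.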