
paraformer finetune: problem when adding new words for training?

Open zw76859420 opened this issue 1 year ago • 2 comments

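Hi, a question: we added a few OOV characters to the vocabulary for training. The steps were as follows:
1) Added the characters to the vocabulary (tokens.txt): 眀 瑧
2) Trained with the stock paraformer recipe (egs/aishell/s1/run.sh), modified as follows:
train.py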
--task_name asr
--gpu_id $gpu_id
--use_preprocessor true
--token_type $token_type
--token_list $token_list
--dataset_type large
--data_dir ${feats_dir}/data
--train_set ${train_set}
--valid_set ${valid_set}
--data_file_names "wav.scp,text"
--cmvn_file model/am.mvn
--speed_perturb ${speed_perturb}
--resume true
--init_param model/model.pb
--ignore_init_mismatch true ...
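
Before launching a run like the one above, it may help to check how often the newly added tokens actually occur in the training transcripts, and whether any characters are still missing from the vocabulary. The following is only a minimal sketch under assumptions not confirmed in this thread: tokens.txt holds one token per line, the `text` file is Kaldi-style (`<utt_id> <transcript>`), and tokens are single characters (English subwords are not handled). The helper names `load_tokens` and `token_coverage` are hypothetical and not part of FunASR.

```python
from collections import Counter


def load_tokens(lines):
    """Return the set of tokens, assuming one token per non-empty line."""
    return {line.strip() for line in lines if line.strip()}


def token_coverage(text_lines, tokens, new_tokens):
    """Count occurrences of each new token and of characters missing from `tokens`.

    Assumption: each line looks like "<utt_id> <transcript>" and the
    transcript is tokenized character by character.
    """
    new_counts = {tok: 0 for tok in new_tokens}
    missing = Counter()
    for line in text_lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # skip lines that carry no transcript
        for ch in parts[1].replace(" ", ""):
            if ch in new_counts:
                new_counts[ch] += 1
            if ch not in tokens:
                missing[ch] += 1
    return new_counts, dict(missing)


# Toy data standing in for tokens.txt and the training "text" file.
tokens = load_tokens(["<blank>", "<s>", "</s>", "你", "好", "眀", "瑧", "<unk>"])
text = ["utt1 你好眀", "utt2 好 瑧 啊"]
new_counts, missing = token_coverage(text, tokens, ["眀", "瑧"])
print(new_counts)  # how often each added token appears
print(missing)     # characters that are still out of vocabulary
```

With real data, the two lists would be read from the files passed via `--token_list` and `--data_file_names`; a new token with a very low count is unlikely to be learned well.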

The resulting training loss is as follows:
2023-11-07 12:51:26,261 (build_trainer:733) INFO: 3epoch:train:1-50batch:124num_updates: iter_time=0.732, forward_time=6.521, loss_att=7.832, acc=2.894e-05, loss_pre=0.009, loss=7.841, backward_time=0.451, optim_step_time=0.052, optim0_lr0=1.992e-06, train_time=32.939

The problem shown above occurred. What could be the cause?

Looking forward to your replies.

zw76859420 avatar Nov 07 '23 05:11 zw76859420

The current finetuning pipeline of FunASR does not support adding OOV words by directly modifying the subword vocabulary. If you need this, the following modifications are required:

1) Modify tokens.txt: it is recommended to expand it further.

2) Modify model.pb: after adding modeling units, the output layer of the model needs to be expanded accordingly, with the connections of the newly added modeling units initialized randomly.

3) Finetune the model: since OOV vocabulary has been added, the training data needs to cover the new words sufficiently so that they can be recognized after training.
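
To illustrate step 2, here is a framework-agnostic toy sketch of what "expanding the output layer" could look like: the existing rows (one per old token) are kept unchanged and new, randomly initialized rows are appended for the added tokens. It uses plain Python lists rather than real checkpoint tensors. The helper names `expand_rows` and `expand_bias`, the standard deviation of 0.02, and the assumption that new tokens are appended at the end of tokens.txt (so existing token ids do not shift) are all illustrative choices, not something stated in this thread. Which parameters inside model.pb depend on the vocabulary size, and whether special tokens must keep a fixed position, is not covered here.

```python
import random


def expand_rows(weight, num_new, std=0.02, seed=0):
    """Return a copy of `weight` with `num_new` random rows appended.

    `weight` is a list of rows, one row per existing token. Old rows are
    copied unchanged; new rows are drawn from N(0, std^2) (std is an
    arbitrary illustrative value).
    """
    rng = random.Random(seed)
    dim = len(weight[0])
    old_rows = [list(row) for row in weight]
    new_rows = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_new)]
    return old_rows + new_rows


def expand_bias(bias, num_new):
    """Return a copy of `bias` with zeros appended for the new tokens."""
    return list(bias) + [0.0] * num_new


# Toy output layer: 4 existing tokens, hidden size 3.
old_weight = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
old_bias = [0.0, 0.1, 0.2, 0.3]

# Two tokens (e.g. 眀 and 瑧) appended to the end of the vocabulary.
new_weight = expand_rows(old_weight, num_new=2)
new_bias = expand_bias(old_bias, num_new=2)
print(len(new_weight), len(new_bias))
```

In a real checkpoint the same idea would apply to every parameter whose first dimension equals the vocabulary size, so that the pretrained rows are preserved and only the new rows start from random values.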

Hope it will be helpful!

tramphero avatar Nov 07 '23 05:11 tramphero

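Many thanks to 良博 for the prompt answer, understood! Looking forward to FunASR getting better and better and building a complete ASR ecosystem.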

zw76859420 avatar Nov 07 '23 06:11 zw76859420