Training questions about LLM-ASR-NAR
❓ Questions and Help
### Question 1
Is there a reference conf for llm-asr-nar? I suspect the config I wrote myself is not quite right:

```yaml
model: LLMASRNAR
model_conf:
  lsm_weight: 0.1  # label smoothing option
  length_normalized_loss: true

encoder: SANMEncoder
encoder_conf:
  hub: funasr
  init_param_path: "/ssd/zhuang/code/LLM/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/"
  freeze: true

llm: Qwen2.5-1.5B-Instruct
llm_conf:
  hub: hf
  freeze: true
  init_param_path: "/ssd/zhuang/code/LLM/qwen2.5-1.5B-Instruct/"

adaptor: Linear
adaptor_conf:
  downsample_rate: 1
  llm_dim: 1536
  encoder_dim: 512

frontend: WavFrontend
frontend_conf:
  fs: 16000
  window: hamming
  n_mels: 80
  frame_length: 25
  frame_shift: 10
  lfr_m: 7
  lfr_n: 6
  cmvn_file: "/ssd/zhuang/code/LLM/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/am.mvn"

specaug: SpecAugLFR
specaug_conf:
  apply_time_warp: false
  time_warp_window: 5
  time_warp_mode: bicubic
  apply_freq_mask: true
  freq_mask_width_range:
    - 0
    - 30
  lfr_rate: 6
  num_freq_mask: 1
  apply_time_mask: true
  time_mask_width_range:
    - 0
    - 12
  num_time_mask: 1

train_conf:
  accum_grad: 1
  grad_clip: 5
  max_epoch: 150
  keep_nbest_models: 10
  log_interval: 150

optim: adamw
optim_conf:
  lr: 0.0001
  weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 1500

dataset: AudioLLMQwenAudioDataset
dataset_conf:
  index_ds: IndexDSJsonl
  batch_sampler: CustomDistributedBatchSampler
  batch_type: example  # example or length
  batch_size: 16  # if batch_type is example, batch_size is the number of samples; if length, it is source_token_len + target_token_len
  max_token_length: 3000
  shuffle: true
  num_workers: 4
  preprocessor_text: TextPreprocessRemovePunctuation
  audio_adaptor_downsample_rate: ${adaptor_conf.downsample_rate}
  audio_encoder_downsample_rate: 1

tokenizer: HuggingfaceTokenizer
tokenizer_conf:
  unk_symbol: <unk>
  init_param_path: "/ssd/zhuang/code/LLM/qwen2.5-1.5B-Instruct/"
```
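As a dimension sanity check on the `adaptor_conf` above: with `downsample_rate: 1` the adaptor only has to project `encoder_dim` 512 up to `llm_dim` 1536. Below is a minimal numpy sketch of how a linear adaptor's downsample rate couples frame stacking to the projection's input width; this is a hypothetical illustration, not FunASR's actual `Linear` adaptor implementation:

```python
import numpy as np

# Hypothetical linear adaptor sketch (NOT FunASR's actual Linear adaptor):
# stack `downsample_rate` consecutive encoder frames, then project to llm_dim.
encoder_dim, llm_dim, downsample_rate = 512, 1536, 1

rng = np.random.default_rng(0)
W = rng.standard_normal((encoder_dim * downsample_rate, llm_dim)) * 0.01

def adapt(encoder_out):
    """encoder_out: [T, encoder_dim] -> [T // downsample_rate, llm_dim]."""
    T = encoder_out.shape[0] // downsample_rate * downsample_rate  # drop tail frames
    x = encoder_out[:T].reshape(-1, encoder_dim * downsample_rate)  # stack frames
    return x @ W

enc = rng.standard_normal((50, encoder_dim))
print(adapt(enc).shape)  # (50, 1536) when downsample_rate == 1
```

With `downsample_rate: 1` the frame count is unchanged, which is consistent with `audio_encoder_downsample_rate: 1` in `dataset_conf`; a larger rate would shrink the sequence and widen the projection input accordingly.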
### Question 2
Regarding lines 192-199 of funasr/model/llm_asr_nar.py:

```python
if audio_mask is not None:
    # audio_mask: [b, 1*enc_len+0*(prompt_len+label_len), 1]
    batch_size, token_num, dims = inputs_embeds.shape  # [b, 0*enc_len+prompt+label, dim]
    _, l, _ = encoder_out.shape  # [b, enc_len, dim]
    encoder_outs_pad = F.pad(encoder_out, (0, 0, token_num - l - 1, 1, 0, 0), value=0.0)
    # [b, 0*(prompt_len+label_len-1)+enc_len+0, dim]
    inputs_embeds = encoder_outs_pad * audio_mask[:, :, None] + inputs_embeds * (
        1.0 - audio_mask[:, :, None]
    )
    inputs_embeds = F.pad(inputs_embeds[:, 1:, :], (0, 0, 0, 1, 0, 0), value=0.0)
```

Here `encoder_outs_pad` has already been shifted by the zero padding in the previous step, so why is it still multiplied by the original, unshifted `audio_mask`? When I train with this code the loss drops to 0 almost immediately, which doesn't seem right.
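To make the index arithmetic concrete, here is a toy numpy reproduction of the snippet's padding and masking (made-up shapes: `enc_len=3`, `prompt_len+label_len=4`, `dim=1`, and an `audio_mask` that is front-aligned exactly as the comment `1*enc_len+0*(prompt_len+label_len)` describes; whether the real mask is actually laid out this way is the open question):

```python
import numpy as np

# Toy shapes: token_num = enc_len + prompt_len + label_len.
enc_len, other_len = 3, 4
token_num = enc_len + other_len  # 7

# Encoder frames are marked 10, 20, 30 so they are easy to track.
encoder_out = np.array([[[10.0], [20.0], [30.0]]])  # [1, enc_len, 1]
inputs_embeds = np.ones((1, token_num, 1))          # [1, token_num, 1]

# audio_mask as the code comment describes it: ones over the FIRST enc_len positions.
audio_mask = np.zeros((1, token_num))
audio_mask[:, :enc_len] = 1.0

# F.pad(encoder_out, (0, 0, token_num - l - 1, 1, 0, 0)) in numpy terms:
# left-pad the length dim with token_num - enc_len - 1 zeros, right-pad 1 zero.
left = token_num - enc_len - 1
encoder_outs_pad = np.pad(encoder_out, ((0, 0), (left, 1), (0, 0)))

# The blend from the snippet: encoder frames where mask==1, embeddings elsewhere.
blended = encoder_outs_pad * audio_mask[:, :, None] + inputs_embeds * (
    1.0 - audio_mask[:, :, None]
)

# The final left-shift-by-one with a zero appended at the end.
shifted = np.pad(blended[:, 1:, :], ((0, 0), (0, 1), (0, 0)))

print(encoder_outs_pad[0, :, 0])  # [ 0.  0.  0. 10. 20. 30.  0.] -> frames at the back
print(audio_mask[0])              # [1. 1. 1. 0. 0. 0. 0.]        -> ones at the front
print(blended[0, :, 0])           # [0. 0. 0. 1. 1. 1. 1.]        -> encoder frames gone
print(shifted[0, :, 0])           # [0. 0. 1. 1. 1. 1. 0.]
```

If the mask really is front-aligned while the padded encoder frames sit at the back, the blend zeroes out the masked positions and discards the encoder output entirely, which is exactly the mismatch the question points at; the behavior only makes sense if the real `audio_mask` is aligned with the shifted layout.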