FunASR
Cantonese recognition outputs subword tokens
🐛 Bug
The recognition result contains subword tokens.
Maoming accent. gt: 好 啲 呢 我 觉 得 pred: ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a
To Reproduce
from funasr import AutoModel

# path (input audio) and local_path (output directory) are defined elsewhere
model = AutoModel(model="dengcunqin/speech_paraformer-large_asr_nat-zh-cantonese-en-16k-vocab8501-online", model_revision="master")
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention
chunk_size = [0, 10, 5]
model.generate(input=path,
               chunk_size=chunk_size,
               encoder_chunk_look_back=encoder_chunk_look_back,
               decoder_chunk_look_back=decoder_chunk_look_back,
               is_final=True,
               output_dir=local_path)
@@ means the token is a subword piece. You can join the pieces with: replace('@@ ', '')
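As a rough illustration of the suggested replace('@@ ', '') step, here is a minimal sketch applied to the pred string from the report. The helper name merge_subwords and the trailing-marker handling are assumptions added for illustration; they are not part of FunASR.

```python
def merge_subwords(text: str, marker: str = "@@") -> str:
    """Join subword pieces marked with a trailing '@@' (hypothetical helper)."""
    # Core step from the reply above: drop the marker and the space after it.
    merged = text.replace(marker + " ", "")
    # Assumption: a final piece may end with the marker and have no following space.
    if merged.endswith(marker):
        merged = merged[: -len(marker)]
    return merged


pred = "ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a"
print(merge_subwords(pred))  # hoalding neunquarter a
```

This only joins the pieces for display. It does not change what the model recognized, so the mismatch with the gt text remains. If the return value of model.generate is a list of dicts with a text field (an assumption here, not confirmed in this thread), the helper would be applied to that field.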
Can we perhaps add the post-processing statements for handling subwords to the pipelines for all languages? @LauraGPT