FunASR: Cantonese recognition outputs subwords

Open LRY1994 opened this issue 11 months ago • 2 comments

🐛 Bug

The recognition result comes out as subword tokens.

Maoming accent. gt: 好啲呢我觉得 pred: ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a

To Reproduce

from funasr import AutoModel

model = AutoModel(model="dengcunqin/speech_paraformer-large_asr_nat-zh-cantonese-en-16k-vocab8501-online", model_revision="master")

encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention
chunk_size = [0, 10, 5]

# path (input audio) and local_path (output directory) are defined elsewhere by the reporter
model.generate(input=path,
               chunk_size=chunk_size,
               encoder_chunk_look_back=encoder_chunk_look_back,
               decoder_chunk_look_back=decoder_chunk_look_back,
               is_final=True,
               output_dir=local_path)

Mar 07 '24 09:03 LRY1994

@@ marks a token as a subword piece that continues into the next token. You can join the pieces with: replace('@@ ', '')

Mar 08 '24 17:03 LauraGPT
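A minimal sketch of the join suggested above, applied to the pred string from the report. The helper name merge_subwords is hypothetical and not part of FunASR; it also drops a dangling "@@" at the end of the string, which the plain replace('@@ ', '') would leave in place.

```python
import re


def merge_subwords(text: str) -> str:
    """Join subword pieces by removing the "@@" continuation marker.

    Hypothetical helper, not a FunASR API. It removes "@@" together with
    the space that follows it, and also a trailing "@@" at end of string.
    """
    return re.sub(r"@@(?: |$)", "", text)


pred = "ho@@ al@@ ding ne@@ un@@ qu@@ ar@@ ter a"
print(merge_subwords(pred))  # hoalding neunquarter a
```

Joining only changes how the tokens are displayed. The merged output above still does not match the gt 好啲呢我觉得, so the recognition error itself is a separate matter.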

> @@ marks a token as a subword piece that continues into the next token. You can join the pieces with: replace('@@ ', '')

Can we perhaps add the post-processing statements for handling subwords to the pipelines for all languages? @LauraGPT

Apr 08 '24 03:04 tramphero
