
PyTorch tensor issue

Open linrb685 opened this issue 1 year ago • 9 comments

❓ Questions and Help

What is your question?

My question is: during use, the following error suddenly appeared: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 1 for tensor number 1 in the list

Code

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    # spk_model="cam++",
)

res = model.generate(input=output_path, batch_size_s=300, hotword='魔搭')
print(res)

What have you tried?

It had been running for a while and recognized speech correctly at first, but then it suddenly started failing. I have not restarted the script yet because I want to debug this issue. I re-ran recognition on audio files that previously worked, but they now fail with the same error. Online sources say this is a PyTorch tensor issue, but nothing in my own code manipulates tensors directly.

What's your environment?

  • OS: win10
  • FunASR Version: 1.0.25
  • ModelScope Version: 1.14.0
  • PyTorch Version: pytorch-wpe version 0.0.1
  • How you installed funasr (pip, source): installed directly with pip install
  • Python version: 3.11.7 (running on CPU, no GPU)

Could someone please take a look? Thanks!

linrb685 avatar May 10 '24 03:05 linrb685

Please show the detailed error logs and upload the wav file.

LauraGPT avatar May 10 '24 03:05 LauraGPT

I got the same issue here when using the Cantonese model. Here is the full log, @LauraGPT:

Sizes of tensors must match except in dimension 2. Expected size 1 but got size 2 for tensor number 1 in the list.
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/scama/decoder.py", line 457, in forward_one_step
    x = torch.cat((x, pre_acoustic_embeds), dim=-1)
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/scama/decoder.py", line 419, in score
    logp, state = self.forward_one_step(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/beam_search.py", line 176, in score_full
    scores[k], states[k] = d.score(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/beam_search.py", line 309, in search
    scores, states = self.score_full(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/beam_search.py", line 410, in forward
    best = self.search(
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/model.py", line 996, in inference
    nbest_hyps = self.beam_search(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 285, in inference
    res = model.inference(**batch, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 394, in inference_with_vad
    results = self.inference(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 248, in generate
    return self.inference_with_vad(input, input_len=input_len, **cfg)
  File "/home/user/miniconda/lib/python3.9/site-packages/modelscope/models/audio/funasr/model.py", line 61, in forward
    output = self.model.generate(*args, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/modelscope/models/base/base_model.py", line 35, in __call__
    return self.postprocess(self.forward(*args, **kwargs))
  File "/home/user/miniconda/lib/python3.9/site-packages/modelscope/pipelines/audio/funasr_pipeline.py", line 73, in __call__
    output = self.model(*args, **kwargs)
  File "/data/tts/sovits/GPT-SoVITS/tools/asr/funasr_cantonese.py", line 35, in <module>
    rec_result = inference_pipeline(input="/data/tts/sovits/audio_res/e1/12_4.wav")
  File "/home/user/miniconda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1 but got size 2 for tensor number 1 in the list.
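For context, this error comes from `torch.cat` being handed tensors whose non-concatenation dimensions disagree. A minimal standalone reproduction (purely illustrative, not FunASR code) with a batch-size-1 tensor and a batch-size-2 tensor:

```python
import torch

# torch.cat along the last dimension requires every other dimension to agree.
# Here the batch dimensions differ (1 vs 2), which raises the same error
# message seen in the traceback above.
a = torch.zeros(1, 3, 4)  # batch size 1
b = torch.zeros(2, 3, 5)  # batch size 2 -> mismatch in dim 0
try:
    torch.cat((a, b), dim=-1)
except RuntimeError as e:
    print(e)  # Sizes of tensors must match except in dimension 2. Expected size 1 but got size 2 ...
```

This is consistent with the later finding that the error is tied to batching: once the VAD model splits the audio into multiple segments, tensors with different batch sizes apparently reach the decoder's `torch.cat`.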

Code I used:

from funasr import AutoModel

path_asr  =  "iic/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online"
path_vad  =  "iic/speech_fsmn_vad_zh-cn-16k-common-pytorch"
path_punc =  "iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"

model = AutoModel(
    model               = path_asr,
    vad_model           = path_vad,
    vad_model_revision  = "v2.0.4",
    punc_model          = path_punc,
    punc_model_revision = "v2.0.4",
)



res = model.generate(
    input="/data/tts/sovits/audio_res/e1/12_4.wav"              # Failed 
    # input="/data/tts/sovits/audio_res/e1/12_12.wav"         # Success
)
print(res)

Here is the audio file I used: Desktop.zip

kexul avatar May 10 '24 04:05 kexul

The audio file that fails in my code can be processed successfully in the ModelScope online demo: https://www.modelscope.cn/models/iic/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online/summary. Maybe a recent update broke some functionality?

kexul avatar May 10 '24 04:05 kexul

After some digging, I found that UniASR does not seem to handle batch sizes greater than 1. When the VAD model is enabled and splits the audio into segments, the error is triggered. A temporary workaround is to disable the VAD model.

kexul avatar May 10 '24 07:05 kexul

@kexul Do you mean this is a problem with the VAD model, and it works if I just don't use VAD?

linrb685 avatar May 10 '24 07:05 linrb685

@kexul Do you mean this is a problem with the VAD model, and it works if I just don't use VAD?

Yes, after I disabled VAD on my side, everything runs. You can give it a try.

kexul avatar May 10 '24 07:05 kexul

@kexul Thanks, I'll give it a try.

linrb685 avatar May 10 '24 07:05 linrb685

@linrb685 If you still want VAD and punctuation, you can do them manually 🤣:

import soundfile
from pathlib import Path
from funasr import AutoModel

path_asr  =  "iic/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online"
path_vad  =  "iic/speech_fsmn_vad_zh-cn-16k-common-pytorch"
path_punc =  "iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"

model = AutoModel(model=path_asr)
vad_model = AutoModel(model=path_vad)
punc_model = AutoModel(model=path_punc)


for item in Path('.').glob('*.wav'):
    print(str(item))
    text = model.generate(input=str(item))[0]['text']
    print(text)

    res_vad = vad_model.generate(input=str(item))[0]['value']
    wav, sr = soundfile.read(str(item))

    full_text = []
    for span in res_vad:
        wav_span = wav[int(span[0]*sr/1000):int(span[1]*sr/1000)]
        soundfile.write('temp.wav', wav_span, sr)  # soundfile.write returns None, so don't assign it
        text = model.generate(input='temp.wav')[0]['text']
        full_text.append(text)

    full_text = ' '.join(full_text)

    punc_text = punc_model.generate(input=full_text)[0]['text']
    print(punc_text)
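As a side note, the millisecond-to-sample arithmetic used when slicing the waveform above can be factored into a small helper (hypothetical, for illustration; the fsmn-vad model returns spans as `[start_ms, end_ms]` pairs):

```python
def span_to_samples(span_ms, sr):
    """Convert a [start_ms, end_ms] VAD span to (start, end) sample indices."""
    start_ms, end_ms = span_ms
    return int(start_ms * sr / 1000), int(end_ms * sr / 1000)

# A 250-1250 ms span at 16 kHz covers samples 4000-20000.
print(span_to_samples([250, 1250], 16000))  # -> (4000, 20000)
```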

kexul avatar May 10 '24 08:05 kexul

@kexul Thanks. VAD isn't essential for us, though we may consider adding it back. We haven't hit the issue yet without it; more testing is needed.

linrb685 avatar May 10 '24 09:05 linrb685