
PyTorch tensor issue

Open linrb685 opened this issue 1 year ago • 9 comments

❓ Questions and Help

What is your question?

My question is: during use, the following error suddenly appeared: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 1 for tensor number 1 in the list

Code

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    # spk_model="cam++",
)

res = model.generate(input=output_path, batch_size_s=300, hotword='魔搭')
print(res)

What have you tried?

It had been running for a while and recognized speech correctly at first, but then it suddenly started failing. I have not restarted the script yet because I want to debug this issue. I re-ran recognition on audio files that previously worked, but they now fail with the same error. Online sources say this is a PyTorch tensor issue, but nothing in my own code manipulates tensors directly.

What's your environment?

  • OS: win10
  • FunASR Version: 1.0.25
  • ModelScope Version: 1.14.0
  • PyTorch Version: pytorch-wpe version 0.0.1
  • How you installed funasr (pip, source): installed directly with pip install
  • Python version: 3.11.7 (running on CPU, no GPU)

Could someone please take a look? Thanks!

linrb685 avatar May 10 '24 03:05 linrb685

Please show the detailed error logs and upload the wav file.

LauraGPT avatar May 10 '24 03:05 LauraGPT

I got the same issue here when using the Cantonese model. Here is the full log, @LauraGPT:

Sizes of tensors must match except in dimension 2. Expected size 1 but got size 2 for tensor number 1 in the list.
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/scama/decoder.py", line 457, in forward_one_step
    x = torch.cat((x, pre_acoustic_embeds), dim=-1)
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/scama/decoder.py", line 419, in score
    logp, state = self.forward_one_step(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/beam_search.py", line 176, in score_full
    scores[k], states[k] = d.score(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/beam_search.py", line 309, in search
    scores, states = self.score_full(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/beam_search.py", line 410, in forward
    best = self.search(
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/models/uniasr/model.py", line 996, in inference
    nbest_hyps = self.beam_search(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 285, in inference
    res = model.inference(**batch, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 394, in inference_with_vad
    results = self.inference(
  File "/home/user/miniconda/lib/python3.9/site-packages/funasr/auto/auto_model.py", line 248, in generate
    return self.inference_with_vad(input, input_len=input_len, **cfg)
  File "/home/user/miniconda/lib/python3.9/site-packages/modelscope/models/audio/funasr/model.py", line 61, in forward
    output = self.model.generate(*args, **kwargs)
  File "/home/user/miniconda/lib/python3.9/site-packages/modelscope/models/base/base_model.py", line 35, in __call__
    return self.postprocess(self.forward(*args, **kwargs))
  File "/home/user/miniconda/lib/python3.9/site-packages/modelscope/pipelines/audio/funasr_pipeline.py", line 73, in __call__
    output = self.model(*args, **kwargs)
  File "/data/tts/sovits/GPT-SoVITS/tools/asr/funasr_cantonese.py", line 35, in <module>
    rec_result = inference_pipeline(input="/data/tts/sovits/audio_res/e1/12_4.wav")
  File "/home/user/miniconda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1 but got size 2 for tensor number 1 in the list.
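For context, this error comes from `torch.cat` being handed tensors whose non-concatenation dimensions disagree. A minimal standalone reproduction (purely illustrative, not FunASR code) with a batch-size-1 tensor and a batch-size-2 tensor:

```python
import torch

# torch.cat along the last dimension requires every other dimension to agree.
# Here the batch dimensions differ (1 vs 2), which raises the same error
# message seen in the traceback above.
a = torch.zeros(1, 3, 4)  # batch size 1
b = torch.zeros(2, 3, 5)  # batch size 2 -> mismatch in dim 0
try:
    torch.cat((a, b), dim=-1)
except RuntimeError as e:
    print(e)  # Sizes of tensors must match except in dimension 2. Expected size 1 but got size 2 ...
```

This is consistent with the later finding that the error is tied to batching: once the VAD model splits the audio into multiple segments, tensors with different batch sizes apparently reach the decoder's `torch.cat`.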

Code I used:

from funasr import AutoModel

path_asr  =  "iic/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online"
path_vad  =  "iic/speech_fsmn_vad_zh-cn-16k-common-pytorch"
path_punc =  "iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"

model = AutoModel(
    model               = path_asr,
    vad_model           = path_vad,
    vad_model_revision  = "v2.0.4",
    punc_model          = path_punc,
    punc_model_revision = "v2.0.4",
)



res = model.generate(
    input="/data/tts/sovits/audio_res/e1/12_4.wav"              # Failed 
    # input="/data/tts/sovits/audio_res/e1/12_12.wav"         # Success
)
print(res)

Here is the audio file I used: Desktop.zip

kexul avatar May 10 '24 04:05 kexul

The audio file that fails in my code can be processed successfully in the ModelScope online demo: https://www.modelscope.cn/models/iic/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online/summary. Maybe a recent update broke some functionality?

kexul avatar May 10 '24 04:05 kexul

After some digging, I found that UniASR does not seem to handle batch sizes greater than 1. When the VAD model is enabled and splits the audio into segments, the error is triggered. A temporary workaround is to disable the VAD model.

kexul avatar May 10 '24 07:05 kexul

@kexul Do you mean this is a problem with the VAD model, and it works if I just don't use VAD?

linrb685 avatar May 10 '24 07:05 linrb685

@kexul Do you mean this is a problem with the VAD model, and it works if I just don't use VAD?

Yes, after I disabled VAD on my side, everything runs. You can give it a try.

kexul avatar May 10 '24 07:05 kexul

@kexul Thanks, I'll give it a try.

linrb685 avatar May 10 '24 07:05 linrb685

@linrb685 If you still want VAD and punctuation, you can do them manually 🤣:

import soundfile
from pathlib import Path
from funasr import AutoModel

path_asr  =  "iic/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online"
path_vad  =  "iic/speech_fsmn_vad_zh-cn-16k-common-pytorch"
path_punc =  "iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch"

model = AutoModel(model=path_asr)
vad_model = AutoModel(model=path_vad)
punc_model = AutoModel(model=path_punc)


for item in Path('.').glob('*.wav'):
    print(str(item))
    text = model.generate(input=str(item))[0]['text']
    print(text)

    res_vad = vad_model.generate(input=str(item))[0]['value']
    wav, sr = soundfile.read(str(item))

    full_text = []
    for span in res_vad:
        wav_span = wav[int(span[0]*sr/1000):int(span[1]*sr/1000)]
        soundfile.write('temp.wav', wav_span, sr)  # soundfile.write returns None, so don't assign it
        text = model.generate(input='temp.wav')[0]['text']
        full_text.append(text)

    full_text = ' '.join(full_text)

    punc_text = punc_model.generate(input=full_text)[0]['text']
    print(punc_text)
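As a side note, the millisecond-to-sample arithmetic used when slicing the waveform above can be factored into a small helper (hypothetical, for illustration; the fsmn-vad model returns spans as `[start_ms, end_ms]` pairs):

```python
def span_to_samples(span_ms, sr):
    """Convert a [start_ms, end_ms] VAD span to (start, end) sample indices."""
    start_ms, end_ms = span_ms
    return int(start_ms * sr / 1000), int(end_ms * sr / 1000)

# A 250-1250 ms span at 16 kHz covers samples 4000-20000.
print(span_to_samples([250, 1250], 16000))  # -> (4000, 20000)
```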

kexul avatar May 10 '24 08:05 kexul

@kexul Thanks. VAD isn't essential for us, though we may consider adding it back. We haven't hit the issue yet without it; more testing is needed.

linrb685 avatar May 10 '24 09:05 linrb685