FireRedASR: LLM model inference results are abnormal when fp16 is enabled

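My GPU has 32 GB of memory. Running the LLM inference script reported out of memory, and after enabling fp16 the recognition results became abnormal: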

CUDA_VISIBLE_DEVICES=0
speech2text.py --asr_type llm --model_dir /asr/kell/FireRedASR/examples/pretrained_models/FireRedASR-LLM-L --use_fp16 1 --batch_size 1 --beam_size 3 --decode_max_len 0 --decode_min_len 0 --repetition_penalty 3.0 --llm_length_penalty 1.0 --temperature 1.0 --wav_scp wav/wav.scp --output out/llm-l-asr.txt
Namespace(asr_type='llm', model_dir='/asr/kell/FireRedASR/examples/pretrained_models/FireRedASR-LLM-L', use_fp16=True, wav_path=None, wav_paths=None, wav_dir=None, wav_scp='wav/wav.scp', output='out/llm-l-asr.txt', use_gpu=1, batch_size=1, beam_size=3, decode_max_len=0, nbest=1, softmax_smoothing=1.0, aed_length_penalty=0.0, eos_penalty=1.0, decode_min_len=0, repetition_penalty=3.0, llm_length_penalty=1.0, temperature=1.0)
#wavs=4
model args: Namespace(input_length_max=30.0, input_length_min=0.1, output_length_max=150, output_length_min=1, freeze_encoder=0, encoder_downsample_rate=2, freeze_llm=0, use_flash_attn=0, use_lora=1, unk='', use_fp16=1, encoder_path='/asr/kell/FireRedASR/examples/pretrained_models/FireRedASR-LLM-L/asr_encoder.pth.tar', llm_dir='/asr/kell/FireRedASR/examples/pretrained_models/FireRedASR-LLM-L/Qwen2-7B-Instruct')
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.80it/s]
trainable params: 161,480,704 || all params: 7,777,097,216 || trainable%: 2.0764
/home/kell/anaconda3/envs/fireredasr/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:650: UserWarning: do_sample is set to False. However, top_k is set to 20 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_k.
  warnings.warn(
{'uttid': 'BAC009S0764W0121', 'text': '"""""""""""""""""""""""""""""""""""""""""""""""""""%', 'wav': 'wav/BAC009S0764W0121.wav', 'rtf': '1.3632'}
{'uttid': 'IT0011W0001', 'text': '""""""""""""""""""""""""%', 'wav': 'wav/IT0011W0001.wav', 'rtf': '0.9503'}
{'uttid': 'TEST_NET_Y0000000000_-KTKHdZ2fb8_S00000', 'text': '"""""""""""""""""""""%', 'wav': 'wav/TEST_NET_Y0000000000_-KTKHdZ2fb8_S00000.wav', 'rtf': '0.9159'}
{'uttid': 'TEST_MEETING_T0000000001_S00000', 'text': '"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""%', 'wav': 'wav/TEST_MEETING_T0000000001_S00000.wav', 'rtf': '0.9114'}
ref=wav/text
wer.py --print_sentence_wer 1 --do_tn 0 --rm_special 1 --ref wav/text --hyp out/llm-l-asr.txt
tail -n8 out/llm-l-asr.txt.wer

ref 89 sub 4 del 85 ins 0
WER 100.00 sub 4.49 del 95.51 ins 0.00
SER 100.00 = 4 / 4

English #word=0, #correct=0
Digit #word=0, #correct=0
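
As a sanity check on the summary above, the percentages can be recomputed from the raw counts. This is a minimal sketch that assumes the standard WER definition (substitutions + deletions + insertions, divided by the number of reference tokens); the thread does not show how wer.py computes it, and `error_rates` is a hypothetical helper name.

```python
def error_rates(ref: int, sub: int, dele: int, ins: int) -> dict:
    """Return WER and per-type error rates as percentages of reference tokens."""
    if ref <= 0:
        raise ValueError("ref must be positive")
    return {
        "wer": 100.0 * (sub + dele + ins) / ref,
        "sub": 100.0 * sub / ref,
        "del": 100.0 * dele / ref,
        "ins": 100.0 * ins / ref,
    }


# Counts taken from the wer.py summary in this post.
rates = error_rates(ref=89, sub=4, dele=85, ins=0)
print({name: round(value, 2) for name, value in rates.items()})
```

Under that definition the counts reproduce the reported figures: almost every reference token is counted as a deletion, which matches hypotheses that contain only quote characters and a trailing %.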

Feb 17 '25 15:02 kellkwang

Is it working normally now?

Feb 20 '25 13:02 FireRedTeam

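@FireRedTeam For LLM inference I used the default configuration. AED reproduces the reported results, but the LLM produces many repeated-decoding cases (the LLM is running with fp16 inference).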

Feb 21 '25 02:02 lzl-mt

Using bfloat16 instead fixes it.
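
The thread does not say why float16 fails where bfloat16 works. One possible factor (an assumption, not something confirmed here) is dynamic range: float16 has a 5-bit exponent and a largest finite value of 65504, while bfloat16 keeps float32's 8-bit exponent at the cost of mantissa precision. The sketch below illustrates that difference in pure Python; `to_fp16` and `to_bf16` are hypothetical helpers, and the bfloat16 emulation simply truncates the float32 bit pattern rather than rounding.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 70000.0  # just above the float16 maximum of 65504
print("fp16:", to_fp16(value))  # overflows to inf
print("bf16:", to_bf16(value))  # stays finite, with reduced precision
```

If intermediate activations in the model exceed the float16 range, they would become inf in float16 but remain finite in bfloat16, which would be consistent with the behaviour reported in this thread.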

Mar 03 '25 02:03 yangxjzwd1

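The FireRedASR-LLM-L model is currently not a standard Huggingface transformers structure. Its custom model-loading code lives in fireredasr_llm.py, where loading is controlled by args.use_flash_attn and args.use_fp16. Both are currently 0, so torch.float32 is used by default, and on my test GPU (RTX 3090) inference cannot complete and fails with CUDA OutOfMemory.

Since the FireRedASR-LLM-L model is shipped in tar format and cannot be modified, I edited the FireRedASR source file fireredasr.py directly and set use_fp16=1 so that inference runs in torch.float16. With float16 inference, however, the returned text was all %, and I found that this issue (abnormal inference results when fp16 is enabled) already exists on GitHub.

Modifying the FireRedASR source file fireredasr_llm.py to hard-code inference_dtype=torch.bfloat16 let the inference pipeline complete successfully:

    inference_dtype = torch.bfloat16
    # Build LLM
    llm = AutoModelForCausalLM.from_pretrained(
        args.llm_dir,
        attn_implementation=attn_implementation,
        torch_dtype=inference_dtype,
    )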
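
To see why the float32 default runs out of memory, here is a rough back-of-the-envelope estimate of weight memory per dtype. It uses the "all params" count printed in the log in the original post; which modules that count covers is an assumption, and the estimate ignores activations, the KV cache, and framework overhead. `weight_memory_gib` is a hypothetical helper name.

```python
# Storage size of one parameter for each dtype discussed in this thread.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# "all params" value from the log in the original post.
ALL_PARAMS = 7_777_097_216


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(ALL_PARAMS, dtype):.1f} GiB")
```

The float32 weights alone come to roughly 29 GiB, which exceeds the 24 GB of an RTX 3090 and leaves almost no headroom on a 32 GB card, while either 16-bit dtype needs about half of that.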

Sep 17 '25 09:09 AlanInAction
