Results: 129 comments of Yuekai Zhang

> hello! I build int8 weights:
>
> ```
> INFERENCE_PRECISION=float16
> WEIGHT_ONLY_PRECISION=int8
> MAX_BEAM_WIDTH=4
> MAX_BATCH_SIZE=8
> checkpoint_dir=whisper_large_v3_weights_${WEIGHT_ONLY_PRECISION}
> output_dir=whisper_large_v3_${WEIGHT_ONLY_PRECISION}
>
> # Convert the large-v3 model weights into TensorRT-LLM format.
> python3 convert_checkpoint.py --use_weight_only --weight_only_precision $WEIGHT_ONLY_PRECISION --output_dir...
> ```

@Plemeur Hi, you need to detect the language first, then set the text prefix to the detected language. You can't do it by setting the prompt alone.
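For reference, a minimal sketch of that flow using the openai-whisper package (the audio path is a placeholder; the text-prefix format follows Whisper's special tokens):

```python
import whisper

# Sketch: detect the language first, then build the decoder text prefix
# from the detected language token.
model = whisper.load_model("large-v3")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
# large-v3 uses 128 mel bins, so take n_mels from the model dims.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)  # e.g. "zh"

# Text prefix for the decoder, built from the detected language.
text_prefix = f"<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>"
print(text_prefix)
```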

@hjaved202 The Whisper TensorRT-LLM solution only supports the forward pass of the Whisper encoder and decoder, plus beam search. During decoding, users need to set...

@xqun3 https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/log/stats_summary.txt You could run the client in this project to debug, and use the file generated above to check the relationship between the actual inference batch size and your configuration. Also, in-flight batching support will be added soon, which should raise throughput by more than 20% over the current code. Stay tuned.
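If it helps, the same numbers can be pulled straight from Triton's standard statistics extension; a minimal sketch (the host and the model name `whisper` are assumptions, adjust them to your deployment):

```python
import requests

# Sketch: query Triton's statistics extension and estimate the average
# executed batch size from request vs. execution counts.
m = requests.get("http://localhost:8000/v2/models/whisper/stats").json()["model_stats"][0]

inferences = int(m.get("inference_count", 0))  # total inference requests served
executions = int(m.get("execution_count", 0))  # model executions (batches run)
if executions:
    print(f"average executed batch size: {inferences / executions:.2f}")
```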

> Hi @yuekaizhang, thanks for sharing the code, great work!
>
> However, I ran into a problem when actually deploying it: after the model is deployed, concurrent requests show no batching effect; instead, inference time grows in proportion to the concurrency. Is it that this implementation doesn't support Triton's dynamic batching? My batching config is as follows:
>
> ```
> dynamic_batching {
>   preferred_batch_size: [ 4, 8 ]
>   max_queue_delay_microseconds: 100
> }
> ```

@xqun3 Also check whether the audio clips sent by the client are all the same length. If they differ, they need to be uniformly padded to 30 seconds; otherwise they won't be grouped into one...
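A minimal sketch of that client-side padding (16 kHz sample rate assumed, as Whisper expects):

```python
import numpy as np

# Sketch: pad (or trim) every clip to exactly 30 s at 16 kHz before sending,
# so Triton's dynamic batcher can group requests with identical input shapes.
TARGET_LEN = 30 * 16000

def pad_to_30s(samples: np.ndarray) -> np.ndarray:
    if len(samples) >= TARGET_LEN:
        return samples[:TARGET_LEN]
    return np.pad(samples, (0, TARGET_LEN - len(samples)))
```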

Hi @krishnardt, whisper-triton is an accelerated serving solution; it can't improve Whisper's accuracy. If you can't get correct results with the PyTorch Whisper implementation, whisper-triton can't help either.

Try `Hotwords: Atomberg` as the text prefix to see if it works.
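The text prefix maps onto Whisper's prompt conditioning; a sketch with the reference openai-whisper package (the audio path and hotword are placeholders):

```python
import whisper

# Sketch: bias decoding toward a hotword by passing it as the initial prompt,
# which Whisper conditions on as preceding context.
model = whisper.load_model("large-v3")
result = model.transcribe("sample.wav", initial_prompt="Hotwords: Atomberg")
print(result["text"])
```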

@krishnardt See https://huggingface.co/openai/whisper-large-v3/blob/main/config.json#L36. You may change your build command for the decoder:

```
trtllm-build --checkpoint_dir ${checkpoint_dir}/decoder \
             --output_dir ${output_dir}/decoder \
             --moe_plugin disable \
             --max_beam_width ${MAX_BEAM_WIDTH} \
             --max_batch_size ${MAX_BATCH_SIZE} \
             --max_seq_len 448...
```
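That 448 comes from `max_target_positions` in the model config; a quick way to confirm it against your checkpoint (the local path is a placeholder):

```python
import json

# Sketch: read the decoder's max sequence length from the HF config,
# so --max_seq_len matches the checkpoint you converted.
with open("whisper-large-v3/config.json") as f:
    cfg = json.load(f)
print(cfg["max_target_positions"])  # 448 for large-v3
```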

> The endgame is to use enc-dec + prompt_embedding_table to run whisper model with tensorrt-llm cpp runtime, but the issue is easier to illustrate using official examples

@thefacetakt Would you...