Results: 129 comments of Yuekai Zhang

> hello! I build int8 weights:
>
> ```
> INFERENCE_PRECISION=float16
> WEIGHT_ONLY_PRECISION=int8
> MAX_BEAM_WIDTH=4
> MAX_BATCH_SIZE=8
> checkpoint_dir=whisper_large_v3_weights_${WEIGHT_ONLY_PRECISION}
> output_dir=whisper_large_v3_${WEIGHT_ONLY_PRECISION}
>
> # Convert the large-v3 model weights into TensorRT-LLM format.
> python3 convert_checkpoint.py --use_weight_only --weight_only_precision $WEIGHT_ONLY_PRECISION --output_dir...
> ```

@Plemeur Hi, you need to detect the language first, then set the text prefix to the detected language. You can't do it by setting the prompt alone.
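For reference, a minimal sketch of that flow using the openai-whisper package (the audio path is a placeholder; the text-prefix format follows Whisper's special tokens):

```python
import whisper

# Sketch: detect the language first, then build the decoder text prefix
# from the detected language token.
model = whisper.load_model("large-v3")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
# large-v3 uses 128 mel bins, so take n_mels from the model dims.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)  # e.g. "zh"

# Text prefix for the decoder, built from the detected language.
text_prefix = f"<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>"
print(text_prefix)
```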

@hjaved202 The Whisper TensorRT-LLM solution only supports the forward pass of the Whisper encoder and decoder, plus beam search. During decoding, users need to set...

@xqun3 https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/log/stats_summary.txt You could run the client in this project to debug, and use the file generated above to check the relationship between the actual inference batch size and your configuration. Also, in-flight batching support will be added soon, which should raise throughput by more than 20% over the current code. Stay tuned.
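If it helps, the same numbers can be pulled straight from Triton's standard statistics extension; a minimal sketch (the host and the model name `whisper` are assumptions, adjust them to your deployment):

```python
import requests

# Sketch: query Triton's statistics extension and estimate the average
# executed batch size from request vs. execution counts.
m = requests.get("http://localhost:8000/v2/models/whisper/stats").json()["model_stats"][0]

inferences = int(m.get("inference_count", 0))  # total inference requests served
executions = int(m.get("execution_count", 0))  # model executions (batches run)
if executions:
    print(f"average executed batch size: {inferences / executions:.2f}")
```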

> Hi @yuekaizhang, thanks for sharing the code, great work!
>
> However, I ran into a problem when actually deploying it: after the model is deployed, concurrent requests show no batching effect; instead, inference time grows in proportion to the concurrency. Is it that this implementation doesn't support Triton's dynamic batching? My batching config is as follows:
>
> ```
> dynamic_batching {
>   preferred_batch_size: [ 4, 8 ]
>   max_queue_delay_microseconds: 100
> }
> ```

@xqun3 Also check whether the audio clips sent by the client are all the same length. If they differ, they need to be uniformly padded to 30 seconds; otherwise they won't be grouped into one...
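A minimal sketch of that client-side padding (16 kHz sample rate assumed, as Whisper expects):

```python
import numpy as np

# Sketch: pad (or trim) every clip to exactly 30 s at 16 kHz before sending,
# so Triton's dynamic batcher can group requests with identical input shapes.
TARGET_LEN = 30 * 16000

def pad_to_30s(samples: np.ndarray) -> np.ndarray:
    if len(samples) >= TARGET_LEN:
        return samples[:TARGET_LEN]
    return np.pad(samples, (0, TARGET_LEN - len(samples)))
```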

Hi @krishnardt, whisper-triton is an accelerated serving solution; it can't improve Whisper's accuracy. If you can't get correct results with the PyTorch Whisper implementation, whisper-triton can't help either.

Try `Hotwords: Atomberg` as the text prefix to see if it works.
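The text prefix maps onto Whisper's prompt conditioning; a sketch with the reference openai-whisper package (the audio path and hotword are placeholders):

```python
import whisper

# Sketch: bias decoding toward a hotword by passing it as the initial prompt,
# which Whisper conditions on as preceding context.
model = whisper.load_model("large-v3")
result = model.transcribe("sample.wav", initial_prompt="Hotwords: Atomberg")
print(result["text"])
```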

@krishnardt See https://huggingface.co/openai/whisper-large-v3/blob/main/config.json#L36. You may change your build command for the decoder:

```
trtllm-build --checkpoint_dir ${checkpoint_dir}/decoder \
             --output_dir ${output_dir}/decoder \
             --moe_plugin disable \
             --max_beam_width ${MAX_BEAM_WIDTH} \
             --max_batch_size ${MAX_BATCH_SIZE} \
             --max_seq_len 448...
```
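That 448 comes from `max_target_positions` in the model config; a quick way to confirm it against your checkpoint (the local path is a placeholder):

```python
import json

# Sketch: read the decoder's max sequence length from the HF config,
# so --max_seq_len matches the checkpoint you converted.
with open("whisper-large-v3/config.json") as f:
    cfg = json.load(f)
print(cfg["max_target_positions"])  # 448 for large-v3
```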

> The endgame is to use enc-dec + prompt_embedding_table to run whisper model with tensorrt-llm cpp runtime, but the issue is easier to illustrate using official examples

@thefacetakt Would you...