Yuekai Zhang
> 1. Throughput is not RTFx; computing throughput is a bit more involved. 2. Difference between perf_analyzer and https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py: perf_analyzer --streaming uses a single wav file, whereas client.py could use...
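As a rough illustration of the distinction being made above (not the exact formulas used by perf_analyzer or client.py; all names are illustrative):

```python
# Sketch of RTF vs. throughput; illustrative only, not the benchmark scripts' code.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF: processing time per second of audio for one request (lower is better)."""
    return processing_seconds / audio_seconds


def throughput(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    """Throughput: seconds of audio decoded per wall-clock second across all
    concurrent streams, so it depends on concurrency, batching and server-side
    queueing, not only on per-request latency."""
    return total_audio_seconds / wall_clock_seconds


# Example: 100 clients each send 60 s of audio and the whole run takes 30 s of
# wall-clock time -> throughput is 200 "audio seconds" per second, even though
# each individual request may have a much higher RTF.
print(throughput(100 * 60, 30))  # 200.0
```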
> Understood, I am sharing the stats_summary here: This summary is for a run with num_workers set to 100. Looking at this, it seems like the initial inferences are taking more time....
For warmup, see https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#model-warmup. For stats.json, see https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md. For stats_summary.txt, I just converted it from stats.json, e.g. "batch_size 19, 18 times, infer 7875.14 ms, avg 437.51 ms, 23.03 ms input 47.86 ms,...
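A minimal sketch of that conversion, assuming the statistics-extension layout of stats.json (model_stats → batch_stats with {count, ns} counters); field names can differ between Triton versions, so treat this as illustrative only:

```python
# Turn Triton's stats.json into per-batch-size summary lines like the ones above.
# Assumes the statistics-extension layout; adjust field names to your Triton version.
import json

with open("stats.json") as f:
    stats = json.load(f)

for model in stats.get("model_stats", []):
    for b in model.get("batch_stats", []):
        infer = b["compute_infer"]
        inp = b["compute_input"]
        count = int(infer["count"])
        if count == 0:
            continue
        infer_ms = int(infer["ns"]) / 1e6
        input_ms = int(inp["ns"]) / 1e6
        print(f"batch_size {b['batch_size']}, {count} times, "
              f"infer {infer_ms:.2f} ms, avg {infer_ms / count:.2f} ms, "
              f"input {input_ms:.2f} ms")
```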
We have not started benchmarking and profiling yet. How do you configure your warmup setting? Also, we will later support the TensorRT backend, which should take less time to warm up compared...
> @yuekaizhang Since the zipformer streaming model is sequential, I just warmed it up with some dry runs. Also, I wanted to check about the logs on the GitHub client repo where RTF...
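For reference, a warmup "dry run" can be as simple as sending a few dummy requests through the same client path before measuring. The model and tensor names below ("zipformer", "WAV", "WAV_LENS") are placeholders, not the actual names from the repo; adjust them to your config.pbtxt:

```python
# Hedged sketch of warming up a Triton model with a few dummy requests.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

dummy_audio = np.zeros((1, 16000), dtype=np.float32)  # 1 s of silence at 16 kHz
dummy_len = np.array([[16000]], dtype=np.int32)

inputs = [
    httpclient.InferInput("WAV", list(dummy_audio.shape), "FP32"),
    httpclient.InferInput("WAV_LENS", list(dummy_len.shape), "INT32"),
]
inputs[0].set_data_from_numpy(dummy_audio)
inputs[1].set_data_from_numpy(dummy_len)

# A handful of dry runs is usually enough to trigger lazy initialization
# (memory pools, kernel autotuning, etc.) before real traffic arrives.
for _ in range(5):
    client.infer(model_name="zipformer", inputs=inputs)
```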
@haiderasad We have no plan to integrate faster-whisper. I recommend trying Whisper with TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper), which is currently the fastest implementation according to https://github.com/shashikg/WhisperS2T?tab=readme-ov-file#benchmark-and-technical-report.
See #551. @haiderasad
> I ran into the same problem as well. How did you solve it in the end? https://github.com/yanqiangmiffy/InstructGLM/issues/1#issuecomment-1482778224
> Hi @csukuangfj @yuekaizhang > > Here are some notes based on my understanding: > > * These _cache tensors are actually implicit states defined in NVIDIA Triton, which are used...
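For context, this is roughly what implicit state looks like from the client side: the _cache tensors never appear in the request, the client only marks sequence boundaries, and Triton's sequence batcher carries the state between requests that share the same sequence_id. The model and tensor names below are placeholders, not the actual names from the repo:

```python
# Hedged sketch of a stateful (sequence) request flow; the server keeps the
# implicit state, the client only sends audio chunks and sequence flags.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
seq_id = 1001

chunks = [np.zeros((1, 3200), dtype=np.float32) for _ in range(4)]  # fake audio chunks
for i, chunk in enumerate(chunks):
    inp = grpcclient.InferInput("WAV_CHUNK", list(chunk.shape), "FP32")
    inp.set_data_from_numpy(chunk)
    client.infer(
        model_name="streaming_zipformer",
        inputs=[inp],
        sequence_id=seq_id,
        sequence_start=(i == 0),                 # reset implicit state at sequence start
        sequence_end=(i == len(chunks) - 1),     # release state at sequence end
    )
```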
> I started working on it, but I am a bit confused about one thing. > > In https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/export.py#L291 I see you already have an ONNX export script for the streaming zipformer, right?...