Root causes of the issue: 1. The flashinfer FP4 TRTLLM-GEN MOE originally supported only two routing methods, DeepSeek_V3 and Llama4, so the routing was hardcoded to these two. After https://github.com/vllm-project/vllm/pull/27492...
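To illustrate the hardcoding described above, here is a minimal sketch of a dispatch that only knows those two routing methods; `RoutingMethod` and `select_routing_method` are hypothetical names for illustration, not the actual flashinfer or vLLM identifiers.

```python
from enum import Enum

# Hypothetical sketch: routing selection as it effectively behaved before
# the fix. Any model other than the two hardcoded ones cannot get its own
# routing method.
class RoutingMethod(Enum):
    DEEPSEEK_V3 = "deepseek_v3"
    LLAMA4 = "llama4"

def select_routing_method(model_type: str) -> RoutingMethod:
    if model_type == "deepseek_v3":
        return RoutingMethod.DEEPSEEK_V3
    if model_type == "llama4":
        return RoutingMethod.LLAMA4
    # Everything else falls through instead of using the model's routing.
    raise NotImplementedError(f"routing not supported for {model_type!r}")
```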
Hi @byStander9 , thanks for the question. If you change the scheduling policy by setting `exec_settings["settings_config"]["scheduler_policy"]`, it will change the scheduling policy during inference, because this is an initialization param...
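A minimal sketch of what that looks like in practice: the `exec_settings` key path comes from the comment above, while the policy value names (e.g. `"max_utilization"`, `"guaranteed_no_evict"`) and the `build_executor` helper are assumptions for illustration, not a definitive TensorRT-LLM API.

```python
# Set the policy before initialization; it is read once at startup and
# then applies for the whole inference run.
exec_settings = {
    "settings_config": {
        # Assumed policy string; check the TensorRT-LLM docs for valid values.
        "scheduler_policy": "max_utilization",
    }
}

def build_executor(exec_settings: dict):
    # Hypothetical helper: the policy is consumed here, at init time,
    # which is why mutating the dict later has no effect mid-run.
    policy = exec_settings["settings_config"]["scheduler_policy"]
    print(f"initializing executor with scheduler policy: {policy}")
    ...
```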
Hi @byStander9 , the requests' latencies are recorded in [request_latencies](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataclasses/reporting.py#L84).
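For intuition, here is a small sketch of how per-request latencies can be collected and summarized; the real logic lives in the `reporting.py` file linked above, and the class and field names here are assumptions, not the actual TensorRT-LLM dataclasses.

```python
import statistics
from dataclasses import dataclass, field
from typing import List

# Illustrative only: one latency entry per completed request, in seconds.
@dataclass
class LatencyReport:
    request_latencies: List[float] = field(default_factory=list)

    def record(self, start_time: float, end_time: float) -> None:
        # Latency of a single request, end-to-end.
        self.request_latencies.append(end_time - start_time)

    def summary(self) -> dict:
        lat = sorted(self.request_latencies)
        return {
            "mean_s": statistics.mean(lat),
            "p50_s": lat[len(lat) // 2],
            "p99_s": lat[min(len(lat) - 1, int(len(lat) * 0.99))],
        }
```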