juney-nvidia


Hi @mklachko, NVFP4 introduces a two-level quantization method: the first (top) level is a per-tensor quantization scaling factor, and the second level is a fine-grained blockwise quantization scaling factor, and yes...

> [@juney-nvidia](https://github.com/juney-nvidia) thanks! Why do we need the two levels of scaling factors? Is it because of the limited range of FP8? Can you please point me to the relevant...
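A minimal NumPy sketch of such a two-level scheme, assuming NVFP4's 16-element blocks, an FP4 (E2M1) max magnitude of 6.0, and an FP8 (E4M3) max magnitude of 448; the function names are illustrative, not the TensorRT-LLM API. It also suggests an answer to the range question: the per-tensor scale is what keeps every per-block scale within FP8's limited range.

```python
import numpy as np

FP4_MAX = 6.0     # largest magnitude representable in the 4-bit E2M1 format
FP8_MAX = 448.0   # largest magnitude representable in FP8 E4M3
BLOCK = 16        # NVFP4 quantizes in blocks of 16 elements

def quantize_two_level(x):
    # Level 1: one per-tensor scale, chosen so that every per-block scale,
    # once expressed relative to it, fits inside FP8's limited range.
    tensor_scale = max(np.abs(x).max(), 1e-12) / (FP4_MAX * FP8_MAX)
    blocks = x.reshape(-1, BLOCK)
    # Level 2: one fine-grained scale per 16-element block.
    block_scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12) / FP4_MAX
    # Relative block scales; <= FP8_MAX by construction, so they can be
    # stored in FP8 E4M3 in the real format.
    block_scales_fp8 = block_scales / tensor_scale
    # The values themselves; real NVFP4 would cast these to 4-bit floats.
    q = np.clip(blocks / block_scales, -FP4_MAX, FP4_MAX)
    return q, block_scales_fp8, tensor_scale

def dequantize_two_level(q, block_scales_fp8, tensor_scale):
    # Undo both levels of scaling.
    return (q * block_scales_fp8 * tensor_scale).reshape(-1)

x = np.random.randn(4 * BLOCK).astype(np.float32)
q, s_block, s_tensor = quantize_two_level(x)
x_hat = dequantize_two_level(q, s_block, s_tensor)
# Near-zero here, since this sketch computes the scales but skips the
# actual FP4/FP8 casting that introduces quantization error.
print(np.abs(x - x_hat).max())
```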

@MartinMarciniszyn for vis. + @zeroepoch @chzblych @mk-nvidia for vis. @aspctu Hi, we are working on publishing pre-built containers to the public to make it easier for the community to use, and it...

Adding @kaiyux to make sure he is aware of the addition of AutoDeploy as another backend for trtllm-bench. Thanks June

@Fridah-nv Hi Frida, since this is just a doc change, to save time, after you finish the refinement based on Lucas's feedback you can just run the "bot skip" command...

@chuangz0 @xiaoweiw-nv @pcastonguay pls help review this MR. Thanks June

@liquanfeng pls rebase this MR onto the latest main branch. @Barry-Delaney pls help review this MR once it's ready. Thanks June

@chuangz0 @xiaoweiw-nv @pcastonguay can you help review this PD-related MR? Thanks June

Also tagging @pcastonguay @schetlur-nv for vis. on this UCX backend support MR for disaggregated serving. Thanks June

@jeffye-dev Hi, the 253 perf number was generated with the [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/deepseek_v3#running-the-benchmark) command. Also, to achieve the 253 perf number, some required MRs are being prepared to be merged into the...