
[RFC] Topics you want to discuss with the TensorRT-LLM team in the upcoming meet-ups

Open · juney-nvidia opened this issue on Mar 27, 2025 · 3 comments

Dear Community Members,

The recently completed "GitHub-first" move of TensorRT-LLM should facilitate more seamless interaction with our developer community and enhance collaborative innovation.

To further improve information transparency, we are planning to organize a series of online meet-ups. These sessions will focus on sharing the latest developments in TensorRT-LLM and discussing technical topics that may be of particular interest to the community.

We warmly invite you to:

  • Suggest discussion topics you would like us to address in upcoming sessions

  • Share your valuable feedback to help us improve

Your insights will be instrumental in advancing LLM inference solutions on NVIDIA GPUs and driving the evolution of accelerated computing.

Thanks,
The TensorRT-LLM Engineering Team

juney-nvidia commented on Mar 27, 2025

Can we suggest open sourcing all of the kernels as a topic to discuss? :)

It will be easier to adapt the library to new models with unique architectures if there are no magical black-box parts we can't edit. Nothing is more frustrating than being almost able to do a thing, except you'd need to change 3-4 lines in one of the libraries that are compiled for no particular reason, so the whole thing goes nowhere.

As an example, take the recent issue with Cohere2 requiring the cyclic KV cache to be disabled... in that situation it seems I was lucky that you guys were also interested in supporting it in the closed-source kernels, but what if you hadn't been? There would have been no way for anyone else to properly support the model.
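
For concreteness, the workaround is essentially a KV-cache configuration tweak. Below is a minimal sketch using the Python LLM API, assuming `KvCacheConfig` and its `max_attention_window` field behave as documented; the model name and sizes are illustrative, not a verified recipe:

```python
# Minimal sketch: keeping the KV cache linear (non-cyclic) via the LLM API.
# Assumption: KvCacheConfig exposes max_attention_window as documented. The
# cache only becomes cyclic when the attention window is shorter than the
# sequence, so matching the window to the max sequence length avoids it.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

MAX_SEQ_LEN = 8192  # hypothetical value for this sketch

kv_cache_config = KvCacheConfig(
    max_attention_window=[MAX_SEQ_LEN],  # per-layer windows; one entry broadcasts
    free_gpu_memory_fraction=0.9,
)

# Illustrative Cohere2-architecture checkpoint.
llm = LLM(model="CohereForAI/c4ai-command-r7b-12-2024",
          kv_cache_config=kv_cache_config)
print(llm.generate(["Hello, world"]))
```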

The closed-source parts include: internal_cutlass_kernels, nvrtc_wrapper, FMHA, XQA, trtllmGen.

Ideally there would be no .so or cubin files in the repo... at the end of the day people will use NVIDIA hardware to run this, so where is the problem with sharing the code? 👀

aikitoria avatar Mar 27 '25 16:03 aikitoria

> Can we suggest open sourcing all of the kernels as a topic to discuss? :)
>
> […]

Thanks for sharing the feedback @aikitoria. We deeply appreciate the community’s passion for transparency and customization, which aligns with our own commitment to fostering a collaborative ecosystem.

At NVIDIA, we've intentionally open-sourced a significant portion of TensorRT-LLM's kernels (many of the customized kernels used in TensorRT-LLM are already available in the kernels directory, though not all) to empower community experimentation. We're actively working to expand this scope further based on community needs, and your example of the cyclic KV cache adjustment is precisely the kind of input that helps us prioritize. That said, decisions around kernel availability involve balancing accessible interfaces for customization against sustainable development practices, which occasionally requires internal reviews and phased implementation.

To address your core concern: While not every component can be immediately open-sourced due to organizational policies and long-term support considerations, we’re heavily investing in two fronts:

  • Community-driven prioritization: Using feedback like yours to accelerate the release of kernels critical for emerging architectures.
  • Extensibility-first design: Improving modular interfaces to reduce reliance on "black box" components, even when low-level code isn’t exposed.

The presence of compiled binaries (e.g., .so/cubin) in certain performance-critical modules reflects optimization efforts tailored to NVIDIA hardware, but we’re continuously evaluating opportunities to expose more implementation details where feasible. We encourage contributors to file GitHub issues for specific blockers—this helps us advocate for prioritization internally.

Your frustration is entirely valid, and we’re committed to evolving TensorRT-LLM in a way that respects both community innovation and the realities of maintaining enterprise-grade software. Let’s keep this dialogue open as we work toward solutions together!

Thanks,
June

juney-nvidia commented on Mar 29, 2025

> The presence of compiled binaries (e.g., .so/cubin) in certain performance-critical modules reflects optimization efforts tailored to NVIDIA hardware

I'm not sure I understand the reasoning here; why couldn't it be optimized for NVIDIA hardware and still have the code available to edit? 👀 Isn't this just CUDA code compiled with the same nvcc we have?

Having the code available doesn't require it to be less optimized; it might even become more optimized over time! There are a lot of people out there interested in experimenting with making the thing faster on their specific NVIDIA GPU...
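
To illustrate the point: compiling plain CUDA C with the public toolchain is routine, whether offline with nvcc or at runtime through NVRTC (presumably the same mechanism the nvrtc_wrapper component builds on). Here's a minimal sketch using CuPy's `RawKernel`, which JIT-compiles CUDA source via NVRTC; the kernel is a trivial stand-in, not one of the closed-source ones:

```python
# Minimal sketch: JIT-compiling ordinary CUDA C through the public NVRTC
# toolchain using CuPy. The kernel below is a trivial stand-in.
import cupy as cp
import numpy as np

source = r"""
extern "C" __global__
void scale(const float* x, float* y, float a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
"""

scale = cp.RawKernel(source, "scale")  # compiled via NVRTC on first use

n = 1 << 20
x = cp.arange(n, dtype=cp.float32)
y = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
scale((blocks,), (threads,), (x, y, np.float32(2.0), np.int32(n)))

assert cp.allclose(y, 2.0 * x)  # same result an offline-compiled kernel would give
```

The same source builds offline with `nvcc --cubin` just as well; shipping optimized binaries and shipping the .cu source aren't mutually exclusive.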

aikitoria commented on Mar 29, 2025

When is the next meeting?

WilliamTambellini commented on Mar 31, 2025

Where can we find the roadmap?

DwenGu commented on Apr 8, 2025

> When is the next meeting?

The first online meet-up will be held at the end of April, where we will introduce the latest status of the PyTorch-centric re-architecture of TensorRT-LLM.

Thanks,
June

juney-nvidia commented on Apr 11, 2025

> When is the next meeting?

We are working with the prod team to prepare it, @laikhtewari. When it is ready, we will share it with the public.

Thanks,
June

juney-nvidia commented on Apr 11, 2025

I’d like to suggest two topics for discussion in the upcoming meet-ups:

  • Getting Started with TensorRT-LLM: A beginner-friendly guide on how new contributors can start learning about TensorRT-LLM and get involved, including an overview of the architecture and highlighting issues labeled as “good first issue”.

  • PyTorch Backend Optimizations: A technical deep dive into the PyTorch backend—specifically, the optimizations introduced by the TensorRT team to accelerate inference on NVIDIA GPUs.

ankitmaurya001 commented on Apr 25, 2025

> I'd like to suggest two topics for discussion in the upcoming meet-ups: Getting Started with TensorRT-LLM, and PyTorch Backend Optimizations.

There is a recent talk about TensorRT-LLM that covers some of the topics you asked about here:

  • https://forums.developer.nvidia.com/t/beyond-the-algorithm-the-new-pytorch-architecture-for-tensorrt-llm/331008

June

juney-nvidia commented on Apr 27, 2025

@juney-nvidia

Enjoyed the talk and the thorough explanation of the TensorRT-LLM PyTorch design.

Can you share the slides?

jeromeku commented on Jun 27, 2025

Is there going to be another TensorRT-LLM meetup?

kevinlu1248 commented on Sep 26, 2025