juney-nvidia
> When is the next meeting?

The first online meet-up will be arranged at the end of April, in which we will introduce the latest status of the PyTorch-centric re-architecture of...
> When is the next meeting?

We are working with the prod team to prepare it, @laikhtewari. When it becomes ready, we will share it with the public. Thanks, June
> I’d like to suggest two topics for discussion in the upcoming meet-ups:
>
> * Getting Started with TensorRT-LLM: A beginner-friendly guide on how new contributors can start learning...
@kaiyux @Kefeng-Duan for visibility on this question from the community. @laikhtewari for visibility as well. June
> ```
> trtllm-serve nvidia/DeepSeek-R1-FP4 \
>   --max_batch_size 256 --max_num_tokens 32768 \
>   --max_seq_len 32768 --kv_cache_free_gpu_memory_fraction 0.95 \
>   --host 0.0.0.0 --port 30001 --trust_remote_code --backend pytorch --tp_size 8 --ep_size 8...
> ```
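For anyone trying a command like the one quoted above, here is a minimal sketch of how the launched server could be queried, assuming `trtllm-serve` exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the host and port from the command; adjust the URL and model name to your deployment.

```python
# Hedged sketch: query a trtllm-serve instance via its OpenAI-compatible HTTP API.
# Assumptions: the server from the quoted command is reachable at localhost:30001
# and the served model name matches the checkpoint passed to trtllm-serve.
import requests

payload = {
    "model": "nvidia/DeepSeek-R1-FP4",
    "messages": [{"role": "user", "content": "Summarize what expert parallelism is."}],
    "max_tokens": 128,
}

resp = requests.post(
    "http://localhost:30001/v1/chat/completions", json=payload, timeout=300
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```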
> Great to hear this! [@juney-nvidia](https://github.com/juney-nvidia), do we have a plan to set up EP partition analytic models?
>
> It is generally believed that EP should be evenly distributed...
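To make the "evenly distributed" intuition concrete, here is a toy illustrative sketch (not TensorRT-LLM's actual partitioning code) that assigns experts to EP ranks round-robin and reports how a routing trace loads each rank; a real analytic model would also account for routing skew across experts.

```python
# Illustrative only: a round-robin expert-to-rank mapping showing what an "even"
# EP partition looks like, plus a per-rank load count for a sample routing trace.
from collections import Counter

def partition_experts(num_experts: int, ep_size: int) -> dict[int, int]:
    """Map each expert id to an EP rank, round-robin."""
    return {e: e % ep_size for e in range(num_experts)}

def per_rank_load(routed_expert_ids: list[int], mapping: dict[int, int]) -> Counter:
    """Count how many routed tokens land on each EP rank."""
    return Counter(mapping[e] for e in routed_expert_ids)

mapping = partition_experts(num_experts=256, ep_size=8)  # 32 experts per rank
print(per_rank_load([0, 1, 8, 9, 17, 255], mapping))     # skewed toy trace
```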
Hi @khayamgondal, we did some performance studies earlier on offloading the KV cache to CPU, and the findings at that time showed there isn't a perf gain, so we only...
> Thanks, June I'm working on a study to understand how much hit performance > takes when part of the inference process (KV cache in this scenario) is > offloaded...
> Thanks [@juney-nvidia](https://github.com/juney-nvidia), I am looking at the `KvCacheConfig` class and wondering: if I set the following to 0, would this force the KV cache not to use the GPU?
>
> ...
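The specific field in the quoted question is truncated above, so for reference here is only a hedged sketch of constructing a `KvCacheConfig` through the LLM API; the parameters shown (`free_gpu_memory_fraction`, `host_cache_size`) are assumptions based on the public API and should be verified against your installed TensorRT-LLM version.

```python
# Hedged sketch, not an answer to the truncated question above: building a
# KvCacheConfig with parameters believed to exist in the TensorRT-LLM LLM API.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.95,  # fraction of free GPU memory reserved for KV cache
    host_cache_size=0,              # assumed: bytes of host (CPU) memory for offload; 0 disables it
)

llm = LLM(model="nvidia/DeepSeek-R1-FP4", kv_cache_config=kv_cache_config)
```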
@lucaslie Thanks for improving dev productivity! @niukuo Since this is a container-related change, can you also help review this MR? June