Kris Hung
Hi @apokerce, it would be great if you could confirm whether the OOM still happens. From my end, using the http client doesn't introduce any memory growth. For the grpc client...
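In case it helps reproduce the check, here is a minimal sketch of the kind of loop I use to watch for client-side memory growth (it assumes a server at `localhost:8000` and a hypothetical model `my_model` with a single FP32 input `INPUT0`; adjust the names and shapes to your deployment):

```
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
data = np.random.rand(1, 16).astype(np.float32)

for _ in range(100000):
    # Build fresh inputs each iteration so any per-request leak shows up.
    inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
    inputs[0].set_data_from_numpy(data)
    client.infer("my_model", inputs)
    # Watch the client process RSS (e.g. with `top` or psutil) while this
    # runs; it should stay flat if there is no leak.
```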
Closing due to lack of activity. Please re-open the issue if you would like to follow up.
Hi @Gcstk, thanks for bringing this up. There will be some API changes and fixes needed if you'd like to compile the TRT backend with TRT 10. I'd recommend waiting...
@mc-nv I think the changes in this PR have already been merged in previous PRs. Should we close this one?
It looks like your GPU doesn't support peer-to-peer access. Could you run `nvidia-smi topo -m` to see if that's the case? I did have a similar issue before where my...
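If it's easier than reading the topology matrix, a rough programmatic check along the same lines (assuming PyTorch is installed and more than one GPU is visible) would be something like:

```
import torch

# Print whether each pair of visible GPUs can access each other's memory
# directly (peer-to-peer); complements the `nvidia-smi topo -m` output.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access "
                  f"{'supported' if ok else 'NOT supported'}")
```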
@geraldstanje I think it might also require NVLink for p2p access - I'm not sure about this part, so the TRT-LLM GitHub channel should be able to give more clarification. From my experience, I...
@geraldstanje Sure thing! I'm using the command in the [README](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#prepare-tensorrt-llm-engines) as an example. Basically just adding the last line when building engines:
```
# Build TensorRT engines
trtllm-build --checkpoint_dir ./c-model/gpt2/fp16/4-gpu \...
```
CC @oandreeva-nv for visibility.
Hi @mfournioux, I was wondering whether the same GPUs are visible to both containers, and whether the containers are running on the same machine?
I think CUDA shared memory can only be used when the client and server share the same GPU. Could you try deploying both the client and the server on the same GPU device?
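For reference, a minimal sketch of registering a CUDA shared memory region from the Python client (it assumes `tritonclient[all]` is installed, the server is at `localhost:8000`, and a hypothetical model `my_model` with one FP32 input `INPUT0` of 16 elements; this only works when the client can see the same GPU as the server, i.e. both run on the same machine):

```
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")
client.unregister_cuda_shared_memory()  # clean up any stale regions

input_data = np.arange(16, dtype=np.float32)
byte_size = input_data.size * input_data.itemsize

# Create a CUDA shared memory region on GPU 0 and copy the input into it.
shm_handle = cudashm.create_shared_memory_region("input_region", byte_size, 0)
cudashm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with the server; this is where it fails if the client
# and server are not on the same machine / GPU.
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(shm_handle), 0, byte_size
)

# Point the request input at the shared memory region instead of sending bytes.
inputs = [httpclient.InferInput("INPUT0", [16], "FP32")]
inputs[0].set_shared_memory("input_region", byte_size)
result = client.infer("my_model", inputs)
```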