Kris Hung

Hi @apokerce, it would be great if you could confirm whether the OOM still occurs. On my end, using the HTTP client doesn't introduce any memory growth: ![image](https://github.com/triton-inference-server/server/assets/43719498/6f19a772-4446-4067-9463-ff1024696cdd) For the gRPC client...
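
For reference, here is a minimal sketch of the kind of loop I use to watch for client-side memory growth (the model name `my_model`, the input name `INPUT0`, and the input shape are placeholders; `psutil` is assumed to be installed):

```python
# Sketch: repeatedly infer over HTTP and watch this process's RSS
# for unbounded growth. Model/input names and shape are placeholders.
import numpy as np
import psutil
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
proc = psutil.Process()

data = np.random.rand(1, 16).astype(np.float32)  # placeholder shape
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

for i in range(10000):
    client.infer("my_model", inputs=[inp])
    if i % 1000 == 0:
        # A steadily increasing RSS across iterations would indicate a leak.
        print(f"iter {i}: rss={proc.memory_info().rss / 1e6:.1f} MB")
```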

Closing due to lack of activity. Please re-open if you would like to follow up on this issue.

Hi @Gcstk, thanks for bringing this up. There will be some API changes and fixes needed if you'd like to compile the TRT backend with TRT 10. I'd recommend waiting...

@mc-nv I think the changes in this PR were already merged as part of a previous PR. Should we close this one?

It looks like your GPUs don't support peer-to-peer access. Could you run `nvidia-smi topo -m` to confirm whether that's the case? I did have a similar issue before where my...
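
If you want a programmatic check as well, here is a small sketch (assuming PyTorch is installed); it simply queries CUDA's reported peer capability for each GPU pair:

```python
# Sketch: query CUDA peer-to-peer capability between every GPU pair.
# Assumes PyTorch is installed in the environment.
import torch

num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: p2p={'yes' if ok else 'no'}")
```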

@geraldstanje I think P2P access might also require NVLink. I'm not sure about this part, so the TRT-LLM GitHub channel should be able to give more clarification. From my experience, I...

@geraldstanje Sure thing! I'm using the command in the [README](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#prepare-tensorrt-llm-engines) as an example. Basically just add the last line when building the engines:

```
# Build TensorRT engines
trtllm-build --checkpoint_dir ./c-model/gpt2/fp16/4-gpu \
...
```

Hi @mfournioux, I was wondering whether the same GPUs are visible to both containers, and whether the containers are running on the same machine?

I think CUDA shared memory only works when the client and server use the same GPU. Can you try deploying both the client and the server on the same GPU device?
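
For reference, a minimal sketch of how a CUDA shared-memory region is created and registered with the Triton Python client (the region name, byte size, and `device_id` below are placeholders); the registration step is where a device mismatch between client and server would surface:

```python
# Sketch: create and register a CUDA shared-memory region with Triton.
# Assumes tritonclient with CUDA shared-memory support is installed.
# Region name, byte size, and device_id are placeholders.
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

byte_size = 64 * 4  # e.g. 64 FP32 elements
shm_handle = cudashm.create_shared_memory_region(
    "input_region", byte_size, device_id=0)

# Registration fails unless the server can access the same GPU device
# that the region was allocated on.
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(shm_handle), 0, byte_size)
```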