Ryan McCormick

159 comments by Ryan McCormick

Hi @zwei2016, You'll need to install any Python dependencies required by your Python model inside the container before starting the server. For example, via `pip install ...`. You can...
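A minimal sketch of one way to do this, assuming a Dockerfile that extends the Triton image (image tag and package names below are illustrative, not from the original comment):

```shell
# Hypothetical example: bake the Python packages your python-backend
# model needs into a custom image built on top of the Triton container.
cat > Dockerfile <<'EOF'
FROM nvcr.io/nvidia/tritonserver:24.02-py3
RUN pip install numpy torch
EOF
docker build -t tritonserver-custom .

# Then launch the server from the extended image (ports/paths illustrative):
# docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
#   -v /path/to/models:/models \
#   tritonserver-custom tritonserver --model-repository=/models
```

Alternatively, you can `docker exec` into a running container and `pip install` there, but baking dependencies into the image keeps the environment reproducible across restarts.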

Hi @rahchuenmonroe, This applies to input/output tensors within Triton core, before and after the model execution in the backend. If you are communicating with Triton over the network (HTTP/GRPC), then...

Can you update the PR title to be more descriptive? (e.g., cancellation, decoupled responses, etc., rather than the JIRA ticket number)

Hi @geraldstanje, Thanks for raising this issue. I believe this error generally indicates a version mismatch issue: > [TensorRT-LLM][ERROR] Assertion failed: d == a + length You mentioned the following...

Hi @geraldstanje, Triton 24.02 + TRTLLM v0.8.0 should work. The 7b models should likely fit on a single GPU with 24GB memory, but you can use tensor parallelism to split...
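A rough sketch of the tensor-parallel build flow, assuming the TRT-LLM v0.8.0-era workflow (script names, flags, and paths below are illustrative and may differ between releases; check the TRT-LLM docs for your version):

```shell
# Hypothetical sketch: split a 7B model across 2 GPUs with tensor
# parallelism instead of fitting it on a single 24GB card.

# 1. Convert the HF checkpoint, sharding weights with --tp_size.
python convert_checkpoint.py \
    --model_dir ./llama-7b-hf \
    --output_dir ./tllm_ckpt \
    --dtype float16 \
    --tp_size 2

# 2. Build the engines (one engine file per rank).
trtllm-build \
    --checkpoint_dir ./tllm_ckpt \
    --output_dir ./engines \
    --gemm_plugin float16

# 3. Serve with one MPI rank per GPU so each rank loads its shard.
mpirun -n 2 --allow-run-as-root \
    tritonserver --model-repository=/path/to/model_repo
```

With `tp_size=2`, each GPU holds roughly half of the model weights, at the cost of inter-GPU communication on every forward pass.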

I don't believe the Ubuntu 20.04 host should be an issue, as the container will have the required Ubuntu 22.04 inside. As for the CUDA/driver version, see this note from...

Hi @geraldstanje, for questions about running the engine directly (outside of Triton) via `run.py` and specific details of the standalone engine performance, I would reach out in the TRT-LLM github...

@fpetrini15 @krishung5 do you know anything about these multi-gpu engine build warnings? My assumption is that this is saying multi-gpu performance may be degraded without direct p2p access like NVLink,...

Hi @jamied157, Thanks for such detailed repro steps and investigation! I had a quick follow-up question from your description so far. You mentioned that you can reproduce this without request...

Another question @jamied157 @HennerM - given Henner's [proposed fix](https://github.com/triton-inference-server/core/pull/341) for the sequence batcher issue for a standalone sequence model (no ensemble), and: > My Repro above can cause a slightly...