
Improve documentation regarding input tensor memory types and device IDs

Open acmorrow opened this issue 7 months ago • 1 comment

Is your feature request related to a problem? Please describe. The memory management API in triton/core/tritonserver.h lacks documentation about when input tensors should be attached with a memory type other than CPU. In particular, if I have data in ordinary CPU memory that I want to use for inference against a model instance on a GPU, is it my responsibility as the caller of the Triton API to allocate GPU memory and copy the data over before attaching it and starting inference? Or should I leave the data in CPU memory and trust Triton to move it to a GPU for me? I'm hoping it is the latter, because it isn't clear how the caller could know which device to allocate from / copy to when there is more than one GPU.
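To make the "hoped-for" workflow concrete, here is a minimal sketch of what I mean by leaving the data in CPU memory. The input name, shape, and dtype are hypothetical and error checking is elided; the question is whether this is sufficient for Triton to copy the buffer to whichever GPU hosts the selected model instance:

```cpp
#include "triton/core/tritonserver.h"

#include <cstdint>
#include <vector>

// Attach an input tensor that lives in ordinary CPU memory, hoping that
// Triton itself moves it to the right GPU for the chosen model instance.
// (Return values are TRITONSERVER_Error* and should be checked; elided here.)
void AppendCpuInput(
    TRITONSERVER_InferenceRequest* request, const std::vector<float>& data)
{
  const int64_t shape[] = {1, static_cast<int64_t>(data.size())};

  // Hypothetical input name and dtype; these must match the model config.
  TRITONSERVER_InferenceRequestAddInput(
      request, "INPUT0", TRITONSERVER_TYPE_FP32, shape, 2);

  // Attach the CPU buffer as-is: memory type CPU, memory type id 0.
  TRITONSERVER_InferenceRequestAppendInputData(
      request, "INPUT0", data.data(), data.size() * sizeof(float),
      TRITONSERVER_MEMORY_CPU, 0 /* memory_type_id */);
}
```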

Describe the solution you'd like Examples like https://github.com/triton-inference-server/server/blob/main/src/simple.cc show explicit allocation on the GPU with cudaMalloc and explicit placement of input tensors on the GPU with cudaMemcpy. The implication is that code using the C++ API should do the same for input tensors. But it isn't clear how that should work when models have instances on more than one GPU; I'd assume the Triton server is better positioned to know which GPU should receive the data for a given inference request. A sketch of this explicit-placement path is shown below.
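The following is a sketch of that explicit path, modeled loosely on what simple.cc does. The input name and shape are hypothetical, and CUDA/Triton error checking is elided. The part I don't know how to write correctly is the choice of `device_id` when the model has instances on several GPUs:

```cpp
#include "triton/core/tritonserver.h"

#include <cuda_runtime_api.h>

#include <cstdint>
#include <vector>

// Explicitly stage the input on a GPU before attaching it, as simple.cc
// demonstrates. How the caller is supposed to pick device_id when model
// instances exist on multiple GPUs is exactly what's unclear.
void AppendGpuInput(
    TRITONSERVER_InferenceRequest* request,
    const std::vector<float>& host_data, int device_id)
{
  const size_t byte_size = host_data.size() * sizeof(float);

  // Allocate on the chosen device and copy the host data across.
  cudaSetDevice(device_id);
  void* gpu_buffer = nullptr;
  cudaMalloc(&gpu_buffer, byte_size);
  cudaMemcpy(gpu_buffer, host_data.data(), byte_size, cudaMemcpyHostToDevice);

  const int64_t shape[] = {1, static_cast<int64_t>(host_data.size())};
  TRITONSERVER_InferenceRequestAddInput(
      request, "INPUT0", TRITONSERVER_TYPE_FP32, shape, 2);

  // For GPU memory, memory_type_id carries the CUDA device ordinal.
  TRITONSERVER_InferenceRequestAppendInputData(
      request, "INPUT0", gpu_buffer, byte_size,
      TRITONSERVER_MEMORY_GPU, device_id);
}
```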

Please add some documentation to tritonserver.h that makes it easier to understand the circumstances under which input tensors should be attached with a specific memory type and memory type id. I think the answer is that if I know that some region of memory already represents GPU memory on a particular device, and some other system has already placed the input data there, then that information should be propagated correctly by setting the memory type and id appropriately.

Basically, I think the documentation should clarify the correct workflow for optimal inference in the very common case where:

  • The input data is currently on ordinary CPU memory
  • The model is running on one or more GPUs

I guess another way of framing this question: is it somehow disadvantageous to leave the memory type set to CPU when my data is on the CPU but I'm aiming for GPU-accelerated inference, such that I would do better to move the data to the GPU myself and then invoke inference with the memory type set to GPU and a memory type id selected? If so, how can I know which GPU to allocate from / copy to when the model is instantiated on more than one GPU? If it isn't disadvantageous, I think that should be made clear in the documentation, since being able to ignore this aspect in the common case considerably simplifies the programming model.

Describe alternatives you've considered None

acmorrow · May 19 '25 21:05

Just a quick clarification: I'm writing my own wrapper around the Triton core via the C API in tritonserver.h, so the Triton gRPC and HTTP servers aren't in play here. Anything those servers or their associated clients can do with respect to placing data on the GPU in one process and then having Triton use it from another isn't relevant to my question. This is strictly about the programming interface to the Triton core.

acmorrow · May 19 '25 21:05