What is the best way to get a sub-tensor without a data copy?
Before the attention operation, the q, k, and v tensors are packed into one big tensor `qkv`, and I would like to do some in-place operations on q and k only.
Currently I follow the code here and split the tensor with `query, key, value = split(qkv, [self.attention_hidden_size, kv_size, kv_size], dim=2)`, but I saw in the comment that this actually requires a memory copy:

> The slice layer selects for each dimension a start location from within the input tensor, and copies elements to the output tensor using a stride of 1 across the input tensor.

which is not ideal in my case.
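In context, it looks roughly like this (a minimal sketch assuming TensorRT-LLM's Python functional API; the scaling op and `scale` are just hypothetical placeholders for the in-place work I want to do on q and k):

```python
from tensorrt_llm.functional import split

# qkv: [batch, seq_len, attention_hidden_size + 2 * kv_size].
# Each output of split() is produced by a TensorRT slice layer,
# which copies its elements out of qkv into a new output tensor.
query, key, value = split(
    qkv, [self.attention_hidden_size, kv_size, kv_size], dim=2
)

# Hypothetical element-wise updates applied to q and k only; v is untouched.
query = query * scale
key = key * scale
```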
Wondering: is there a way to get the sub-tensors without a data copy, or will TensorRT-LLM optimize away the unnecessary data copies?
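For comparison, plain PyTorch has the behavior I am hoping for: `torch.split` returns views that share storage with the original tensor, so an in-place op on q writes straight into `qkv` with no copy (sizes below are made up for illustration):

```python
import torch

# qkv: [batch, seq_len, attention_hidden_size + 2 * kv_size]
qkv = torch.randn(2, 8, 4096 + 2 * 512)

# torch.split returns views sharing storage with qkv -- no data copy.
q, k, v = torch.split(qkv, [4096, 512, 512], dim=2)

# In-place update on the view modifies the q region of qkv directly.
q.mul_(0.5)
assert torch.equal(qkv[..., :4096], q)
```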