TensorRT-LLM
FP8Linear.forward cannot be called twice
TLDR: Calling forward on the same FP8Linear layer twice raises an error:
...
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 892, in forward
alpha = self.weights_scaling_factor.raw_value * self.activation_scaling_factor.raw_value
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in raw_value
assert isinstance(
AssertionError: Must be np.ndarray. Proper usage: get parameter.raw_value before getting parameter.value
The underlying reason is the order in which .raw_value and .value are accessed.
First, forward reads .raw_value of the scaling factor parameters:
alpha = self.weights_scaling_factor.raw_value * self.activation_scaling_factor.raw_value
After that, forward calls .value on the same parameters.
A call to .value overwrites the parameter's ._value with a constant, so any later access to .raw_value fails the isinstance assertion shown in the traceback.
Hence it is not possible to call FP8Linear forward twice. This looks like a bug.
(Calling it twice is currently needed, for example, in a cross-attention layer, where we first call self.qkv(hidden_states) and then self.qkv(encoder_output).)
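
The mechanism can be reproduced without TensorRT-LLM using a small stand-in. This is only a sketch under stated assumptions: ToyParameter, ToyConstant and ToyFP8Linear are hypothetical names that imitate the behaviour described in this issue, a plain float stands in for the np.ndarray, and none of it is the library's actual code.

```python
class ToyConstant:
    """Hypothetical stand-in for the graph constant that .value produces."""

    def __init__(self, raw):
        self.raw = raw


class ToyParameter:
    """Hypothetical stand-in for a parameter; NOT the tensorrt_llm class."""

    def __init__(self, raw):
        self._value = float(raw)  # a float plays the role of the np.ndarray

    @property
    def raw_value(self):
        # Same kind of check as in the traceback: only the raw form is allowed.
        assert isinstance(self._value, float), (
            "Must be raw. Proper usage: get parameter.raw_value "
            "before getting parameter.value"
        )
        return self._value

    @property
    def value(self):
        # The first access replaces the raw value with a constant, for good.
        if isinstance(self._value, float):
            self._value = ToyConstant(self._value)
        return self._value


class ToyFP8Linear:
    """Hypothetical layer that mirrors the access order described above."""

    def __init__(self):
        self.weights_scaling_factor = ToyParameter(0.5)
        self.activation_scaling_factor = ToyParameter(2.0)

    def forward(self, x):
        # Step 1: read the raw values (works only while they are still raw).
        alpha = (self.weights_scaling_factor.raw_value
                 * self.activation_scaling_factor.raw_value)
        # Step 2: touch .value, which converts both parameters to constants.
        self.weights_scaling_factor.value
        self.activation_scaling_factor.value
        return x * alpha


layer = ToyFP8Linear()
first = layer.forward(3.0)  # first call succeeds

try:
    layer.forward(3.0)      # second call hits the assertion in raw_value
    second_call_failed = False
    error_message = ""
except AssertionError as exc:
    second_call_failed = True
    error_message = str(exc)
```

In this toy version the second forward fails for the same reason as in the traceback: the parameter no longer holds its raw form once .value has been read.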