
TensorRT Quantization Breaks for `LlamaLinearScalingRotaryEmbedding`

Open Sanger2000 opened this issue 2 years ago • 0 comments

In nvidia-ammo, these lines in `ammo/torch/export/layer_utils.py` appear to fail unexpectedly for some Llama variants:

(screenshot of the relevant check in `ammo/torch/export/layer_utils.py`)

In particular, the DeepSeek models use `LlamaLinearScalingRotaryEmbedding`. Because the class name contains "Linear", the module is picked up by the `is_linear` check and treated as a dense layer. However, this module has no `.weight`, so `build_linear_config` fails.
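A minimal, self-contained sketch of the mismatch (the `is_linear` stand-in below is illustrative, not the actual nvidia-ammo implementation):

```python
# Hypothetical stand-in: rotary embeddings register buffers such as
# inv_freq, but carry no .weight parameter like a real dense layer.
class LlamaLinearScalingRotaryEmbedding:
    def __init__(self):
        self.inv_freq = [1.0, 0.5, 0.25]  # a buffer, not a weight

def is_linear(module) -> bool:
    # A purely name-based check matches the "Linear" substring
    # in LlamaLinearScalingRotaryEmbedding as well.
    return "Linear" in type(module).__name__

rope = LlamaLinearScalingRotaryEmbedding()
assert is_linear(rope)               # mis-classified as a dense layer
assert not hasattr(rope, "weight")   # so exporting its weight must fail
```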

There are several easy fixes for this (for example, checking whether "Rotary" appears in the class name and skipping that case). I'm happy to contribute a patch, but I don't think there is an OSS repo to submit it to.
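One possible shape for the name-based skip, sketched here as a standalone function (the function name and guard are illustrative, not the actual nvidia-ammo API):

```python
# Hypothetical guard: skip rotary-embedding classes and require a real
# weight tensor before treating a module as a dense/linear layer.
def is_linear(module) -> bool:
    name = type(module).__name__
    if "Rotary" in name:  # e.g. LlamaLinearScalingRotaryEmbedding
        return False
    return "Linear" in name and hasattr(module, "weight")

class Linear:
    """Minimal stand-in for a dense layer with a weight tensor."""
    def __init__(self):
        self.weight = [[1.0]]

class LlamaLinearScalingRotaryEmbedding:
    """Minimal stand-in for the rotary-embedding module."""

assert is_linear(Linear())                              # real dense layer
assert not is_linear(LlamaLinearScalingRotaryEmbedding())  # correctly skipped
```

The extra `hasattr(module, "weight")` check also covers any other weight-less module whose class name happens to contain "Linear".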

Sanger2000 — Feb 11 '24 07:02