
feat: [AutoDeploy] DeepseekV3 e2e support with sdpa attention

Open sugunav14 opened this issue 7 months ago • 4 comments

Support a DeepseekV3 e2e example without the attention forward patch

  • [x] Modify the "TritonWithFlattenedInputs" backend to support SDPA-style attention with different head dimensions for v_head_dim and qk_head_dim
  • [x] Add unit tests for the DeepseekV3 e2e example with Triton kernels (single- and multi-GPU cases) with skip_loading_weights set to True
  • TODO: DeepseekV3 weights are in FP8. Need to handle this case to run the e2e example with real weights
  • TODO: Use the scale passed in instead of the default scale for the attention op
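To illustrate what the first checklist item requires: unlike standard multi-head attention, DeepseekV3's MLA uses a query/key head dimension that differs from the value head dimension, so the attention output takes its width from V, not from Q/K. The following is a minimal pure-Python sketch of single-query, single-head SDPA with mismatched head dims (a toy stand-in, not the Triton backend; the function name and list-based tensors are illustrative only):

```python
import math

def sdpa_one_query(q, k, v):
    """Toy SDPA for one query vector and one head.

    q: [qk_head_dim]; k: num_kv x [qk_head_dim]; v: num_kv x [v_head_dim].
    Returns a vector of length v_head_dim -- the output shape follows V,
    which is the case the backend must support for DeepseekV3.
    """
    qk_head_dim = len(q)
    # Scaled dot-product scores over the qk_head_dim axis.
    scores = [sum(qi * ki for qi, ki in zip(q, kk)) / math.sqrt(qk_head_dim)
              for kk in k]
    # Numerically stable softmax over the KV positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of V rows: note the output width is v_head_dim.
    v_head_dim = len(v[0])
    return [sum(w * row[j] for w, row in zip(weights, v))
            for j in range(v_head_dim)]
```

With a zero query the softmax weights are uniform, so the output is simply the mean of the V rows, which makes the shape behavior easy to check by hand.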

sugunav14 avatar May 14 '25 02:05 sugunav14

TODO: DeepseekV3 weights are in FP8. Need to handle this case to run e2e example with weights

I think we currently don't have example support for quantized models not produced by ModelOpt. Are we planning to support deepseek-ai/DeepSeek-V3 directly, or a ModelOpt-quantized version of this model?

Fridah-nv avatar May 15 '25 19:05 Fridah-nv

I wonder if this change enables deepseek-ai/DeepSeek-R1 to run as well?

Fridah-nv avatar May 15 '25 20:05 Fridah-nv

@sugunav14, what's the issue with fp8 weight loading?

lucaslie avatar May 15 '25 23:05 lucaslie

@sugunav14, what's the issue with fp8 weight loading?

DeepseekV3 weights are in FP8 on Hugging Face. Since we have the load_state_dict() patch in place now, it loads the weights in FP8, which causes a dtype mismatch during the forward pass. I think we currently only support ModelOpt-quantized models, as @Fridah-nv mentioned.
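One common workaround for this kind of mismatch is to upcast any FP8 entries to the model's compute dtype at load time, before the forward pass ever sees them. The sketch below shows only that filtering logic; the `FakeTensor` class, the `upcast_fp8_state_dict` helper, and the string-valued dtypes are all hypothetical stand-ins (real code would operate on torch tensors and dtypes such as `torch.float8_e4m3fn`), and this is not the fix actually adopted in the MR:

```python
# Stand-in for a tensor that tracks its dtype as a string, so the
# sketch stays dependency-free.
class FakeTensor:
    def __init__(self, data, dtype):
        self.data = data
        self.dtype = dtype

    def to(self, dtype):
        # Pretend-cast: return a new tensor tagged with the target dtype.
        return FakeTensor(self.data, dtype)

# The FP8 formats commonly used for checkpoint weights.
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}

def upcast_fp8_state_dict(state_dict, compute_dtype="bfloat16"):
    """Upcast FP8 weights to the compute dtype; leave the rest untouched.

    This avoids the dtype mismatch in forward by ensuring every weight
    the model runs with is in a single compute dtype.
    """
    return {
        name: t.to(compute_dtype) if t.dtype in FP8_DTYPES else t
        for name, t in state_dict.items()
    }
```

Note that a plain upcast discards the memory savings of FP8 and ignores any per-tensor scale factors shipped with the checkpoint, which is presumably why proper quantized-checkpoint support (as with ModelOpt models) is the longer-term path.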

sugunav14 avatar May 15 '25 23:05 sugunav14

Merged in this MR

sugunav14 avatar Jun 05 '25 23:06 sugunav14