
feat: [AutoDeploy] DeepseekV3 e2e support with sdpa attention

Open sugunav14 opened this issue 7 months ago • 4 comments

Support a DeepseekV3 e2e example without the attention forward patch

  • [x] Modify the "TritonWithFlattenedInputs" backend to support SDPA-style attention with different head dimensions for v_head_dim and qk_head_dim
  • [x] Add unit tests for the DeepseekV3 e2e example with Triton kernels (single- and multi-GPU cases) with skip_loading_weights set to True
  • TODO: DeepseekV3 weights are in FP8. Need to handle this case to run the e2e example with real weights
  • TODO: Use the scale passed in instead of the default scale for the attention op
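To illustrate what the first checklist item requires: unlike standard multi-head attention, DeepseekV3's MLA uses a query/key head dimension that differs from the value head dimension, so the attention output takes its width from V, not from Q/K. The following is a minimal pure-Python sketch of single-query, single-head SDPA with mismatched head dims (a toy stand-in, not the Triton backend; the function name and list-based tensors are illustrative only):

```python
import math

def sdpa_one_query(q, k, v):
    """Toy SDPA for one query vector and one head.

    q: [qk_head_dim]; k: num_kv x [qk_head_dim]; v: num_kv x [v_head_dim].
    Returns a vector of length v_head_dim -- the output shape follows V,
    which is the case the backend must support for DeepseekV3.
    """
    qk_head_dim = len(q)
    # Scaled dot-product scores over the qk_head_dim axis.
    scores = [sum(qi * ki for qi, ki in zip(q, kk)) / math.sqrt(qk_head_dim)
              for kk in k]
    # Numerically stable softmax over the KV positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of V rows: note the output width is v_head_dim.
    v_head_dim = len(v[0])
    return [sum(w * row[j] for w, row in zip(weights, v))
            for j in range(v_head_dim)]
```

With a zero query the softmax weights are uniform, so the output is simply the mean of the V rows, which makes the shape behavior easy to check by hand.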

sugunav14 avatar May 14 '25 02:05 sugunav14

TODO: DeepseekV3 weights are in FP8. Need to handle this case to run e2e example with weights

I think we currently don't have example support for quantized models not produced by ModelOpt. Are we planning to support deepseek-ai/DeepSeek-V3 directly, or a ModelOpt-quantized version of this model?

Fridah-nv avatar May 15 '25 19:05 Fridah-nv

I wonder if this change enables deepseek-ai/DeepSeek-R1 to run as well?

Fridah-nv avatar May 15 '25 20:05 Fridah-nv

@sugunav14, what's the issue with fp8 weight loading?

lucaslie avatar May 15 '25 23:05 lucaslie

@sugunav14, what's the issue with fp8 weight loading?

DeepseekV3 weights are in FP8 on Hugging Face. Since we have the load_state_dict() patch in place now, it loads the weights in FP8, which causes a dtype mismatch during the forward pass. I think we currently only support ModelOpt-quantized models, as @Fridah-nv mentioned.
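One common workaround for this kind of mismatch is to upcast any FP8 entries to the model's compute dtype at load time, before the forward pass ever sees them. The sketch below shows only that filtering logic; the `FakeTensor` class, the `upcast_fp8_state_dict` helper, and the string-valued dtypes are all hypothetical stand-ins (real code would operate on torch tensors and dtypes such as `torch.float8_e4m3fn`), and this is not the fix actually adopted in the MR:

```python
# Stand-in for a tensor that tracks its dtype as a string, so the
# sketch stays dependency-free.
class FakeTensor:
    def __init__(self, data, dtype):
        self.data = data
        self.dtype = dtype

    def to(self, dtype):
        # Pretend-cast: return a new tensor tagged with the target dtype.
        return FakeTensor(self.data, dtype)

# The FP8 formats commonly used for checkpoint weights.
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}

def upcast_fp8_state_dict(state_dict, compute_dtype="bfloat16"):
    """Upcast FP8 weights to the compute dtype; leave the rest untouched.

    This avoids the dtype mismatch in forward by ensuring every weight
    the model runs with is in a single compute dtype.
    """
    return {
        name: t.to(compute_dtype) if t.dtype in FP8_DTYPES else t
        for name, t in state_dict.items()
    }
```

Note that a plain upcast discards the memory savings of FP8 and ignores any per-tensor scale factors shipped with the checkpoint, which is presumably why proper quantized-checkpoint support (as with ModelOpt models) is the longer-term path.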

sugunav14 avatar May 15 '25 23:05 sugunav14

Merged in this MR

sugunav14 avatar Jun 05 '25 23:06 sugunav14