feat: [AutoDeploy] DeepseekV3 e2e support with sdpa attention
Support deepseekv3 e2e example without attention forward patch
- [x] Modify the `TritonWithFlattenedInputs` backend to support SDPA-style attention where `v_head_dim` differs from `qk_head_dim`
- [x] Add unit tests for the DeepseekV3 e2e example with Triton kernels (single- and multi-GPU cases) with `skip_loading_weights` set to `True`
- TODO: DeepseekV3 weights are in FP8; handle this case so the e2e example can run with real weights
- TODO: Use the scale passed in instead of the default scale for the attention op
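The first checklist item can be illustrated with a minimal sketch (assumed illustrative shapes, not the actual `TritonWithFlattenedInputs` implementation): PyTorch's SDPA only requires query and key to share a head dim, so the value head dim may differ, which is what DeepseekV3's attention needs.

```python
import torch
import torch.nn.functional as F

# DeepseekV3-style dims: qk_head_dim != v_head_dim
# (192 = 128 nope + 64 rope dims; v_head_dim = 128)
batch, n_heads, seq_len = 2, 4, 8
qk_head_dim, v_head_dim = 192, 128

q = torch.randn(batch, n_heads, seq_len, qk_head_dim)
k = torch.randn(batch, n_heads, seq_len, qk_head_dim)
v = torch.randn(batch, n_heads, seq_len, v_head_dim)

# SDPA accepts value tensors with a different head dim;
# the output inherits v's head dim.
out = F.scaled_dot_product_attention(q, k, v)
assert out.shape == (batch, n_heads, seq_len, v_head_dim)
```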
> TODO: DeepseekV3 weights are in FP8. Need to handle this case to run e2e example with weights
I think we currently don't have example support for quantized models that aren't provided by ModelOpt. Are we planning to support deepseek-ai/DeepSeek-V3 directly, or a ModelOpt-quantized version of this model?
I wonder if this change enables deepseek-ai/DeepSeek-R1 to run as well?
@sugunav14, what's the issue with fp8 weight loading?
DeepseekV3 weights are published in FP8 on Hugging Face. Since we now have the `load_state_dict()` patch in place, the weights get loaded in FP8, which causes a dtype mismatch during the forward pass. I think we currently only support ModelOpt-quantized models, as @Fridah-nv mentioned.
Merged in this MR