Attention FP8 compute type
When we use the FP8 data type, we found that the FFN GEMM and the attention projection support real FP8 compute (this is supported on H20 and L20), but Q @ transpose(K) and softmax @ V in attention don't support FP8 compute; FP8 first has to be dequantized to FP16/BF16. Why?
@Tracin Could you please have a look? Thanks
@enozhu Because we do not implement FP8 FMHA before Hopper. So H20 can support attention with FP8 computation.
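For reference, here is a minimal standalone check (not a TensorRT-LLM API, just the plain CUDA runtime) of which FP8 paths the rule above implies for a given device, assuming FP8 GEMM is available from Ada (SM89, e.g. L20) and FP8 FMHA only from Hopper (SM90, e.g. H20) onward:

// Minimal sketch: query compute capability to see which FP8 paths apply.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess)
    {
        std::printf("failed to query device 0\n");
        return 1;
    }
    int const sm = prop.major * 10 + prop.minor;
    if (sm >= 90)
    {
        std::printf("SM%d: FP8 GEMM and FP8 FMHA both available (e.g. H20)\n", sm);
    }
    else if (sm == 89)
    {
        std::printf("SM%d: FP8 GEMM available, attention falls back to FP16/BF16 (e.g. L20)\n", sm);
    }
    else
    {
        std::printf("SM%d: no native FP8 support\n", sm);
    }
    return 0;
}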
When enabling use_fp8_context_fmha, the context phase is supposed to perform the Q@K^T operation in FP8, but it seems that qkv is not being properly quantized to FP8. Instead, it uses a scale of 1.0 to forcefully cast FP16/BF16 to FP8 in the function applyBiasRopeUpdateKVCache with convert_to_fp8:
if (params.quantized_fp8_output)
{
    // use 1.0f scale currently for qkv input of FP8 FMHA.
    mmha::convert_to_fp8(quantized_q_ptr, q);
}
Am I missing some information, or is this done intentionally? Wouldn't this cause precision issues, or is the numerical range already clamped during training? @Tracin
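For context, here is a small standalone sketch of what a scale-of-1.0 conversion to E4M3 does numerically. It is not the TensorRT-LLM kernel itself, only an illustration using the CUDA __nv_fp8_e4m3 type: values are rounded to 3 mantissa bits and saturated around +-448, so typical Q/K/V magnitudes survive the cast with a few percent of rounding error.

// Hedged illustration, equivalent in spirit to a convert_to_fp8 with scale 1.0f:
// round-to-nearest into E4M3, saturating at the format's max (~448).
#include <cuda_fp8.h>   // __nv_fp8_e4m3, CUDA 11.8+
#include <cstdio>
#include <cmath>

int main()
{
    float const samples[] = {0.017f, 0.83f, 3.1f, 12.5f, 100.0f, 500.0f};
    for (float x : samples)
    {
        __nv_fp8_e4m3 q{x};                   // direct cast, no scaling
        float back = static_cast<float>(q);   // dequantize for inspection
        std::printf("x = %8.3f -> fp8(e4m3) -> %8.3f  (rel err %.3f%%)\n",
                    x, back, 100.0f * std::fabs(back - x) / std::fabs(x));
    }
    return 0;
}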
Yeah, you are right. We set the scale to 1.0 intentionally for fast conversion, and from our study this does not hurt model accuracy.
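The study itself isn't shared here, but the intuition can be sketched with a toy experiment (my own, on a synthetic unit-scale Gaussian tensor, not TensorRT-LLM code): because E4M3 is a floating-point format, the relative rounding error is largely scale-invariant as long as the values stay inside its dynamic range, so a per-tensor amax-derived scale and a fixed scale of 1.0 give similar error.

// Toy comparison: quantize a synthetic activation tensor to E4M3 once with
// scale 1.0 and once with an amax-derived per-tensor scale, then compare
// mean relative error after dequantization.
#include <cuda_fp8.h>
#include <cstdio>
#include <cmath>
#include <vector>
#include <random>

static float quantDequant(float x, float scale)
{
    __nv_fp8_e4m3 q{x * scale};             // quantize
    return static_cast<float>(q) / scale;   // dequantize
}

int main()
{
    // Synthetic "QKV" values: roughly unit-scale Gaussian activations.
    std::mt19937 rng(0);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> x(4096);
    float amax = 0.0f;
    for (float& v : x)
    {
        v = dist(rng);
        amax = std::fmax(amax, std::fabs(v));
    }

    float const e4m3Max = 448.0f;
    float const scales[] = {1.0f, e4m3Max / amax}; // identity vs. amax-based
    for (float s : scales)
    {
        double err = 0.0;
        for (float v : x)
        {
            err += std::fabs(quantDequant(v, s) - v) / (std::fabs(v) + 1e-6f);
        }
        std::printf("scale %-10.3f mean relative error: %.4f\n", s, err / x.size());
    }
    return 0;
}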
Could you please share the studies on the impact of setting the scale to 1.0 on model accuracy? @Tracin