TensorRT-LLM

attention fp8 compute type

enozhu opened this issue 1 year ago · 5 comments

When we use the FP8 data type, we found that the FFN GEMMs and the attention projection support real FP8 compute (this works on H20 and L20), but Q * transpose(K) and softmax * V inside attention don't support FP8 compute; the FP8 inputs first have to be dequantized to FP16/BF16. Why? (This dequantize-then-compute pattern is sketched below.)

enozhu avatar Jul 09 '24 07:07 enozhu
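For readers following along, here is a minimal host-side sketch of the pattern the question describes, with FP8 storage simulated as plain floats; the names dequantizeFp8, attentionScoreDequantPath, qScale, and kScale are made up for illustration and this is not TensorRT-LLM code:

#include <cstddef>
#include <vector>

// Dequantize a value that was stored in FP8 back to the compute type
// (FP16/BF16 in the real kernels; plain float in this host-side sketch).
inline float dequantizeFp8(float storedFp8, float scale)
{
    return storedFp8 * scale;
}

// Q @ K^T for one (query, key) pair, following the pattern described above:
// the operands live in FP8, but the dot product itself runs in the higher
// precision after dequantization rather than on FP8 tensor cores.
float attentionScoreDequantPath(std::vector<float> const& q8, std::vector<float> const& k8,
    float qScale, float kScale)
{
    float acc = 0.f;
    for (std::size_t i = 0; i < q8.size(); ++i)
    {
        acc += dequantizeFp8(q8[i], qScale) * dequantizeFp8(k8[i], kScale);
    }
    return acc;
}

A "real FP8 compute" path would instead feed the FP8 operands directly to an FP8 GEMM and fold qScale * kScale into the output scale, which is what the question says the FFN and projection GEMMs already do.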

@Tracin Could you please have a look? Thanks

QiJune avatar Jul 09 '24 12:07 QiJune

@enozhu Because we did not implement FP8 FMHA before Hopper. So I think H20, which is a Hopper part, can support attention with FP8 computation.

Tracin avatar Jul 10 '24 06:07 Tracin
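As a side note, the Hopper-only availability mentioned here can be probed at runtime with the standard CUDA device-properties API; supportsFp8Fmha below is a hypothetical helper for illustration, not TensorRT-LLM's actual dispatch logic:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: return true if the device can run FP8 FMHA kernels,
// which per the comment above exist only for Hopper (compute capability 9.x).
bool supportsFp8Fmha(int device)
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
    {
        return false;
    }
    return prop.major >= 9;
}

int main()
{
    std::printf("FP8 FMHA available on device 0: %s\n", supportsFp8Fmha(0) ? "yes" : "no");
    return 0;
}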

When enabling use_fp8_context_fmha, the context phase is supposed to perform the Q@K^T operation in FP8, but it seems that qkv is not being properly quantized to FP8. Instead, it uses a scale of 1.0 to forcefully cast FP16/BF16 to FP8 in the function applyBiasRopeUpdateKVCache with convert_to_fp8:

if (params.quantized_fp8_output)
{
    // use 1.0f scale currently for qkv input of FP8 FMHA.
    mmha::convert_to_fp8(quantized_q_ptr, q);
}

Am I missing something, or is this done intentionally? Wouldn't it cause precision issues, or is the numerical range already clamped during training? @Tracin

unbelievable3513 avatar Jul 10 '24 08:07 unbelievable3513
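For clarity, the two options touched on in this comment can be sketched on the host like this (illustrative only, not TensorRT-LLM code; castWithUnitScale, quantizeWithScale, and perTensorScale are made-up names, 448 is the largest finite E4M3 magnitude, and rounding is ignored):

#include <algorithm>
#include <cmath>
#include <vector>

// Largest finite magnitude representable in FP8 E4M3.
constexpr float kFp8E4m3Max = 448.f;

// What the snippet above effectively does: a cast with an implicit scale of
// 1.0, so any input whose magnitude exceeds 448 saturates.
float castWithUnitScale(float x)
{
    return std::clamp(x, -kFp8E4m3Max, kFp8E4m3Max);
}

// The alternative raised in the comment: divide by a per-tensor scale derived
// from the observed absolute maximum so the E4M3 range is used without clamping.
float quantizeWithScale(float x, float scale)
{
    return std::clamp(x / scale, -kFp8E4m3Max, kFp8E4m3Max);
}

float perTensorScale(std::vector<float> const& values)
{
    float amax = 0.f;
    for (float v : values)
    {
        amax = std::max(amax, std::fabs(v));
    }
    return std::max(amax, 1e-6f) / kFp8E4m3Max;
}

With a 1.0 scale, anything whose magnitude exceeds 448 saturates; with an amax-derived per-tensor scale, the representable range is matched to the data at the cost of an extra reduction over the tensor.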

Yeah, you are right. We set the scale to 1.0 intentionally for fast conversion, and from our study this does not hurt model accuracy.

Tracin avatar Jul 10 '24 08:07 Tracin
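One way to sanity-check that claim on a specific model is to measure how much of the Q/K/V distribution a 1.0 scale would actually saturate when cast to E4M3; fractionClampedAtUnitScale below is a hypothetical helper, not part of TensorRT-LLM:

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: how much of a Q/K/V sample would a 1.0 scale saturate
// when cast to FP8 E4M3 (largest finite magnitude 448)?
double fractionClampedAtUnitScale(std::vector<float> const& qkv)
{
    constexpr float kFp8E4m3Max = 448.f;
    std::size_t clamped = 0;
    for (float v : qkv)
    {
        if (std::fabs(v) > kFp8E4m3Max)
        {
            ++clamped;
        }
    }
    return qkv.empty() ? 0.0 : static_cast<double>(clamped) / static_cast<double>(qkv.size());
}

If that fraction is essentially zero, a 1.0 scale loses nothing to saturation and only pays E4M3's coarser mantissa, which would be consistent with the accuracy observation above.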

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Aug 10 '24 01:08 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar Aug 25 '24 01:08 github-actions[bot]

Could you please share the research on the impact of setting the scale to 1.0 on model accuracy? @Tracin

wanzhenchn avatar Sep 03 '24 03:09 wanzhenchn