
[FP8 Post-Training Quantization] "use_fp8_context_fmha" Not Supported as Described

Open taozhang9527 opened this issue 1 year ago • 2 comments

System Info

  • CPU: x86
  • GPU: H100
  • Server: XE9640
  • Code: TensorRT-LLM 0.8.0 release

Who can help?

@Tracin @juney-nvidia

Regarding FP8 post-training quantization, the documentation note says: "Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable"

However, --use_fp8_context_fmha enable is not a recognized trtllm-build option. All the options I can see are as follows:

usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG]
                    [--build_config BUILD_CONFIG] [--model_cls_file MODEL_CLS_FILE]
                    [--model_cls_name MODEL_CLS_NAME] [--timing_cache TIMING_CACHE]
                    [--log_level LOG_LEVEL]
                    [--profiling_verbosity {layer_names_only,detailed,none}]
                    [--enable_debug_output] [--output_dir OUTPUT_DIR] [--workers WORKERS]
                    [--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN]
                    [--max_output_len MAX_OUTPUT_LEN] [--max_beam_width MAX_BEAM_WIDTH]
                    [--max_num_tokens MAX_NUM_TOKENS]
                    [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE]
                    [--use_fused_mlp] [--gather_all_token_logits]
                    [--gather_context_logits] [--gather_generation_logits]
                    [--strongly_typed] [--builder_opt BUILDER_OPT]
                    [--logits_dtype {float16,float32}]
                    [--weight_only_precision {int8,int4}]
                    [--bert_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gpt_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gemm_plugin {float16,float32,bfloat16,disable}]
                    [--lookup_plugin {float16,float32,bfloat16,disable}]
                    [--lora_plugin {float16,float32,bfloat16,disable}]
                    [--context_fmha {enable,disable}]
                    [--context_fmha_fp32_acc {enable,disable}]
                    [--paged_kv_cache {enable,disable}]
                    [--remove_input_padding {enable,disable}]
                    [--use_custom_all_reduce {enable,disable}]
                    [--multi_block_mode {enable,disable}]
                    [--enable_xqa {enable,disable}]
                    [--attention_qk_half_accumulation {enable,disable}]
                    [--tokens_per_block TOKENS_PER_BLOCK]
                    [--use_paged_context_fmha {enable,disable}]
                    [--use_context_fmha_for_generation {enable,disable}]
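
For reference, a quick way to confirm which fmha-related flags the installed build exposes (a minimal sketch, assuming trtllm-build is on PATH and grep is available):

# List the fmha-related options known to the installed trtllm-build
trtllm-build --help | grep -i fmha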

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Quantize HF LLaMA 70B into FP8 and export trtllm checkpoint

python ../quantization/quantize.py --model_dir ./tmp/llama/70B \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./tllm_checkpoint_2gpu_fp8 \
    --calib_size 512 \
    --tp_size 2
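
Before building, the exported checkpoint can be sanity-checked (a sketch; the exact file layout is an assumption and may vary by version):

# Expect a config.json plus one rank*.safetensors file per TP rank
ls ./tllm_checkpoint_2gpu_fp8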

Build trtllm engines from the trtllm checkpoint

Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
    --output_dir ./engine_outputs \
    --gemm_plugin float16 \
    --strongly_typed \
    --workers 2 \
    --use_fp8_context_fmha enable

Expected behavior

Expected the engine build to kick off.

Actual behavior

trtllm-build: error: unrecognized arguments: --use_fp8_context_fmha enable

Additional notes

There are many options for the trtllm-build command. Detailed documentation on those options would be useful for users to set the correct ones.

taozhang9527 avatar Apr 17 '24 06:04 taozhang9527

The flag use_fp8_context_fmha is not supported in v0.8.0; it was added in v0.9.0. Please try v0.9.0.

byshiue avatar Apr 19 '24 08:04 byshiue
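
A minimal sketch of the upgrade path (the pip package name tensorrt_llm and the NVIDIA extra index URL are assumptions; adjust for your install method):

# Check the currently installed version
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Upgrade to v0.9.0, where the flag was added
pip install tensorrt_llm==0.9.0 --extra-index-url https://pypi.nvidia.com

# Confirm the flag is now recognized
trtllm-build --help | grep use_fp8_context_fmha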

Yes, I tried 0.9.0 and it is supported now.

What is the relationship between --use_fp8_context_fmha and --context_fmha enable? If I use --use_fp8_context_fmha, do I still need --context_fmha enable?

In general, is there any documentation for these different build options?

taozhang9527 avatar Apr 29 '24 21:04 taozhang9527

Yes, you need to enable both. It is explained in https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-attention.md#fp8-context-fmha.

byshiue avatar May 09 '24 06:05 byshiue
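
Putting the answer together, a minimal sketch of the build command with both options enabled (paths and plugin choices reused from the reproduction steps above; see the linked doc for the exact constraints):

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
    --output_dir ./engine_outputs \
    --gemm_plugin float16 \
    --strongly_typed \
    --workers 2 \
    --context_fmha enable \
    --use_fp8_context_fmha enable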