
[FP8 Post-Training Quantization] "use_fp8_context_fmha" Not Supported as Described

Open taozhang9527 opened this issue 1 year ago • 2 comments

System Info

  • CPU: x86
  • GPU: H100
  • Server: XE9640
  • Code: TensorRT-LLM 0.8.0 release

Who can help?

@Tracin @juney-nvidia

Regarding FP8 post-training quantization, the documentation note says: "Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable"

However, --use_fp8_context_fmha enable is not a recognized trtllm-build option. All the options I can see are as follows:

usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG]
                    [--build_config BUILD_CONFIG] [--model_cls_file MODEL_CLS_FILE]
                    [--model_cls_name MODEL_CLS_NAME] [--timing_cache TIMING_CACHE]
                    [--log_level LOG_LEVEL]
                    [--profiling_verbosity {layer_names_only,detailed,none}]
                    [--enable_debug_output] [--output_dir OUTPUT_DIR] [--workers WORKERS]
                    [--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN]
                    [--max_output_len MAX_OUTPUT_LEN] [--max_beam_width MAX_BEAM_WIDTH]
                    [--max_num_tokens MAX_NUM_TOKENS]
                    [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE]
                    [--use_fused_mlp] [--gather_all_token_logits]
                    [--gather_context_logits] [--gather_generation_logits]
                    [--strongly_typed] [--builder_opt BUILDER_OPT]
                    [--logits_dtype {float16,float32}]
                    [--weight_only_precision {int8,int4}]
                    [--bert_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gpt_attention_plugin {float16,float32,bfloat16,disable}]
                    [--gemm_plugin {float16,float32,bfloat16,disable}]
                    [--lookup_plugin {float16,float32,bfloat16,disable}]
                    [--lora_plugin {float16,float32,bfloat16,disable}]
                    [--context_fmha {enable,disable}]
                    [--context_fmha_fp32_acc {enable,disable}]
                    [--paged_kv_cache {enable,disable}]
                    [--remove_input_padding {enable,disable}]
                    [--use_custom_all_reduce {enable,disable}]
                    [--multi_block_mode {enable,disable}]
                    [--enable_xqa {enable,disable}]
                    [--attention_qk_half_accumulation {enable,disable}]
                    [--tokens_per_block TOKENS_PER_BLOCK]
                    [--use_paged_context_fmha {enable,disable}]
                    [--use_context_fmha_for_generation {enable,disable}]
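
For reference, a quick way to confirm which fmha-related flags the installed build exposes (a minimal sketch, assuming trtllm-build is on PATH and grep is available):

# List the fmha-related options known to the installed trtllm-build
trtllm-build --help | grep -i fmha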

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Quantize HF LLaMA 70B into FP8 and export trtllm checkpoint

python ../quantization/quantize.py --model_dir ./tmp/llama/70B \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./tllm_checkpoint_2gpu_fp8 \
    --calib_size 512 \
    --tp_size 2
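
Before building, the exported checkpoint can be sanity-checked (a sketch; the exact file layout is an assumption and may vary by version):

# Expect a config.json plus one rank*.safetensors file per TP rank
ls ./tllm_checkpoint_2gpu_fp8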

Build trtllm engines from the trtllm checkpoint

Enable fp8 context fmha to get further acceleration by setting --use_fp8_context_fmha enable

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
    --output_dir ./engine_outputs \
    --gemm_plugin float16 \
    --strongly_typed \
    --workers 2 \
    --use_fp8_context_fmha enable

Expected behavior

Expected the engine build to kick off.

Actual behavior

trtllm-build: error: unrecognized arguments: --use_fp8_context_fmha enable

Additional notes

There are many options for the trtllm-build command. Detailed documentation on those options would be useful for users to set the correct ones.

taozhang9527 avatar Apr 17 '24 06:04 taozhang9527

The flag use_fp8_context_fmha is not supported in v0.8.0; it was added in v0.9.0. Please try v0.9.0.

byshiue avatar Apr 19 '24 08:04 byshiue
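
A minimal sketch of the upgrade path (the pip package name tensorrt_llm and the NVIDIA extra index URL are assumptions; adjust for your install method):

# Check the currently installed version
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Upgrade to v0.9.0, where the flag was added
pip install tensorrt_llm==0.9.0 --extra-index-url https://pypi.nvidia.com

# Confirm the flag is now recognized
trtllm-build --help | grep use_fp8_context_fmha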

Yes, I tried 0.9.0 and it is supported now.

What is the relationship between --use_fp8_context_fmha and --context_fmha enable? If I use --use_fp8_context_fmha, do I still need --context_fmha enable?

In general, is there any documentation for these different build options?

taozhang9527 avatar Apr 29 '24 21:04 taozhang9527

Yes, you need to enable both. It is explained in https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-attention.md#fp8-context-fmha.

byshiue avatar May 09 '24 06:05 byshiue
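
Putting the answer together, a minimal sketch of the build command with both options enabled (paths and plugin choices reused from the reproduction steps above; see the linked doc for the exact constraints):

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
    --output_dir ./engine_outputs \
    --gemm_plugin float16 \
    --strongly_typed \
    --workers 2 \
    --context_fmha enable \
    --use_fp8_context_fmha enable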