Llama 3 70B FP8 engine build failed with FMHA
System Info
- AWS p5 (4 x H100 80GB GPUs)
- TensorRT-LLM v0.11.0
Who can help?
@byshiue @Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
python ./quantize.py --model_dir ./Meta-Llama-3-70B-Instruct --dtype bfloat16 --output_dir ./Meta-Llama-3-70B-Instruct_fp8 --calib_size 1024 --calib_dataset /home/triton-server/calibration --tp_size 4 --qformat fp8
trtllm-build --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine_fmha --gemm_plugin auto --workers 1 --use_paged_context_fmha enable --use_fp8_context_fmha enable --max_batch_size 16
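For completeness, a minimal sketch to confirm the quantized checkpoint was written as expected before building; the field names (dtype, quantization) assume the standard config.json layout that quantize.py produces:

```python
import json

# Sanity check (illustrative): print the settings that quantize.py wrote
# into the quantized checkpoint's config.json. Field names assume the
# usual TensorRT-LLM checkpoint format.
with open("./Meta-Llama-3-70B-Instruct_fp8/config.json") as f:
    config = json.load(f)

print(config.get("dtype"))              # expect "bfloat16"
print(config.get("quantization", {}))   # expect quant_algo == "FP8"
```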
Expected behavior
Engine is created successfully.
Actual behavior
Engine build fails with:
[08/18/2024-23:36:42] [TRT] [W] Detected layernorm nodes in FP16.
[08/18/2024-23:36:42] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[08/18/2024-23:36:42] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[08/18/2024-23:36:42] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: getIdx() should not be used with entry 16
(/workspace/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionPlugin/gptAttentionPlugin.cpp:127)
1 0x7fc56bf865ce /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x7f5ce) [0x7fc56bf865ce]
2 0x7fc56bf86df0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x7fdf0) [0x7fc56bf86df0]
3 0x7fc56c020d2e tensorrt_llm::plugins::GPTAttentionPlugin::supportsFormatCombination(int, nvinfer1::PluginTensorDesc const*, int, int) + 1118
4 0x7fc986625a14 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xafda14) [0x7fc986625a14]
5 0x7fc9868d3d33 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xdabd33) [0x7fc9868d3d33]
6 0x7fc98665ef2d /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xb36f2d) [0x7fc98665ef2d]
7 0x7fc986939abe /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe11abe) [0x7fc986939abe]
8 0x7fc9867e116f /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xcb916f) [0x7fc9867e116f]
9 0x7fc9867e9e0c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xcc1e0c) [0x7fc9867e9e0c]
10 0x7fc986926c19 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xdfec19) [0x7fc986926c19]
11 0x7fc98692e21c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe0621c) [0x7fc98692e21c]
12 0x7fc986930328 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe08328) [0x7fc986930328]
13 0x7fc98657f2ac /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa572ac) [0x7fc98657f2ac]
14 0x7fc986584501 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5c501) [0x7fc986584501]
15 0x7fc986584f0b /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5cf0b) [0x7fc986584f0b]
16 0x7fc92bca7458 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa7458) [0x7fc92bca7458]
17 0x7fc92bc458f3 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3) [0x7fc92bc458f3]
18 0x5578df496c9e /usr/bin/python3(+0x15ac9e) [0x5578df496c9e]
19 0x5578df48d3cb _PyObject_MakeTpCall + 603
20 0x5578df4a53eb /usr/bin/python3(+0x1693eb) [0x5578df4a53eb]
21 0x5578df48559a _PyEval_EvalFrameDefault + 25674
22 0x5578df49759c _PyFunction_Vectorcall + 124
23 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
24 0x5578df49759c _PyFunction_Vectorcall + 124
25 0x5578df47f96e _PyEval_EvalFrameDefault + 2078
26 0x5578df49759c _PyFunction_Vectorcall + 124
27 0x5578df47f827 _PyEval_EvalFrameDefault + 1751
28 0x5578df49759c _PyFunction_Vectorcall + 124
29 0x5578df4a5db2 PyObject_Call + 290
30 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
31 0x5578df49759c _PyFunction_Vectorcall + 124
32 0x5578df4a5db2 PyObject_Call + 290
33 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
34 0x5578df49759c _PyFunction_Vectorcall + 124
35 0x5578df4a5db2 PyObject_Call + 290
36 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
37 0x5578df49759c _PyFunction_Vectorcall + 124
38 0x5578df47f827 _PyEval_EvalFrameDefault + 1751
39 0x5578df47bf96 /usr/bin/python3(+0x13ff96) [0x5578df47bf96]
40 0x5578df571c66 PyEval_EvalCode + 134
41 0x5578df59cb38 /usr/bin/python3(+0x260b38) [0x5578df59cb38]
42 0x5578df5963fb /usr/bin/python3(+0x25a3fb) [0x5578df5963fb]
43 0x5578df59c885 /usr/bin/python3(+0x260885) [0x5578df59c885]
44 0x5578df59bd68 _PyRun_SimpleFileObject + 424
45 0x5578df59b9b3 _PyRun_AnyFileObject + 67
46 0x5578df58e45e Py_RunMain + 702
47 0x5578df564a3d Py_BytesMain + 45
48 0x7fc9ab1e4d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc9ab1e4d90]
49 0x7fc9ab1e4e40 __libc_start_main + 128
50 0x5578df564935 _start + 37
Additional notes
The engine build runs fine when I omit --use_paged_context_fmha enable and --use_fp8_context_fmha enable from the trtllm-build command.
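For reference, a sketch of the trtllm-build invocation that completes successfully: it is the same command as in the Reproduction section with the two FMHA flags removed (the output directory name here is only illustrative):

```bash
# Same checkpoint and settings as the failing command, but without
# --use_paged_context_fmha / --use_fp8_context_fmha. This build finishes
# without hitting the gptAttentionPlugin assertion.
trtllm-build \
    --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 \
    --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine \
    --gemm_plugin auto \
    --workers 1 \
    --max_batch_size 16
```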