Llama 3 70B FP8 engine build failed with FMHA
System Info
- AWS p5 (4 x H100 80GB GPUs)
- TensorRT-LLM v0.11.0
Who can help?
@byshiue @Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
python ./quantize.py --model_dir ./Meta-Llama-3-70B-Instruct --dtype bfloat16 --output_dir ./Meta-Llama-3-70B-Instruct_fp8 --calib_size 1024 --calib_dataset /home/triton-server/calibration --tp_size 4 --qformat fp8
trtllm-build --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine_fmha --gemm_plugin auto --workers 1 --use_paged_context_fmha enable --use_fp8_context_fmha enable --max_batch_size 16
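For completeness, a minimal sketch to confirm the quantized checkpoint was written as expected before building; the field names (dtype, quantization) assume the standard config.json layout that quantize.py produces:

```python
import json

# Sanity check (illustrative): print the settings that quantize.py wrote
# into the quantized checkpoint's config.json. Field names assume the
# usual TensorRT-LLM checkpoint format.
with open("./Meta-Llama-3-70B-Instruct_fp8/config.json") as f:
    config = json.load(f)

print(config.get("dtype"))              # expect "bfloat16"
print(config.get("quantization", {}))   # expect quant_algo == "FP8"
```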
Expected behavior
Engine is created successfully.
Actual behavior
Engine build fails with:
[08/18/2024-23:36:42] [TRT] [W] Detected layernorm nodes in FP16.
[08/18/2024-23:36:42] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[08/18/2024-23:36:42] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[08/18/2024-23:36:42] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: getIdx() should not be used with entry 16
(/workspace/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionPlugin/gptAttentionPlugin.cpp:127)
1 0x7fc56bf865ce /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x7f5ce) [0x7fc56bf865ce]
2 0x7fc56bf86df0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x7fdf0) [0x7fc56bf86df0]
3 0x7fc56c020d2e tensorrt_llm::plugins::GPTAttentionPlugin::supportsFormatCombination(int, nvinfer1::PluginTensorDesc const*, int, int) + 1118
4 0x7fc986625a14 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xafda14) [0x7fc986625a14]
5 0x7fc9868d3d33 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xdabd33) [0x7fc9868d3d33]
6 0x7fc98665ef2d /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xb36f2d) [0x7fc98665ef2d]
7 0x7fc986939abe /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe11abe) [0x7fc986939abe]
8 0x7fc9867e116f /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xcb916f) [0x7fc9867e116f]
9 0x7fc9867e9e0c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xcc1e0c) [0x7fc9867e9e0c]
10 0x7fc986926c19 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xdfec19) [0x7fc986926c19]
11 0x7fc98692e21c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe0621c) [0x7fc98692e21c]
12 0x7fc986930328 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe08328) [0x7fc986930328]
13 0x7fc98657f2ac /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa572ac) [0x7fc98657f2ac]
14 0x7fc986584501 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5c501) [0x7fc986584501]
15 0x7fc986584f0b /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5cf0b) [0x7fc986584f0b]
16 0x7fc92bca7458 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa7458) [0x7fc92bca7458]
17 0x7fc92bc458f3 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3) [0x7fc92bc458f3]
18 0x5578df496c9e /usr/bin/python3(+0x15ac9e) [0x5578df496c9e]
19 0x5578df48d3cb _PyObject_MakeTpCall + 603
20 0x5578df4a53eb /usr/bin/python3(+0x1693eb) [0x5578df4a53eb]
21 0x5578df48559a _PyEval_EvalFrameDefault + 25674
22 0x5578df49759c _PyFunction_Vectorcall + 124
23 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
24 0x5578df49759c _PyFunction_Vectorcall + 124
25 0x5578df47f96e _PyEval_EvalFrameDefault + 2078
26 0x5578df49759c _PyFunction_Vectorcall + 124
27 0x5578df47f827 _PyEval_EvalFrameDefault + 1751
28 0x5578df49759c _PyFunction_Vectorcall + 124
29 0x5578df4a5db2 PyObject_Call + 290
30 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
31 0x5578df49759c _PyFunction_Vectorcall + 124
32 0x5578df4a5db2 PyObject_Call + 290
33 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
34 0x5578df49759c _PyFunction_Vectorcall + 124
35 0x5578df4a5db2 PyObject_Call + 290
36 0x5578df481a9d _PyEval_EvalFrameDefault + 10573
37 0x5578df49759c _PyFunction_Vectorcall + 124
38 0x5578df47f827 _PyEval_EvalFrameDefault + 1751
39 0x5578df47bf96 /usr/bin/python3(+0x13ff96) [0x5578df47bf96]
40 0x5578df571c66 PyEval_EvalCode + 134
41 0x5578df59cb38 /usr/bin/python3(+0x260b38) [0x5578df59cb38]
42 0x5578df5963fb /usr/bin/python3(+0x25a3fb) [0x5578df5963fb]
43 0x5578df59c885 /usr/bin/python3(+0x260885) [0x5578df59c885]
44 0x5578df59bd68 _PyRun_SimpleFileObject + 424
45 0x5578df59b9b3 _PyRun_AnyFileObject + 67
46 0x5578df58e45e Py_RunMain + 702
47 0x5578df564a3d Py_BytesMain + 45
48 0x7fc9ab1e4d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc9ab1e4d90]
49 0x7fc9ab1e4e40 __libc_start_main + 128
50 0x5578df564935 _start + 37
Additional notes
The engine build runs fine when I omit --use_paged_context_fmha enable and --use_fp8_context_fmha enable from the trtllm-build command.
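For reference, a sketch of the trtllm-build invocation that completes successfully: it is the same command as in the Reproduction section with the two FMHA flags removed (the output directory name here is only illustrative):

```bash
# Same checkpoint and settings as the failing command, but without
# --use_paged_context_fmha / --use_fp8_context_fmha. This build finishes
# without hitting the gptAttentionPlugin assertion.
trtllm-build \
    --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 \
    --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine \
    --gemm_plugin auto \
    --workers 1 \
    --max_batch_size 16
```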