TensorRT-LLM
Fail to build Llama-3-70B-Instruct with w4a16
System Info
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.10.0.dev2024050700
A100 40G
Who can help?
@byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
set -ex
export MODEL_DIR=/mnt/memory
export MODEL_OUTPUT_DIR=$MODEL_DIR/tmp
export MODEL_NAME=Meta-Llama-3-70B-Instruct
export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/tensorrt/bin:$PATH
export DTYPE=float16
export PYTHONPATH=/app/tensorrt-llm:$PYTHONPATH
export WEIGHT_BIT=4
export PARALLEL_SIZE=1
export QUANTIZE=w${WEIGHT_BIT}a16
export WEIGHT_PRECISION=int${WEIGHT_BIT}
export QUANTIZE_PARAM="--use_weight_only --weight_only_precision $WEIGHT_PRECISION"
python ../llama/convert_checkpoint.py \
--model_dir $MODEL_DIR/${MODEL_NAME} \
--output_dir $MODEL_OUTPUT_DIR/trt_models/${MODEL_NAME}/$QUANTIZE/${PARALLEL_SIZE}-gpu \
--dtype $DTYPE $QUANTIZE_PARAM \
--load_model_on_cpu
trtllm-build \
--checkpoint_dir $MODEL_OUTPUT_DIR/trt_models/${MODEL_NAME}/$QUANTIZE/${PARALLEL_SIZE}-gpu \
--output_dir $MODEL_OUTPUT_DIR/trt_engines/${MODEL_NAME}/$QUANTIZE/${PARALLEL_SIZE}-gpu \
--gemm_plugin $DTYPE \
--gpt_attention_plugin $DTYPE \
--max_batch_size 1 \
--max_input_len 2048 \
--max_output_len 1024
Expected behavior
The engine builds successfully.
actual behavior
trtllm-build fails with the following error:
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/input_layernorm/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/input_layernorm/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/62/post_layernorm/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/60/post_layernorm/SHUFFLE_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/post_layernorm/ELEMENTWISE_POW_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/55/mlp/gate/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/69/input_layernorm/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/input_layernorm/SHUFFLE_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Detected layernorm nodes in FP16.
[05/20/2024-14:41:10] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[05/20/2024-14:41:10] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[05/20/2024-14:41:10] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/20/2024-14:41:22] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[05/20/2024-14:41:22] [TRT] [I] Detected 14 inputs and 1 output network tensors.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] TllmXqaJit runtime error in tllmXqaJitCreateAndCompileProgram(&program, &context): NVRTC Internal Error (/app/tensorrt-llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/compileEngine.cpp:65)
1 0x7fe4b37bb384 /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(+0x68f384) [0x7fe4b37bb384]
2 0x7fe4b38f9855 tensorrt_llm::kernels::jit::CompileEngine::compile() const + 165
3 0x7fe4b38faaf8 tensorrt_llm::kernels::DecoderXQAImplJIT::prepare(tensorrt_llm::kernels::XQAParams const&) + 296
4 0x7fe47c5fc508 /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x9b508) [0x7fe47c5fc508]
5 0x7fe47c61855a /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xb755a) [0x7fe47c61855a]
6 0x7fe57c888f38 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd87f38) [0x7fe57c888f38]
7 0x7fe57c88985c /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd8885c) [0x7fe57c88985c]
8 0x7fe57c902caf /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xe01caf) [0x7fe57c902caf]
9 0x7fe57c8db4e0 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xdda4e0) [0x7fe57c8db4e0]
10 0x7fe57c8e207c /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde107c) [0x7fe57c8e207c]
11 0x7fe57c8e4071 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde3071) [0x7fe57c8e4071]
12 0x7fe57c52961c /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2861c) [0x7fe57c52961c]
13 0x7fe57c52e837 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2d837) [0x7fe57c52e837]
14 0x7fe57c52f1af /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2e1af) [0x7fe57c52f1af]
15 0x7fe5293e2478 /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0xa6478) [0x7fe5293e2478]
16 0x7fe5293817a3 /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0x457a3) [0x7fe5293817a3]
17 0x53bd79 /app/venv_dev/bin/python3() [0x53bd79]
18 0x629d24 _PyObject_MakeTpCall + 356
19 0x549c2e /app/venv_dev/bin/python3() [0x549c2e]
20 0x5ae603 _PyEval_EvalFrameDefault + 19699
21 0x628d60 _PyFunction_Vectorcall + 592
22 0x628a3a PyObject_Call + 426
23 0x5ac51b _PyEval_EvalFrameDefault + 11275
24 0x628d60 _PyFunction_Vectorcall + 592
25 0x5aa025 _PyEval_EvalFrameDefault + 1813
26 0x628d60 _PyFunction_Vectorcall + 592
27 0x5a9c1b _PyEval_EvalFrameDefault + 779
28 0x628d60 _PyFunction_Vectorcall + 592
29 0x62893c PyObject_Call + 172
30 0x5ac51b _PyEval_EvalFrameDefault + 11275
31 0x628d60 _PyFunction_Vectorcall + 592
32 0x62893c PyObject_Call + 172
33 0x5ac51b _PyEval_EvalFrameDefault + 11275
34 0x628d60 _PyFunction_Vectorcall + 592
35 0x62893c PyObject_Call + 172
36 0x5ac51b _PyEval_EvalFrameDefault + 11275
37 0x628d60 _PyFunction_Vectorcall + 592
38 0x5a9c1b _PyEval_EvalFrameDefault + 779
39 0x5a8bf1 /app/venv_dev/bin/python3() [0x5a8bf1]
40 0x6d77cf PyEval_EvalCode + 127
41 0x6bb91b /app/venv_dev/bin/python3() [0x6bb91b]
42 0x6bb9a4 /app/venv_dev/bin/python3() [0x6bb9a4]
43 0x6bbde6 /app/venv_dev/bin/python3() [0x6bbde6]
44 0x6c0c84 _PyRun_SimpleFileObject + 404
45 0x6c0d57 _PyRun_AnyFileObject + 71
46 0x7042dd Py_RunMain + 877
47 0x7044bd Py_BytesMain + 45
48 0x7fe6e0245083 __libc_start_main + 243
49 0x62ff4e _start + 46
[tensorrt-llm-build-95p6x:320425] *** Process received signal ***
[tensorrt-llm-build-95p6x:320425] Signal: Aborted (6)
[tensorrt-llm-build-95p6x:320425] Signal code: (-6)
[tensorrt-llm-build-95p6x:320425] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe6e0264090]
[tensorrt-llm-build-95p6x:320425] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fe6e026400b]
[tensorrt-llm-build-95p6x:320425] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fe6e0243859]
[tensorrt-llm-build-95p6x:320425] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e8d1)[0x7fe63e03d8d1]
[tensorrt-llm-build-95p6x:320425] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c)[0x7fe63e04937c]
[tensorrt-llm-build-95p6x:320425] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9359)[0x7fe63e048359]
[tensorrt-llm-build-95p6x:320425] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7fe63e048d11]
[tensorrt-llm-build-95p6x:320425] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bff)[0x7fe6dfe0fbff]
[tensorrt-llm-build-95p6x:320425] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7fe6dfe105ba]
[tensorrt-llm-build-95p6x:320425] [ 9] /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(+0x68f3d2)[0x7fe4b37bb3d2]
[tensorrt-llm-build-95p6x:320425] [10] /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(_ZNK12tensorrt_llm7kernels3jit13CompileEngine7compileEv+0xa5)[0x7fe4b38f9855]
[tensorrt-llm-build-95p6x:320425] [11] /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7kernels17DecoderXQAImplJIT7prepareERKNS0_9XQAParamsE+0x128)[0x7fe4b38faaf8]
[tensorrt-llm-build-95p6x:320425] [12] /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x9b508)[0x7fe47c5fc508]
[tensorrt-llm-build-95p6x:320425] [13] /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xb755a)[0x7fe47c61855a]
[tensorrt-llm-build-95p6x:320425] [14] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd87f38)[0x7fe57c888f38]
[tensorrt-llm-build-95p6x:320425] [15] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd8885c)[0x7fe57c88985c]
[tensorrt-llm-build-95p6x:320425] [16] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xe01caf)[0x7fe57c902caf]
[tensorrt-llm-build-95p6x:320425] [17] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xdda4e0)[0x7fe57c8db4e0]
[tensorrt-llm-build-95p6x:320425] [18] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde107c)[0x7fe57c8e207c]
[tensorrt-llm-build-95p6x:320425] [19] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde3071)[0x7fe57c8e4071]
[tensorrt-llm-build-95p6x:320425] [20] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2861c)[0x7fe57c52961c]
[tensorrt-llm-build-95p6x:320425] [21] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2d837)[0x7fe57c52e837]
[tensorrt-llm-build-95p6x:320425] [22] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2e1af)[0x7fe57c52f1af]
[tensorrt-llm-build-95p6x:320425] [23] /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0xa6478)[0x7fe5293e2478]
[tensorrt-llm-build-95p6x:320425] [24] /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0x457a3)[0x7fe5293817a3]
[tensorrt-llm-build-95p6x:320425] [25] /app/venv_dev/bin/python3[0x53bd79]
[tensorrt-llm-build-95p6x:320425] [26] /app/venv_dev/bin/python3(_PyObject_MakeTpCall+0x164)[0x629d24]
[tensorrt-llm-build-95p6x:320425] [27] /app/venv_dev/bin/python3[0x549c2e]
[tensorrt-llm-build-95p6x:320425] [28] /app/venv_dev/bin/python3(_PyEval_EvalFrameDefault+0x4cf3)[0x5ae603]
[tensorrt-llm-build-95p6x:320425] [29] /app/venv_dev/bin/python3(_PyFunction_Vectorcall+0x250)[0x628d60]
[tensorrt-llm-build-95p6x:320425] *** End of error message ***
Aborted (core dumped)
additional notes
Is w4a16 supported for Meta-Llama-3-70B-Instruct right now?
Thanks for reporting the issue.
This issue has been fixed, and the fix will be included in a future update.
Actually, the issue should already have been fixed in last week's update (0514).
@gloritygithub11 could you try with tensorrt-llm 0.10.0.dev2024051400 and let us know whether the issue is still reproducible?
Thanks!
@ming-wei Thanks. I will try it.
@ming-wei I have synced to commit "Fix mistral v0.1 build instructions (#1373)"; now the conversion step fails with the following error:
python ../llama/convert_checkpoint.py --model_dir /mnt/memory/Meta-Llama-3-70B-Instruct --output_dir /app/models/tmp/trt_models/Meta-Llama-3-70B-Instruct/w4a16/1-gpu-tp --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024051400
0.11.0.dev2024051400
Traceback (most recent call last):
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 464, in <module>
main()
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 448, in main
convert_and_save_hf(args)
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 384, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 407, in execute
f(args, rank)
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 371, in convert_and_save_rank
llama = LLaMAForCausalLM.from_hugging_face(
File "/app/tensorrt-llm/tensorrt_llm/models/llama/model.py", line 280, in from_hugging_face
llama = convert.from_hugging_face(
File "/app/tensorrt-llm/tensorrt_llm/models/llama/convert.py", line 1337, in from_hugging_face
llama.load(weights)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 419, in load
raise RuntimeError(
RuntimeError: Expected but not provided tensors:{'transformer.layers.6.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.fc.per_channel_scale', 'transformer.layers.31.mlp.fc.per_channel_scale', 'transformer.layers.13.attention.dense.per_channel_scale', 'transformer.layers.26.attention.dense.per_channel_scale', 'transformer.layers.32.mlp.gate.per_channel_scale', 'transformer.layers.17.attention.qkv.per_channel_scale', 'transformer.layers.29.mlp.gate.per_channel_scale', 'transformer.layers.74.attention.qkv.per_channel_scale', 'transformer.layers.49.mlp.gate.per_channel_scale', 'transformer.layers.41.attention.dense.per_channel_scale', 'transformer.layers.22.attention.qkv.per_channel_scale', 'transformer.layers.21.mlp.proj.per_channel_scale', 'transformer.layers.11.attention.dense.per_channel_scale', 'transformer.layers.67.mlp.gate.per_channel_scale', 'transformer.layers.7.attention.qkv.per_channel_scale', 'transformer.layers.4.mlp.fc.per_channel_scale', 'transformer.layers.24.mlp.gate.per_channel_scale', 'transformer.layers.77.mlp.proj.per_channel_scale', 'transformer.layers.12.attention.qkv.per_channel_scale', 'transformer.layers.30.attention.qkv.per_channel_scale', 'transformer.layers.67.mlp.proj.per_channel_scale', 'transformer.layers.9.attention.dense.per_channel_scale', 'transformer.layers.47.mlp.fc.per_channel_scale', 'transformer.layers.68.attention.qkv.per_channel_scale', 'transformer.layers.59.attention.qkv.per_channel_scale', 'transformer.layers.19.mlp.proj.per_channel_scale', 'transformer.layers.58.mlp.fc.per_channel_scale', 'transformer.layers.36.attention.qkv.per_channel_scale', 'transformer.layers.78.mlp.fc.per_channel_scale', 'transformer.layers.70.mlp.proj.per_channel_scale', 'transformer.layers.15.attention.qkv.per_channel_scale', 'transformer.layers.6.attention.qkv.per_channel_scale', 'transformer.layers.43.mlp.proj.per_channel_scale', 'transformer.layers.67.attention.dense.per_channel_scale', 
'transformer.layers.28.mlp.proj.per_channel_scale', 'transformer.layers.21.mlp.gate.per_channel_scale', 'transformer.layers.73.mlp.gate.per_channel_scale', 'transformer.layers.29.mlp.proj.per_channel_scale', 'transformer.layers.59.mlp.fc.per_channel_scale', 'transformer.layers.60.mlp.gate.per_channel_scale', 'transformer.layers.4.attention.dense.per_channel_scale', 'transformer.layers.79.mlp.gate.per_channel_scale', 'transformer.layers.58.mlp.proj.per_channel_scale', 'transformer.layers.16.mlp.proj.per_channel_scale', 'transformer.layers.74.mlp.proj.per_channel_scale', 'transformer.layers.48.attention.qkv.per_channel_scale', 'transformer.layers.54.attention.dense.per_channel_scale', 'transformer.layers.70.mlp.gate.per_channel_scale', 'transformer.layers.48.mlp.proj.per_channel_scale', 'transformer.layers.65.attention.qkv.per_channel_scale', 'transformer.layers.48.mlp.gate.per_channel_scale', 'transformer.layers.48.mlp.fc.per_channel_scale', 'transformer.layers.78.attention.qkv.per_channel_scale', 'transformer.layers.71.mlp.gate.per_channel_scale', 'transformer.layers.5.attention.dense.per_channel_scale', 'transformer.layers.62.mlp.proj.per_channel_scale', 'transformer.layers.26.mlp.proj.per_channel_scale', 'transformer.layers.40.mlp.gate.per_channel_scale', 'transformer.layers.60.attention.qkv.per_channel_scale', 'transformer.layers.21.attention.dense.per_channel_scale', 'transformer.layers.48.attention.dense.per_channel_scale', 'transformer.layers.0.attention.qkv.per_channel_scale', 'transformer.layers.32.mlp.proj.per_channel_scale', 'transformer.layers.20.mlp.fc.per_channel_scale', 'transformer.layers.9.mlp.proj.per_channel_scale', 'transformer.layers.12.mlp.gate.per_channel_scale', 'transformer.layers.49.attention.dense.per_channel_scale', 'transformer.layers.72.mlp.proj.per_channel_scale', 'transformer.layers.22.mlp.gate.per_channel_scale', 'transformer.layers.8.mlp.fc.per_channel_scale', 'transformer.layers.30.mlp.fc.per_channel_scale', 
'transformer.layers.3.attention.qkv.per_channel_scale', 'transformer.layers.76.attention.qkv.per_channel_scale', 'transformer.layers.69.attention.qkv.per_channel_scale', 'transformer.layers.8.mlp.gate.per_channel_scale', 'transformer.layers.63.mlp.gate.per_channel_scale', 'transformer.layers.24.mlp.fc.per_channel_scale', 'transformer.layers.43.attention.qkv.per_channel_scale', 'transformer.layers.35.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.fc.per_channel_scale', 'transformer.layers.17.mlp.fc.per_channel_scale', 'transformer.layers.79.attention.dense.per_channel_scale', 'transformer.layers.64.attention.qkv.per_channel_scale', 'transformer.layers.7.mlp.fc.per_channel_scale', 'transformer.layers.77.attention.dense.per_channel_scale', 'transformer.layers.66.attention.qkv.per_channel_scale', 'transformer.layers.50.mlp.fc.per_channel_scale', 'transformer.layers.31.mlp.proj.per_channel_scale', 'transformer.layers.25.attention.dense.per_channel_scale', 'transformer.layers.12.mlp.fc.per_channel_scale', 'transformer.layers.71.mlp.fc.per_channel_scale', 'transformer.layers.50.attention.dense.per_channel_scale', 'transformer.layers.16.attention.dense.per_channel_scale', 'transformer.layers.61.attention.qkv.per_channel_scale', 'transformer.layers.72.mlp.fc.per_channel_scale', 'transformer.layers.77.mlp.gate.per_channel_scale', 'transformer.layers.11.attention.qkv.per_channel_scale', 'transformer.layers.34.mlp.gate.per_channel_scale', 'transformer.layers.13.attention.qkv.per_channel_scale', 'transformer.layers.6.mlp.gate.per_channel_scale', 'transformer.layers.40.mlp.proj.per_channel_scale', 'transformer.layers.50.mlp.gate.per_channel_scale', 'transformer.layers.52.mlp.fc.per_channel_scale', 'transformer.layers.65.mlp.gate.per_channel_scale', 'transformer.layers.74.attention.dense.per_channel_scale', 'transformer.layers.45.mlp.gate.per_channel_scale', 'transformer.layers.69.mlp.proj.per_channel_scale', 'transformer.layers.41.mlp.fc.per_channel_scale', 
'transformer.layers.53.mlp.fc.per_channel_scale', 'transformer.layers.17.mlp.proj.per_channel_scale', 'transformer.layers.56.mlp.gate.per_channel_scale', 'transformer.layers.2.attention.dense.per_channel_scale', 'transformer.layers.61.mlp.fc.per_channel_scale', 'transformer.layers.8.mlp.proj.per_channel_scale', 'transformer.layers.23.mlp.proj.per_channel_scale', 'transformer.layers.37.mlp.gate.per_channel_scale', 'transformer.layers.16.attention.qkv.per_channel_scale', 'transformer.layers.30.mlp.proj.per_channel_scale', 'transformer.layers.44.mlp.gate.per_channel_scale', 'transformer.layers.76.mlp.gate.per_channel_scale', 'transformer.layers.41.mlp.gate.per_channel_scale', 'transformer.layers.62.mlp.fc.per_channel_scale', 'transformer.layers.59.mlp.proj.per_channel_scale', 'transformer.layers.44.mlp.fc.per_channel_scale', 'transformer.layers.13.mlp.fc.per_channel_scale', 'transformer.layers.64.mlp.fc.per_channel_scale', 'transformer.layers.29.attention.dense.per_channel_scale', 'transformer.layers.14.mlp.proj.per_channel_scale', 'transformer.layers.17.mlp.gate.per_channel_scale', 'transformer.layers.32.mlp.fc.per_channel_scale', 'transformer.layers.15.mlp.gate.per_channel_scale', 'transformer.layers.33.mlp.fc.per_channel_scale', 'transformer.layers.2.attention.qkv.per_channel_scale', 'transformer.layers.5.mlp.gate.per_channel_scale', 'transformer.layers.4.attention.qkv.per_channel_scale', 'transformer.layers.55.mlp.proj.per_channel_scale', 'transformer.layers.39.mlp.gate.per_channel_scale', 'transformer.layers.34.attention.qkv.per_channel_scale', 'transformer.layers.3.mlp.fc.per_channel_scale', 'transformer.layers.31.attention.dense.per_channel_scale', 'transformer.layers.54.attention.qkv.per_channel_scale', 'transformer.layers.24.mlp.proj.per_channel_scale', 'transformer.layers.25.attention.qkv.per_channel_scale', 'transformer.layers.23.attention.qkv.per_channel_scale', 'transformer.layers.11.mlp.proj.per_channel_scale', 
'transformer.layers.52.mlp.proj.per_channel_scale', 'transformer.layers.53.attention.qkv.per_channel_scale', 'transformer.layers.76.attention.dense.per_channel_scale', 'transformer.layers.10.attention.dense.per_channel_scale', 'transformer.layers.76.mlp.fc.per_channel_scale', 'transformer.layers.18.mlp.gate.per_channel_scale', 'transformer.layers.75.attention.qkv.per_channel_scale', 'transformer.layers.36.attention.dense.per_channel_scale', 'transformer.layers.42.attention.qkv.per_channel_scale', 'transformer.layers.23.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.proj.per_channel_scale', 'transformer.layers.51.mlp.proj.per_channel_scale', 'transformer.layers.55.attention.dense.per_channel_scale', 'transformer.layers.33.mlp.proj.per_channel_scale', 'transformer.layers.0.attention.dense.per_channel_scale', 'transformer.layers.22.mlp.fc.per_channel_scale', 'transformer.layers.46.mlp.fc.per_channel_scale', 'transformer.layers.39.mlp.proj.per_channel_scale', 'transformer.layers.43.mlp.gate.per_channel_scale', 'transformer.layers.45.mlp.fc.per_channel_scale', 'transformer.layers.65.mlp.fc.per_channel_scale', 'transformer.layers.60.attention.dense.per_channel_scale', 'transformer.layers.36.mlp.fc.per_channel_scale', 'transformer.layers.72.attention.dense.per_channel_scale', 'transformer.layers.55.mlp.fc.per_channel_scale', 'transformer.layers.74.mlp.fc.per_channel_scale', 'transformer.layers.18.mlp.proj.per_channel_scale', 'transformer.layers.18.attention.qkv.per_channel_scale', 'transformer.layers.51.mlp.gate.per_channel_scale', 'transformer.layers.44.attention.qkv.per_channel_scale', 'transformer.layers.79.attention.qkv.per_channel_scale', 'transformer.layers.0.mlp.fc.per_channel_scale', 'transformer.layers.33.attention.dense.per_channel_scale', 'transformer.layers.2.mlp.proj.per_channel_scale', 'transformer.layers.34.mlp.proj.per_channel_scale', 'transformer.layers.74.mlp.gate.per_channel_scale', 'transformer.layers.5.mlp.proj.per_channel_scale', 
'transformer.layers.59.mlp.gate.per_channel_scale', 'transformer.layers.7.mlp.proj.per_channel_scale', 'transformer.layers.49.mlp.fc.per_channel_scale', 'transformer.layers.71.attention.dense.per_channel_scale', 'transformer.layers.71.mlp.proj.per_channel_scale', 'transformer.layers.76.mlp.proj.per_channel_scale', 'transformer.layers.60.mlp.fc.per_channel_scale', 'transformer.layers.7.mlp.gate.per_channel_scale', 'transformer.layers.67.attention.qkv.per_channel_scale', 'transformer.layers.75.attention.dense.per_channel_scale', 'transformer.layers.75.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.gate.per_channel_scale', 'transformer.layers.61.mlp.proj.per_channel_scale', 'transformer.layers.28.attention.qkv.per_channel_scale', 'transformer.layers.51.attention.dense.per_channel_scale', 'transformer.layers.40.mlp.fc.per_channel_scale', 'transformer.layers.29.attention.qkv.per_channel_scale', 'transformer.layers.36.mlp.proj.per_channel_scale', 'transformer.layers.20.mlp.gate.per_channel_scale', 'transformer.layers.58.mlp.gate.per_channel_scale', 'transformer.layers.64.mlp.proj.per_channel_scale', 'transformer.layers.33.attention.qkv.per_channel_scale', 'transformer.layers.41.attention.qkv.per_channel_scale', 'transformer.layers.0.mlp.proj.per_channel_scale', 'transformer.layers.14.attention.qkv.per_channel_scale', 'transformer.layers.10.mlp.fc.per_channel_scale', 'transformer.layers.47.attention.qkv.per_channel_scale', 'transformer.layers.53.attention.dense.per_channel_scale', 'transformer.layers.39.attention.qkv.per_channel_scale', 'transformer.layers.34.mlp.fc.per_channel_scale', 'transformer.layers.20.attention.dense.per_channel_scale', 'transformer.layers.42.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.proj.per_channel_scale', 'transformer.layers.47.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.gate.per_channel_scale', 'transformer.layers.9.mlp.gate.per_channel_scale', 'transformer.layers.39.attention.dense.per_channel_scale', 
'transformer.layers.38.mlp.gate.per_channel_scale', 'transformer.layers.55.mlp.gate.per_channel_scale', 'transformer.layers.39.mlp.fc.per_channel_scale', 'transformer.layers.28.attention.dense.per_channel_scale', 'transformer.layers.3.attention.dense.per_channel_scale', 'transformer.layers.60.mlp.proj.per_channel_scale', 'transformer.layers.30.attention.dense.per_channel_scale', 'transformer.layers.42.mlp.fc.per_channel_scale', 'transformer.layers.58.attention.qkv.per_channel_scale', 'transformer.layers.9.mlp.fc.per_channel_scale', 'transformer.layers.10.attention.qkv.per_channel_scale', 'transformer.layers.4.mlp.proj.per_channel_scale', 'transformer.layers.43.attention.dense.per_channel_scale', 'transformer.layers.43.mlp.fc.per_channel_scale', 'transformer.layers.49.attention.qkv.per_channel_scale', 'transformer.layers.58.attention.dense.per_channel_scale', 'transformer.layers.19.mlp.gate.per_channel_scale', 'transformer.layers.37.attention.qkv.per_channel_scale', 'transformer.layers.19.attention.dense.per_channel_scale', 'transformer.layers.69.mlp.gate.per_channel_scale', 'transformer.layers.32.attention.qkv.per_channel_scale', 'transformer.layers.45.mlp.proj.per_channel_scale', 'transformer.layers.51.mlp.fc.per_channel_scale', 'transformer.layers.35.mlp.proj.per_channel_scale', 'transformer.layers.54.mlp.fc.per_channel_scale', 'transformer.layers.35.attention.dense.per_channel_scale', 'transformer.layers.61.mlp.gate.per_channel_scale', 'transformer.layers.65.mlp.proj.per_channel_scale', 'transformer.layers.38.mlp.proj.per_channel_scale', 'transformer.layers.2.mlp.fc.per_channel_scale', 'transformer.layers.23.mlp.fc.per_channel_scale', 'transformer.layers.75.mlp.fc.per_channel_scale', 'transformer.layers.47.attention.dense.per_channel_scale', 'transformer.layers.29.mlp.fc.per_channel_scale', 'transformer.layers.69.attention.dense.per_channel_scale', 'transformer.layers.68.mlp.fc.per_channel_scale', 'transformer.layers.42.attention.dense.per_channel_scale', 
'transformer.layers.13.mlp.proj.per_channel_scale', 'transformer.layers.26.mlp.fc.per_channel_scale', 'transformer.layers.66.attention.dense.per_channel_scale', 'transformer.layers.1.attention.dense.per_channel_scale', 'transformer.layers.63.mlp.proj.per_channel_scale', 'transformer.layers.62.attention.dense.per_channel_scale', 'transformer.layers.8.attention.dense.per_channel_scale', 'transformer.layers.57.mlp.gate.per_channel_scale', 'transformer.layers.46.mlp.proj.per_channel_scale', 'transformer.layers.27.mlp.proj.per_channel_scale', 'transformer.layers.54.mlp.proj.per_channel_scale', 'transformer.layers.38.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.proj.per_channel_scale', 'transformer.layers.66.mlp.fc.per_channel_scale', 'transformer.layers.64.attention.dense.per_channel_scale', 'transformer.layers.23.attention.dense.per_channel_scale', 'transformer.layers.26.mlp.gate.per_channel_scale', 'transformer.layers.20.mlp.proj.per_channel_scale', 'transformer.layers.5.attention.qkv.per_channel_scale', 'transformer.layers.28.mlp.gate.per_channel_scale', 'transformer.layers.47.mlp.proj.per_channel_scale', 'transformer.layers.0.mlp.gate.per_channel_scale', 'transformer.layers.75.mlp.proj.per_channel_scale', 'transformer.layers.14.mlp.fc.per_channel_scale', 'transformer.layers.25.mlp.fc.per_channel_scale', 'transformer.layers.37.mlp.proj.per_channel_scale', 'transformer.layers.56.mlp.proj.per_channel_scale', 'transformer.layers.18.attention.dense.per_channel_scale', 'transformer.layers.38.attention.qkv.per_channel_scale', 'transformer.layers.35.attention.qkv.per_channel_scale', 'transformer.layers.32.attention.dense.per_channel_scale', 'transformer.layers.71.attention.qkv.per_channel_scale', 'transformer.layers.66.mlp.proj.per_channel_scale', 'transformer.layers.19.attention.qkv.per_channel_scale', 'transformer.layers.72.attention.qkv.per_channel_scale', 'transformer.layers.46.attention.dense.per_channel_scale', 
'transformer.layers.35.mlp.fc.per_channel_scale', 'transformer.layers.77.mlp.fc.per_channel_scale', 'transformer.layers.56.attention.dense.per_channel_scale', 'transformer.layers.31.attention.qkv.per_channel_scale', 'transformer.layers.63.mlp.fc.per_channel_scale', 'transformer.layers.68.mlp.proj.per_channel_scale', 'transformer.layers.37.mlp.fc.per_channel_scale', 'transformer.layers.37.attention.dense.per_channel_scale', 'transformer.layers.61.attention.dense.per_channel_scale', 'transformer.layers.10.mlp.proj.per_channel_scale', 'transformer.layers.27.mlp.gate.per_channel_scale', 'transformer.layers.26.attention.qkv.per_channel_scale', 'transformer.layers.40.attention.qkv.per_channel_scale', 'transformer.layers.22.attention.dense.per_channel_scale', 'transformer.layers.70.attention.qkv.per_channel_scale', 'transformer.layers.16.mlp.fc.per_channel_scale', 'transformer.layers.7.attention.dense.per_channel_scale', 'transformer.layers.46.mlp.gate.per_channel_scale', 'transformer.layers.64.mlp.gate.per_channel_scale', 'transformer.layers.68.mlp.gate.per_channel_scale', 'transformer.layers.19.mlp.fc.per_channel_scale', 'transformer.layers.63.attention.dense.per_channel_scale', 'transformer.layers.44.attention.dense.per_channel_scale', 'transformer.layers.27.mlp.fc.per_channel_scale', 'transformer.layers.57.mlp.proj.per_channel_scale', 'transformer.layers.11.mlp.fc.per_channel_scale', 'transformer.layers.24.attention.qkv.per_channel_scale', 'transformer.layers.18.mlp.fc.per_channel_scale', 'transformer.layers.56.mlp.fc.per_channel_scale', 'transformer.layers.53.mlp.gate.per_channel_scale', 'transformer.layers.4.mlp.gate.per_channel_scale', 'transformer.layers.70.attention.dense.per_channel_scale', 'transformer.layers.54.mlp.gate.per_channel_scale', 'transformer.layers.14.attention.dense.per_channel_scale', 'transformer.layers.13.mlp.gate.per_channel_scale', 'transformer.layers.79.mlp.fc.per_channel_scale', 'transformer.layers.50.mlp.proj.per_channel_scale', 
'transformer.layers.73.mlp.fc.per_channel_scale', 'transformer.layers.62.mlp.gate.per_channel_scale', 'transformer.layers.57.mlp.fc.per_channel_scale', 'transformer.layers.45.attention.dense.per_channel_scale', 'transformer.layers.56.attention.qkv.per_channel_scale', 'transformer.layers.3.mlp.gate.per_channel_scale', 'transformer.layers.2.mlp.gate.per_channel_scale', 'transformer.layers.63.attention.qkv.per_channel_scale', 'transformer.layers.67.mlp.fc.per_channel_scale', 'transformer.layers.12.mlp.proj.per_channel_scale', 'transformer.layers.22.mlp.proj.per_channel_scale', 'transformer.layers.77.attention.qkv.per_channel_scale', 'transformer.layers.8.attention.qkv.per_channel_scale', 'transformer.layers.65.attention.dense.per_channel_scale', 'transformer.layers.38.mlp.fc.per_channel_scale', 'transformer.layers.72.mlp.gate.per_channel_scale', 'transformer.layers.46.attention.qkv.per_channel_scale', 'transformer.layers.57.attention.dense.per_channel_scale', 'transformer.layers.52.attention.dense.per_channel_scale', 'transformer.layers.17.attention.dense.per_channel_scale', 'transformer.layers.45.attention.qkv.per_channel_scale', 'transformer.layers.73.attention.dense.per_channel_scale', 'transformer.layers.62.attention.qkv.per_channel_scale', 'transformer.layers.41.mlp.proj.per_channel_scale', 'transformer.layers.14.mlp.gate.per_channel_scale', 'transformer.layers.11.mlp.gate.per_channel_scale', 'transformer.layers.28.mlp.fc.per_channel_scale', 'transformer.layers.52.attention.qkv.per_channel_scale', 'transformer.layers.34.attention.dense.per_channel_scale', 'transformer.layers.24.attention.dense.per_channel_scale', 'transformer.layers.6.mlp.proj.per_channel_scale', 'transformer.layers.70.mlp.fc.per_channel_scale', 'transformer.layers.15.attention.dense.per_channel_scale', 'transformer.layers.55.attention.qkv.per_channel_scale', 'transformer.layers.53.mlp.proj.per_channel_scale', 'transformer.layers.50.attention.qkv.per_channel_scale', 
'transformer.layers.20.attention.qkv.per_channel_scale', 'transformer.layers.78.mlp.proj.per_channel_scale', 'transformer.layers.69.mlp.fc.per_channel_scale', 'transformer.layers.27.attention.qkv.per_channel_scale', 'transformer.layers.21.mlp.fc.per_channel_scale', 'transformer.layers.16.mlp.gate.per_channel_scale', 'transformer.layers.59.attention.dense.per_channel_scale', 'transformer.layers.73.attention.qkv.per_channel_scale', 'transformer.layers.27.attention.dense.per_channel_scale', 'transformer.layers.51.attention.qkv.per_channel_scale', 'transformer.layers.66.mlp.gate.per_channel_scale', 'transformer.layers.44.mlp.proj.per_channel_scale', 'transformer.layers.78.attention.dense.per_channel_scale', 'transformer.layers.68.attention.dense.per_channel_scale', 'transformer.layers.9.attention.qkv.per_channel_scale', 'transformer.layers.52.mlp.gate.per_channel_scale', 'transformer.layers.21.attention.qkv.per_channel_scale', 'transformer.layers.79.mlp.proj.per_channel_scale', 'transformer.layers.6.mlp.fc.per_channel_scale', 'transformer.layers.30.mlp.gate.per_channel_scale', 'transformer.layers.10.mlp.gate.per_channel_scale', 'transformer.layers.31.mlp.gate.per_channel_scale', 'transformer.layers.78.mlp.gate.per_channel_scale', 'transformer.layers.40.attention.dense.per_channel_scale', 'transformer.layers.33.mlp.gate.per_channel_scale', 'transformer.layers.3.mlp.proj.per_channel_scale', 'transformer.layers.12.attention.dense.per_channel_scale', 'transformer.layers.1.attention.qkv.per_channel_scale', 'transformer.layers.49.mlp.proj.per_channel_scale', 'transformer.layers.73.mlp.proj.per_channel_scale', 'transformer.layers.36.mlp.gate.per_channel_scale', 'transformer.layers.57.attention.qkv.per_channel_scale', 'transformer.layers.42.mlp.proj.per_channel_scale', 'transformer.layers.5.mlp.fc.per_channel_scale'}
Exception ignored in: <function PretrainedModel.__del__ at 0x7f0efac415a0>
Traceback (most recent call last):
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 377, in __del__
self.release()
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 374, in release
release_gc()
File "/app/tensorrt-llm/tensorrt_llm/_utils.py", line 443, in release_gc
torch.cuda.ipc_collect()
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 813, in ipc_collect
_lazy_init()
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable
CUDA call was originally invoked at:
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 9, in <module>
import tensorrt_llm
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/__init__.py", line 32, in <module>
import tensorrt_llm.functional as functional
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/functional.py", line 28, in <module>
from . import graph_rewriting as gw
File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/graph_rewriting.py", line 12, in <module>
from .network import Network
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/network.py", line 26, in <module>
from tensorrt_llm.module import Module
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/module.py", line 17, in <module>
from ._common import default_net
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/_common.py", line 26, in <module>
import torch
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/venv_dev/lib/python3.10/site-packages/torch/__init__.py", line 1427, in <module>
_C._initExtension(manager_path())
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1303, in <module>
_lazy_call(_register_triton_kernels)
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
@byshiue This seems to be a different error, not related to XQA. Could you help triage and reroute the issue?
Thanks.
I also met this before in https://github.com/NVIDIA/TensorRT-LLM/issues/1628
I don't have the 70B model right now, but I tried the 8B model and it works well under int4 weight-only:
python3 examples/llama/convert_checkpoint.py --model_dir llama-v3-8b-instruct-hf/ --output_dir /tmp/llama-v3/ --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu
[05/23/2024-09:12:44] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
0.11.0.dev2024052100
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.44it/s]
Weights loaded. Total time: 00:02:01
Total time of converting checkpoints: 00:02:14
Could you give it a try?
@byshiue I had tried 8B before and got the same error. I noticed that you are using the newer version 0.11.0.dev2024052100; I will try that version.
@byshiue
I synced the code to Update TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/pull/1639) and still get the same error. It fails when checking that the loaded model contains quantized parameters like transformer.layers.0.attention.qkv.per_channel_scale.
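For illustration, that completeness check amounts to comparing the tensor names present in the converted checkpoint against the names the quantized model expects. A rough, hypothetical sketch of the idea (the helper and names below are my own, not TensorRT-LLM's actual API):

```python
# Hypothetical sketch: report which per-channel weight-only scale tensors
# a converted checkpoint is missing, mirroring the shape of the
# "Required but not provided tensors" error above.
def missing_scales(present: set[str], num_layers: int) -> list[str]:
    # Each quantized linear module is expected to ship a per_channel_scale.
    expected = {
        f"transformer.layers.{i}.{mod}.per_channel_scale"
        for i in range(num_layers)
        for mod in ("attention.qkv", "attention.dense",
                    "mlp.fc", "mlp.gate", "mlp.proj")
    }
    return sorted(expected - present)

# e.g. a 1-layer checkpoint that only carries the qkv scale:
present = {"transformer.layers.0.attention.qkv.per_channel_scale"}
print(missing_scales(present, 1))
```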
python ../llama/convert_checkpoint.py --model_dir /mnt/memory/Meta-Llama-3-8B-Instruct --output_dir /mnt/memory/tmp/trt_models/Meta-Llama-3-8B-Instruct/w4a16/1-gpu-tp --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
0.11.0.dev2024052100
Traceback (most recent call last):
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 471, in <module>
main()
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 455, in main
convert_and_save_hf(args)
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 391, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 414, in execute
f(args, rank)
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 378, in convert_and_save_rank
llama = LLaMAForCausalLM.from_hugging_face(
File "/app/tensorrt-llm/tensorrt_llm/models/llama/model.py", line 280, in from_hugging_face
llama = convert.from_hugging_face(
File "/app/tensorrt-llm/tensorrt_llm/models/llama/convert.py", line 1337, in from_hugging_face
llama.load(weights)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 421, in load
raise RuntimeError(
RuntimeError: Required but not provided tensors:{'transformer.layers.13.attention.dense.per_channel_scale', 'transformer.layers.0.attention.qkv.per_channel_scale', 'transformer.layers.27.attention.dense.per_channel_scale', 'transformer.layers.24.attention.dense.per_channel_scale', 'transformer.layers.0.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.fc.per_channel_scale', 'transformer.layers.9.mlp.proj.per_channel_scale', 'transformer.layers.22.attention.dense.per_channel_scale', 'transformer.layers.8.mlp.fc.per_channel_scale', 'transformer.layers.11.mlp.proj.per_channel_scale', 'transformer.layers.5.attention.qkv.per_channel_scale', 'transformer.layers.23.mlp.proj.per_channel_scale', 'transformer.layers.21.mlp.gate.per_channel_scale', 'transformer.layers.12.mlp.proj.per_channel_scale', 'transformer.layers.20.attention.dense.per_channel_scale', 'transformer.layers.8.mlp.gate.per_channel_scale', 'transformer.layers.24.attention.qkv.per_channel_scale', 'transformer.layers.29.mlp.proj.per_channel_scale', 'transformer.layers.14.mlp.proj.per_channel_scale', 'transformer.layers.19.mlp.proj.per_channel_scale', 'transformer.layers.18.mlp.proj.per_channel_scale', 'transformer.layers.28.mlp.fc.per_channel_scale', 'transformer.layers.20.mlp.fc.per_channel_scale', 'transformer.layers.27.mlp.fc.per_channel_scale', 'transformer.layers.23.mlp.gate.per_channel_scale', 'transformer.layers.4.attention.qkv.per_channel_scale', 'transformer.layers.9.mlp.fc.per_channel_scale', 'transformer.layers.1.mlp.fc.per_channel_scale', 'transformer.layers.14.mlp.gate.per_channel_scale', 'transformer.layers.29.mlp.fc.per_channel_scale', 'transformer.layers.4.mlp.fc.per_channel_scale', 'transformer.layers.13.attention.qkv.per_channel_scale', 'transformer.layers.6.mlp.gate.per_channel_scale', 'transformer.layers.13.mlp.proj.per_channel_scale', 'transformer.layers.23.attention.dense.per_channel_scale', 'transformer.layers.28.attention.qkv.per_channel_scale', 
'transformer.layers.16.attention.dense.per_channel_scale', 'transformer.layers.12.mlp.gate.per_channel_scale', 'transformer.layers.14.mlp.fc.per_channel_scale', 'transformer.layers.27.mlp.proj.per_channel_scale', 'transformer.layers.21.attention.qkv.per_channel_scale', 'transformer.layers.5.mlp.fc.per_channel_scale', 'transformer.layers.4.attention.dense.per_channel_scale', 'transformer.layers.4.mlp.proj.per_channel_scale', 'transformer.layers.18.attention.qkv.per_channel_scale', 'transformer.layers.18.attention.dense.per_channel_scale', 'transformer.layers.31.mlp.proj.per_channel_scale', 'transformer.layers.2.mlp.fc.per_channel_scale', 'transformer.layers.3.mlp.proj.per_channel_scale', 'transformer.layers.6.mlp.proj.per_channel_scale', 'transformer.layers.7.attention.qkv.per_channel_scale', 'transformer.layers.30.mlp.fc.per_channel_scale', 'transformer.layers.15.attention.qkv.per_channel_scale', 'transformer.layers.22.attention.qkv.per_channel_scale', 'transformer.layers.0.mlp.proj.per_channel_scale', 'transformer.layers.12.mlp.fc.per_channel_scale', 'transformer.layers.16.attention.qkv.per_channel_scale', 'transformer.layers.31.attention.dense.per_channel_scale', 'transformer.layers.18.mlp.gate.per_channel_scale', 'transformer.layers.8.attention.qkv.per_channel_scale', 'transformer.layers.2.attention.qkv.per_channel_scale', 'transformer.layers.24.mlp.proj.per_channel_scale', 'transformer.layers.1.attention.dense.per_channel_scale', 'transformer.layers.22.mlp.proj.per_channel_scale', 'transformer.layers.29.attention.qkv.per_channel_scale', 'transformer.layers.29.mlp.gate.per_channel_scale', 'transformer.layers.7.mlp.proj.per_channel_scale', 'transformer.layers.11.mlp.gate.per_channel_scale', 'transformer.layers.28.attention.dense.per_channel_scale', 'transformer.layers.31.attention.qkv.per_channel_scale', 'transformer.layers.10.mlp.fc.per_channel_scale', 'transformer.layers.28.mlp.proj.per_channel_scale', 'transformer.layers.28.mlp.gate.per_channel_scale', 
'transformer.layers.6.mlp.fc.per_channel_scale', 'transformer.layers.14.attention.dense.per_channel_scale', 'transformer.layers.25.attention.qkv.per_channel_scale', 'transformer.layers.4.mlp.gate.per_channel_scale', 'transformer.layers.11.attention.qkv.per_channel_scale', 'transformer.layers.18.mlp.fc.per_channel_scale', 'transformer.layers.8.attention.dense.per_channel_scale', 'transformer.layers.6.attention.dense.per_channel_scale', 'transformer.layers.5.mlp.gate.per_channel_scale', 'transformer.layers.0.attention.dense.per_channel_scale', 'transformer.layers.23.attention.qkv.per_channel_scale', 'transformer.layers.19.mlp.fc.per_channel_scale', 'transformer.layers.2.mlp.proj.per_channel_scale', 'transformer.layers.16.mlp.fc.per_channel_scale', 'transformer.layers.10.attention.qkv.per_channel_scale', 'transformer.layers.3.attention.qkv.per_channel_scale', 'transformer.layers.7.mlp.gate.per_channel_scale', 'transformer.layers.30.attention.dense.per_channel_scale', 'transformer.layers.20.mlp.gate.per_channel_scale', 'transformer.layers.15.attention.dense.per_channel_scale', 'transformer.layers.14.attention.qkv.per_channel_scale', 'transformer.layers.15.mlp.fc.per_channel_scale', 'transformer.layers.30.mlp.proj.per_channel_scale', 'transformer.layers.24.mlp.gate.per_channel_scale', 'transformer.layers.8.mlp.proj.per_channel_scale', 'transformer.layers.12.attention.qkv.per_channel_scale', 'transformer.layers.19.attention.dense.per_channel_scale', 'transformer.layers.1.attention.qkv.per_channel_scale', 'transformer.layers.19.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.gate.per_channel_scale', 'transformer.layers.26.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.gate.per_channel_scale', 'transformer.layers.27.attention.qkv.per_channel_scale', 'transformer.layers.2.mlp.gate.per_channel_scale', 'transformer.layers.9.attention.qkv.per_channel_scale', 'transformer.layers.9.attention.dense.per_channel_scale', 
'transformer.layers.5.mlp.proj.per_channel_scale', 'transformer.layers.26.mlp.fc.per_channel_scale', 'transformer.layers.25.attention.dense.per_channel_scale', 'transformer.layers.20.attention.qkv.per_channel_scale', 'transformer.layers.26.mlp.proj.per_channel_scale', 'transformer.layers.20.mlp.proj.per_channel_scale', 'transformer.layers.31.mlp.gate.per_channel_scale', 'transformer.layers.30.mlp.gate.per_channel_scale', 'transformer.layers.21.mlp.proj.per_channel_scale', 'transformer.layers.16.mlp.gate.per_channel_scale', 'transformer.layers.12.attention.dense.per_channel_scale', 'transformer.layers.17.mlp.fc.per_channel_scale', 'transformer.layers.29.attention.dense.per_channel_scale', 'transformer.layers.26.attention.qkv.per_channel_scale', 'transformer.layers.21.attention.dense.per_channel_scale', 'transformer.layers.5.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.proj.per_channel_scale', 'transformer.layers.19.attention.qkv.per_channel_scale', 'transformer.layers.24.mlp.fc.per_channel_scale', 'transformer.layers.26.mlp.gate.per_channel_scale', 'transformer.layers.22.mlp.fc.per_channel_scale', 'transformer.layers.7.mlp.fc.per_channel_scale', 'transformer.layers.13.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.proj.per_channel_scale', 'transformer.layers.10.mlp.proj.per_channel_scale', 'transformer.layers.3.mlp.gate.per_channel_scale', 'transformer.layers.17.mlp.proj.per_channel_scale', 'transformer.layers.21.mlp.fc.per_channel_scale', 'transformer.layers.1.mlp.gate.per_channel_scale', 'transformer.layers.30.attention.qkv.per_channel_scale', 'transformer.layers.13.mlp.fc.per_channel_scale', 'transformer.layers.2.attention.dense.per_channel_scale', 'transformer.layers.23.mlp.fc.per_channel_scale', 'transformer.layers.31.mlp.fc.per_channel_scale', 'transformer.layers.10.mlp.gate.per_channel_scale', 'transformer.layers.11.mlp.fc.per_channel_scale', 'transformer.layers.3.attention.dense.per_channel_scale', 
'transformer.layers.25.mlp.proj.per_channel_scale', 'transformer.layers.7.attention.dense.per_channel_scale', 'transformer.layers.16.mlp.proj.per_channel_scale', 'transformer.layers.27.mlp.gate.per_channel_scale', 'transformer.layers.17.attention.dense.per_channel_scale', 'transformer.layers.10.attention.dense.per_channel_scale', 'transformer.layers.0.mlp.fc.per_channel_scale', 'transformer.layers.11.attention.dense.per_channel_scale', 'transformer.layers.9.mlp.gate.per_channel_scale', 'transformer.layers.17.attention.qkv.per_channel_scale', 'transformer.layers.22.mlp.gate.per_channel_scale', 'transformer.layers.17.mlp.gate.per_channel_scale', 'transformer.layers.6.attention.qkv.per_channel_scale', 'transformer.layers.3.mlp.fc.per_channel_scale'}
Exception ignored in: <function PretrainedModel.__del__ at 0x7f9b6817de10>
Traceback (most recent call last):
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 377, in __del__
self.release()
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 374, in release
release_gc()
File "/app/tensorrt-llm/tensorrt_llm/_utils.py", line 443, in release_gc
torch.cuda.ipc_collect()
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 813, in ipc_collect
_lazy_init()
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable
CUDA call was originally invoked at:
File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 9, in <module>
import tensorrt_llm
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/__init__.py", line 32, in <module>
import tensorrt_llm.functional as functional
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/functional.py", line 28, in <module>
from . import graph_rewriting as gw
File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/graph_rewriting.py", line 12, in <module>
from .network import Network
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/network.py", line 26, in <module>
from tensorrt_llm.module import Module
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/module.py", line 17, in <module>
from ._common import default_net
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/tensorrt-llm/tensorrt_llm/_common.py", line 26, in <module>
import torch
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/venv_dev/lib/python3.10/site-packages/torch/__init__.py", line 1427, in <module>
_C._initExtension(manager_path())
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1303, in <module>
_lazy_call(_register_triton_kernels)
File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
I also tried in a clean Docker environment and got the same error.
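For background, the per_channel_scale tensors the error lists are what per-channel int4 weight-only quantization normally emits alongside the quantized weights. A minimal sketch of the idea (not TensorRT-LLM's actual implementation, which also packs two int4 values per byte):

```python
# Minimal sketch of symmetric per-output-channel int4 weight-only
# quantization: one scale per output channel so the largest magnitude
# in that row maps to the int4 extreme of 7.
import numpy as np

def quantize_weight_only_int4(weight: np.ndarray):
    """weight: [out_features, in_features] float matrix.
    Returns int4-range weights and the per-channel scales that a
    quantized checkpoint would store as "...per_channel_scale"."""
    per_channel_scale = np.abs(weight).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(weight / per_channel_scale), -8, 7).astype(np.int8)
    return q, per_channel_scale.squeeze(1)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
q, scale = quantize_weight_only_int4(w)
# Dequantization is just q * scale per output channel.
w_approx = q.astype(np.float32) * scale[:, None]
```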
This issue will be fixed in the next main branch update. Please give it a try then.
This issue is fixed in the latest main branch (commit id: f430a4b447ef4cba22698902d43eae0debf08594). Could you give it a try?
Works now. Thank you very much!
I am getting this error now when trying to convert a fine-tuned Llama-3 8B GPTQ safetensors checkpoint. Does the patch f430a4b address GPTQ?
Could you file a new issue sharing the error you encounter and the steps to reproduce it?