
Failed to build Llama-3-70B-Instruct with w4a16

gloritygithub11 opened this issue 1 year ago • 11 comments

System Info

tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.10.0.dev2024050700

A100 40GB

Who can help?

@byshiue

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

set -ex

export MODEL_DIR=/mnt/memory
export MODEL_OUTPUT_DIR=$MODEL_DIR/tmp

export MODEL_NAME=Meta-Llama-3-70B-Instruct

export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/tensorrt/bin:$PATH
export DTYPE=float16
export PYTHONPATH=/app/tensorrt-llm:$PYTHONPATH
export WEIGHT_BIT=4
export PARALLEL_SIZE=1


export QUANTIZE=w${WEIGHT_BIT}a16
export WEIGHT_PRECISION=int${WEIGHT_BIT}
export QUANTIZE_PARAM="--use_weight_only --weight_only_precision $WEIGHT_PRECISION"


python ../llama/convert_checkpoint.py \
    --model_dir $MODEL_DIR/${MODEL_NAME} \
    --output_dir $MODEL_OUTPUT_DIR/trt_models/${MODEL_NAME}/$QUANTIZE/${PARALLEL_SIZE}-gpu \
    --dtype $DTYPE $QUANTIZE_PARAM \
    --load_model_on_cpu


trtllm-build \
    --checkpoint_dir $MODEL_OUTPUT_DIR/trt_models/${MODEL_NAME}/$QUANTIZE/${PARALLEL_SIZE}-gpu \
    --output_dir $MODEL_OUTPUT_DIR/trt_engines/${MODEL_NAME}/$QUANTIZE/${PARALLEL_SIZE}-gpu \
    --gemm_plugin $DTYPE \
    --gpt_attention_plugin $DTYPE \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 1024
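
As a quick sanity check before building, the converted checkpoint can be inspected to confirm the weight-only tensors were written. A minimal sketch (the rank0.safetensors shard name is an assumption about the converter's output layout):

from safetensors import safe_open

# Hypothetical path: point this at the --output_dir used above.
ckpt = "/mnt/memory/tmp/trt_models/Meta-Llama-3-70B-Instruct/w4a16/1-gpu/rank0.safetensors"
with safe_open(ckpt, framework="pt", device="cpu") as f:
    names = list(f.keys())
scales = [n for n in names if n.endswith("per_channel_scale")]
print(f"{len(names)} tensors total, {len(scales)} per-channel scale tensors")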

Expected behavior

The build succeeds.

Actual behavior

trtllm-build fails with the following error:

[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/input_layernorm/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/input_layernorm/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/62/post_layernorm/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/60/post_layernorm/SHUFFLE_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/post_layernorm/ELEMENTWISE_POW_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/55/mlp/gate/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/69/input_layernorm/CONSTANT_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/45/input_layernorm/SHUFFLE_0_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[05/20/2024-14:41:10] [TRT] [W] Detected layernorm nodes in FP16.
[05/20/2024-14:41:10] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[05/20/2024-14:41:10] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[05/20/2024-14:41:10] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/20/2024-14:41:22] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[05/20/2024-14:41:22] [TRT] [I] Detected 14 inputs and 1 output network tensors.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] TllmXqaJit runtime error in tllmXqaJitCreateAndCompileProgram(&program, &context): NVRTC Internal Error (/app/tensorrt-llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/compileEngine.cpp:65)
1       0x7fe4b37bb384 /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(+0x68f384) [0x7fe4b37bb384]
2       0x7fe4b38f9855 tensorrt_llm::kernels::jit::CompileEngine::compile() const + 165
3       0x7fe4b38faaf8 tensorrt_llm::kernels::DecoderXQAImplJIT::prepare(tensorrt_llm::kernels::XQAParams const&) + 296
4       0x7fe47c5fc508 /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x9b508) [0x7fe47c5fc508]
5       0x7fe47c61855a /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xb755a) [0x7fe47c61855a]
6       0x7fe57c888f38 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd87f38) [0x7fe57c888f38]
7       0x7fe57c88985c /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd8885c) [0x7fe57c88985c]
8       0x7fe57c902caf /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xe01caf) [0x7fe57c902caf]
9       0x7fe57c8db4e0 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xdda4e0) [0x7fe57c8db4e0]
10      0x7fe57c8e207c /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde107c) [0x7fe57c8e207c]
11      0x7fe57c8e4071 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde3071) [0x7fe57c8e4071]
12      0x7fe57c52961c /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2861c) [0x7fe57c52961c]
13      0x7fe57c52e837 /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2d837) [0x7fe57c52e837]
14      0x7fe57c52f1af /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2e1af) [0x7fe57c52f1af]
15      0x7fe5293e2478 /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0xa6478) [0x7fe5293e2478]
16      0x7fe5293817a3 /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0x457a3) [0x7fe5293817a3]
17            0x53bd79 /app/venv_dev/bin/python3() [0x53bd79]
18            0x629d24 _PyObject_MakeTpCall + 356
19            0x549c2e /app/venv_dev/bin/python3() [0x549c2e]
20            0x5ae603 _PyEval_EvalFrameDefault + 19699
21            0x628d60 _PyFunction_Vectorcall + 592
22            0x628a3a PyObject_Call + 426
23            0x5ac51b _PyEval_EvalFrameDefault + 11275
24            0x628d60 _PyFunction_Vectorcall + 592
25            0x5aa025 _PyEval_EvalFrameDefault + 1813
26            0x628d60 _PyFunction_Vectorcall + 592
27            0x5a9c1b _PyEval_EvalFrameDefault + 779
28            0x628d60 _PyFunction_Vectorcall + 592
29            0x62893c PyObject_Call + 172
30            0x5ac51b _PyEval_EvalFrameDefault + 11275
31            0x628d60 _PyFunction_Vectorcall + 592
32            0x62893c PyObject_Call + 172
33            0x5ac51b _PyEval_EvalFrameDefault + 11275
34            0x628d60 _PyFunction_Vectorcall + 592
35            0x62893c PyObject_Call + 172
36            0x5ac51b _PyEval_EvalFrameDefault + 11275
37            0x628d60 _PyFunction_Vectorcall + 592
38            0x5a9c1b _PyEval_EvalFrameDefault + 779
39            0x5a8bf1 /app/venv_dev/bin/python3() [0x5a8bf1]
40            0x6d77cf PyEval_EvalCode + 127
41            0x6bb91b /app/venv_dev/bin/python3() [0x6bb91b]
42            0x6bb9a4 /app/venv_dev/bin/python3() [0x6bb9a4]
43            0x6bbde6 /app/venv_dev/bin/python3() [0x6bbde6]
44            0x6c0c84 _PyRun_SimpleFileObject + 404
45            0x6c0d57 _PyRun_AnyFileObject + 71
46            0x7042dd Py_RunMain + 877
47            0x7044bd Py_BytesMain + 45
48      0x7fe6e0245083 __libc_start_main + 243
49            0x62ff4e _start + 46
[tensorrt-llm-build-95p6x:320425] *** Process received signal ***
[tensorrt-llm-build-95p6x:320425] Signal: Aborted (6)
[tensorrt-llm-build-95p6x:320425] Signal code:  (-6)
[tensorrt-llm-build-95p6x:320425] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe6e0264090]
[tensorrt-llm-build-95p6x:320425] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fe6e026400b]
[tensorrt-llm-build-95p6x:320425] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fe6e0243859]
[tensorrt-llm-build-95p6x:320425] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e8d1)[0x7fe63e03d8d1]
[tensorrt-llm-build-95p6x:320425] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c)[0x7fe63e04937c]
[tensorrt-llm-build-95p6x:320425] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9359)[0x7fe63e048359]
[tensorrt-llm-build-95p6x:320425] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7fe63e048d11]
[tensorrt-llm-build-95p6x:320425] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bff)[0x7fe6dfe0fbff]
[tensorrt-llm-build-95p6x:320425] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7fe6dfe105ba]
[tensorrt-llm-build-95p6x:320425] [ 9] /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(+0x68f3d2)[0x7fe4b37bb3d2]
[tensorrt-llm-build-95p6x:320425] [10] /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(_ZNK12tensorrt_llm7kernels3jit13CompileEngine7compileEv+0xa5)[0x7fe4b38f9855]
[tensorrt-llm-build-95p6x:320425] [11] /app/tensorrt-llm/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7kernels17DecoderXQAImplJIT7prepareERKNS0_9XQAParamsE+0x128)[0x7fe4b38faaf8]
[tensorrt-llm-build-95p6x:320425] [12] /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x9b508)[0x7fe47c5fc508]
[tensorrt-llm-build-95p6x:320425] [13] /app/tensorrt-llm/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xb755a)[0x7fe47c61855a]
[tensorrt-llm-build-95p6x:320425] [14] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd87f38)[0x7fe57c888f38]
[tensorrt-llm-build-95p6x:320425] [15] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xd8885c)[0x7fe57c88985c]
[tensorrt-llm-build-95p6x:320425] [16] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xe01caf)[0x7fe57c902caf]
[tensorrt-llm-build-95p6x:320425] [17] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xdda4e0)[0x7fe57c8db4e0]
[tensorrt-llm-build-95p6x:320425] [18] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde107c)[0x7fe57c8e207c]
[tensorrt-llm-build-95p6x:320425] [19] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xde3071)[0x7fe57c8e4071]
[tensorrt-llm-build-95p6x:320425] [20] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2861c)[0x7fe57c52961c]
[tensorrt-llm-build-95p6x:320425] [21] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2d837)[0x7fe57c52e837]
[tensorrt-llm-build-95p6x:320425] [22] /app/venv_dev/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.10(+0xa2e1af)[0x7fe57c52f1af]
[tensorrt-llm-build-95p6x:320425] [23] /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0xa6478)[0x7fe5293e2478]
[tensorrt-llm-build-95p6x:320425] [24] /app/venv_dev/lib/python3.10/site-packages/tensorrt_bindings/tensorrt.so(+0x457a3)[0x7fe5293817a3]
[tensorrt-llm-build-95p6x:320425] [25] /app/venv_dev/bin/python3[0x53bd79]
[tensorrt-llm-build-95p6x:320425] [26] /app/venv_dev/bin/python3(_PyObject_MakeTpCall+0x164)[0x629d24]
[tensorrt-llm-build-95p6x:320425] [27] /app/venv_dev/bin/python3[0x549c2e]
[tensorrt-llm-build-95p6x:320425] [28] /app/venv_dev/bin/python3(_PyEval_EvalFrameDefault+0x4cf3)[0x5ae603]
[tensorrt-llm-build-95p6x:320425] [29] /app/venv_dev/bin/python3(_PyFunction_Vectorcall+0x250)[0x628d60]
[tensorrt-llm-build-95p6x:320425] *** End of error message ***
Aborted (core dumped)


Additional notes

Is Meta-Llama-3-70B-Instruct with w4a16 supported right now?

gloritygithub11 avatar May 20 '24 14:05 gloritygithub11

Thanks for reporting the issue.

This issue has been fixed and the fix will be included in a future update.

ming-wei avatar May 21 '24 08:05 ming-wei

Actually, the issue should already have been fixed in last week's update (0514).

@gloritygithub11 could you try with tensorrt-llm 0.10.0.dev2024051400 and let us know whether the issue is still reproducible?

Thanks!
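
For reference, the dev wheels are published on NVIDIA's PyPI index, so pinning the suggested build looks roughly like this (adjust the version tag as needed):

pip3 install tensorrt_llm==0.10.0.dev2024051400 --extra-index-url https://pypi.nvidia.com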

ming-wei avatar May 21 '24 09:05 ming-wei

@ming-wei Thanks, I will try it.

gloritygithub11 avatar May 21 '24 09:05 gloritygithub11

@ming-wei I have synced to the commit "Fix mistral v0.1 build instructions (#1373)"; now the conversion fails with this error:

python ../llama/convert_checkpoint.py --model_dir /mnt/memory/Meta-Llama-3-70B-Instruct --output_dir /app/models/tmp/trt_models/Meta-Llama-3-70B-Instruct/w4a16/1-gpu-tp --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024051400
0.11.0.dev2024051400
Traceback (most recent call last):
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 464, in <module>
    main()
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 448, in main
    convert_and_save_hf(args)
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 384, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 407, in execute
    f(args, rank)
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 371, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/app/tensorrt-llm/tensorrt_llm/models/llama/model.py", line 280, in from_hugging_face
    llama = convert.from_hugging_face(
  File "/app/tensorrt-llm/tensorrt_llm/models/llama/convert.py", line 1337, in from_hugging_face
    llama.load(weights)
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 419, in load
    raise RuntimeError(
RuntimeError: Expected but not provided tensors:{'transformer.layers.6.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.fc.per_channel_scale', 'transformer.layers.31.mlp.fc.per_channel_scale', 'transformer.layers.13.attention.dense.per_channel_scale', 'transformer.layers.26.attention.dense.per_channel_scale', 'transformer.layers.32.mlp.gate.per_channel_scale', 'transformer.layers.17.attention.qkv.per_channel_scale', 'transformer.layers.29.mlp.gate.per_channel_scale', 'transformer.layers.74.attention.qkv.per_channel_scale', 'transformer.layers.49.mlp.gate.per_channel_scale', 'transformer.layers.41.attention.dense.per_channel_scale', 'transformer.layers.22.attention.qkv.per_channel_scale', 'transformer.layers.21.mlp.proj.per_channel_scale', 'transformer.layers.11.attention.dense.per_channel_scale', 'transformer.layers.67.mlp.gate.per_channel_scale', 'transformer.layers.7.attention.qkv.per_channel_scale', 'transformer.layers.4.mlp.fc.per_channel_scale', 'transformer.layers.24.mlp.gate.per_channel_scale', 'transformer.layers.77.mlp.proj.per_channel_scale', 'transformer.layers.12.attention.qkv.per_channel_scale', 'transformer.layers.30.attention.qkv.per_channel_scale', 'transformer.layers.67.mlp.proj.per_channel_scale', 'transformer.layers.9.attention.dense.per_channel_scale', 'transformer.layers.47.mlp.fc.per_channel_scale', 'transformer.layers.68.attention.qkv.per_channel_scale', 'transformer.layers.59.attention.qkv.per_channel_scale', 'transformer.layers.19.mlp.proj.per_channel_scale', 'transformer.layers.58.mlp.fc.per_channel_scale', 'transformer.layers.36.attention.qkv.per_channel_scale', 'transformer.layers.78.mlp.fc.per_channel_scale', 'transformer.layers.70.mlp.proj.per_channel_scale', 'transformer.layers.15.attention.qkv.per_channel_scale', 'transformer.layers.6.attention.qkv.per_channel_scale', 'transformer.layers.43.mlp.proj.per_channel_scale', 'transformer.layers.67.attention.dense.per_channel_scale', 'transformer.layers.28.mlp.proj.per_channel_scale', 'transformer.layers.21.mlp.gate.per_channel_scale', 'transformer.layers.73.mlp.gate.per_channel_scale', 'transformer.layers.29.mlp.proj.per_channel_scale', 'transformer.layers.59.mlp.fc.per_channel_scale', 'transformer.layers.60.mlp.gate.per_channel_scale', 'transformer.layers.4.attention.dense.per_channel_scale', 'transformer.layers.79.mlp.gate.per_channel_scale', 'transformer.layers.58.mlp.proj.per_channel_scale', 'transformer.layers.16.mlp.proj.per_channel_scale', 'transformer.layers.74.mlp.proj.per_channel_scale', 'transformer.layers.48.attention.qkv.per_channel_scale', 'transformer.layers.54.attention.dense.per_channel_scale', 'transformer.layers.70.mlp.gate.per_channel_scale', 'transformer.layers.48.mlp.proj.per_channel_scale', 'transformer.layers.65.attention.qkv.per_channel_scale', 'transformer.layers.48.mlp.gate.per_channel_scale', 'transformer.layers.48.mlp.fc.per_channel_scale', 'transformer.layers.78.attention.qkv.per_channel_scale', 'transformer.layers.71.mlp.gate.per_channel_scale', 'transformer.layers.5.attention.dense.per_channel_scale', 'transformer.layers.62.mlp.proj.per_channel_scale', 'transformer.layers.26.mlp.proj.per_channel_scale', 'transformer.layers.40.mlp.gate.per_channel_scale', 'transformer.layers.60.attention.qkv.per_channel_scale', 'transformer.layers.21.attention.dense.per_channel_scale', 'transformer.layers.48.attention.dense.per_channel_scale', 'transformer.layers.0.attention.qkv.per_channel_scale', 'transformer.layers.32.mlp.proj.per_channel_scale', 
'transformer.layers.20.mlp.fc.per_channel_scale', 'transformer.layers.9.mlp.proj.per_channel_scale', 'transformer.layers.12.mlp.gate.per_channel_scale', 'transformer.layers.49.attention.dense.per_channel_scale', 'transformer.layers.72.mlp.proj.per_channel_scale', 'transformer.layers.22.mlp.gate.per_channel_scale', 'transformer.layers.8.mlp.fc.per_channel_scale', 'transformer.layers.30.mlp.fc.per_channel_scale', 'transformer.layers.3.attention.qkv.per_channel_scale', 'transformer.layers.76.attention.qkv.per_channel_scale', 'transformer.layers.69.attention.qkv.per_channel_scale', 'transformer.layers.8.mlp.gate.per_channel_scale', 'transformer.layers.63.mlp.gate.per_channel_scale', 'transformer.layers.24.mlp.fc.per_channel_scale', 'transformer.layers.43.attention.qkv.per_channel_scale', 'transformer.layers.35.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.fc.per_channel_scale', 'transformer.layers.17.mlp.fc.per_channel_scale', 'transformer.layers.79.attention.dense.per_channel_scale', 'transformer.layers.64.attention.qkv.per_channel_scale', 'transformer.layers.7.mlp.fc.per_channel_scale', 'transformer.layers.77.attention.dense.per_channel_scale', 'transformer.layers.66.attention.qkv.per_channel_scale', 'transformer.layers.50.mlp.fc.per_channel_scale', 'transformer.layers.31.mlp.proj.per_channel_scale', 'transformer.layers.25.attention.dense.per_channel_scale', 'transformer.layers.12.mlp.fc.per_channel_scale', 'transformer.layers.71.mlp.fc.per_channel_scale', 'transformer.layers.50.attention.dense.per_channel_scale', 'transformer.layers.16.attention.dense.per_channel_scale', 'transformer.layers.61.attention.qkv.per_channel_scale', 'transformer.layers.72.mlp.fc.per_channel_scale', 'transformer.layers.77.mlp.gate.per_channel_scale', 'transformer.layers.11.attention.qkv.per_channel_scale', 'transformer.layers.34.mlp.gate.per_channel_scale', 'transformer.layers.13.attention.qkv.per_channel_scale', 'transformer.layers.6.mlp.gate.per_channel_scale', 'transformer.layers.40.mlp.proj.per_channel_scale', 'transformer.layers.50.mlp.gate.per_channel_scale', 'transformer.layers.52.mlp.fc.per_channel_scale', 'transformer.layers.65.mlp.gate.per_channel_scale', 'transformer.layers.74.attention.dense.per_channel_scale', 'transformer.layers.45.mlp.gate.per_channel_scale', 'transformer.layers.69.mlp.proj.per_channel_scale', 'transformer.layers.41.mlp.fc.per_channel_scale', 'transformer.layers.53.mlp.fc.per_channel_scale', 'transformer.layers.17.mlp.proj.per_channel_scale', 'transformer.layers.56.mlp.gate.per_channel_scale', 'transformer.layers.2.attention.dense.per_channel_scale', 'transformer.layers.61.mlp.fc.per_channel_scale', 'transformer.layers.8.mlp.proj.per_channel_scale', 'transformer.layers.23.mlp.proj.per_channel_scale', 'transformer.layers.37.mlp.gate.per_channel_scale', 'transformer.layers.16.attention.qkv.per_channel_scale', 'transformer.layers.30.mlp.proj.per_channel_scale', 'transformer.layers.44.mlp.gate.per_channel_scale', 'transformer.layers.76.mlp.gate.per_channel_scale', 'transformer.layers.41.mlp.gate.per_channel_scale', 'transformer.layers.62.mlp.fc.per_channel_scale', 'transformer.layers.59.mlp.proj.per_channel_scale', 'transformer.layers.44.mlp.fc.per_channel_scale', 'transformer.layers.13.mlp.fc.per_channel_scale', 'transformer.layers.64.mlp.fc.per_channel_scale', 'transformer.layers.29.attention.dense.per_channel_scale', 'transformer.layers.14.mlp.proj.per_channel_scale', 'transformer.layers.17.mlp.gate.per_channel_scale', 'transformer.layers.32.mlp.fc.per_channel_scale', 
'transformer.layers.15.mlp.gate.per_channel_scale', 'transformer.layers.33.mlp.fc.per_channel_scale', 'transformer.layers.2.attention.qkv.per_channel_scale', 'transformer.layers.5.mlp.gate.per_channel_scale', 'transformer.layers.4.attention.qkv.per_channel_scale', 'transformer.layers.55.mlp.proj.per_channel_scale', 'transformer.layers.39.mlp.gate.per_channel_scale', 'transformer.layers.34.attention.qkv.per_channel_scale', 'transformer.layers.3.mlp.fc.per_channel_scale', 'transformer.layers.31.attention.dense.per_channel_scale', 'transformer.layers.54.attention.qkv.per_channel_scale', 'transformer.layers.24.mlp.proj.per_channel_scale', 'transformer.layers.25.attention.qkv.per_channel_scale', 'transformer.layers.23.attention.qkv.per_channel_scale', 'transformer.layers.11.mlp.proj.per_channel_scale', 'transformer.layers.52.mlp.proj.per_channel_scale', 'transformer.layers.53.attention.qkv.per_channel_scale', 'transformer.layers.76.attention.dense.per_channel_scale', 'transformer.layers.10.attention.dense.per_channel_scale', 'transformer.layers.76.mlp.fc.per_channel_scale', 'transformer.layers.18.mlp.gate.per_channel_scale', 'transformer.layers.75.attention.qkv.per_channel_scale', 'transformer.layers.36.attention.dense.per_channel_scale', 'transformer.layers.42.attention.qkv.per_channel_scale', 'transformer.layers.23.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.proj.per_channel_scale', 'transformer.layers.51.mlp.proj.per_channel_scale', 'transformer.layers.55.attention.dense.per_channel_scale', 'transformer.layers.33.mlp.proj.per_channel_scale', 'transformer.layers.0.attention.dense.per_channel_scale', 'transformer.layers.22.mlp.fc.per_channel_scale', 'transformer.layers.46.mlp.fc.per_channel_scale', 'transformer.layers.39.mlp.proj.per_channel_scale', 'transformer.layers.43.mlp.gate.per_channel_scale', 'transformer.layers.45.mlp.fc.per_channel_scale', 'transformer.layers.65.mlp.fc.per_channel_scale', 'transformer.layers.60.attention.dense.per_channel_scale', 'transformer.layers.36.mlp.fc.per_channel_scale', 'transformer.layers.72.attention.dense.per_channel_scale', 'transformer.layers.55.mlp.fc.per_channel_scale', 'transformer.layers.74.mlp.fc.per_channel_scale', 'transformer.layers.18.mlp.proj.per_channel_scale', 'transformer.layers.18.attention.qkv.per_channel_scale', 'transformer.layers.51.mlp.gate.per_channel_scale', 'transformer.layers.44.attention.qkv.per_channel_scale', 'transformer.layers.79.attention.qkv.per_channel_scale', 'transformer.layers.0.mlp.fc.per_channel_scale', 'transformer.layers.33.attention.dense.per_channel_scale', 'transformer.layers.2.mlp.proj.per_channel_scale', 'transformer.layers.34.mlp.proj.per_channel_scale', 'transformer.layers.74.mlp.gate.per_channel_scale', 'transformer.layers.5.mlp.proj.per_channel_scale', 'transformer.layers.59.mlp.gate.per_channel_scale', 'transformer.layers.7.mlp.proj.per_channel_scale', 'transformer.layers.49.mlp.fc.per_channel_scale', 'transformer.layers.71.attention.dense.per_channel_scale', 'transformer.layers.71.mlp.proj.per_channel_scale', 'transformer.layers.76.mlp.proj.per_channel_scale', 'transformer.layers.60.mlp.fc.per_channel_scale', 'transformer.layers.7.mlp.gate.per_channel_scale', 'transformer.layers.67.attention.qkv.per_channel_scale', 'transformer.layers.75.attention.dense.per_channel_scale', 'transformer.layers.75.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.gate.per_channel_scale', 'transformer.layers.61.mlp.proj.per_channel_scale', 'transformer.layers.28.attention.qkv.per_channel_scale', 
'transformer.layers.51.attention.dense.per_channel_scale', 'transformer.layers.40.mlp.fc.per_channel_scale', 'transformer.layers.29.attention.qkv.per_channel_scale', 'transformer.layers.36.mlp.proj.per_channel_scale', 'transformer.layers.20.mlp.gate.per_channel_scale', 'transformer.layers.58.mlp.gate.per_channel_scale', 'transformer.layers.64.mlp.proj.per_channel_scale', 'transformer.layers.33.attention.qkv.per_channel_scale', 'transformer.layers.41.attention.qkv.per_channel_scale', 'transformer.layers.0.mlp.proj.per_channel_scale', 'transformer.layers.14.attention.qkv.per_channel_scale', 'transformer.layers.10.mlp.fc.per_channel_scale', 'transformer.layers.47.attention.qkv.per_channel_scale', 'transformer.layers.53.attention.dense.per_channel_scale', 'transformer.layers.39.attention.qkv.per_channel_scale', 'transformer.layers.34.mlp.fc.per_channel_scale', 'transformer.layers.20.attention.dense.per_channel_scale', 'transformer.layers.42.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.proj.per_channel_scale', 'transformer.layers.47.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.gate.per_channel_scale', 'transformer.layers.9.mlp.gate.per_channel_scale', 'transformer.layers.39.attention.dense.per_channel_scale', 'transformer.layers.38.mlp.gate.per_channel_scale', 'transformer.layers.55.mlp.gate.per_channel_scale', 'transformer.layers.39.mlp.fc.per_channel_scale', 'transformer.layers.28.attention.dense.per_channel_scale', 'transformer.layers.3.attention.dense.per_channel_scale', 'transformer.layers.60.mlp.proj.per_channel_scale', 'transformer.layers.30.attention.dense.per_channel_scale', 'transformer.layers.42.mlp.fc.per_channel_scale', 'transformer.layers.58.attention.qkv.per_channel_scale', 'transformer.layers.9.mlp.fc.per_channel_scale', 'transformer.layers.10.attention.qkv.per_channel_scale', 'transformer.layers.4.mlp.proj.per_channel_scale', 'transformer.layers.43.attention.dense.per_channel_scale', 'transformer.layers.43.mlp.fc.per_channel_scale', 'transformer.layers.49.attention.qkv.per_channel_scale', 'transformer.layers.58.attention.dense.per_channel_scale', 'transformer.layers.19.mlp.gate.per_channel_scale', 'transformer.layers.37.attention.qkv.per_channel_scale', 'transformer.layers.19.attention.dense.per_channel_scale', 'transformer.layers.69.mlp.gate.per_channel_scale', 'transformer.layers.32.attention.qkv.per_channel_scale', 'transformer.layers.45.mlp.proj.per_channel_scale', 'transformer.layers.51.mlp.fc.per_channel_scale', 'transformer.layers.35.mlp.proj.per_channel_scale', 'transformer.layers.54.mlp.fc.per_channel_scale', 'transformer.layers.35.attention.dense.per_channel_scale', 'transformer.layers.61.mlp.gate.per_channel_scale', 'transformer.layers.65.mlp.proj.per_channel_scale', 'transformer.layers.38.mlp.proj.per_channel_scale', 'transformer.layers.2.mlp.fc.per_channel_scale', 'transformer.layers.23.mlp.fc.per_channel_scale', 'transformer.layers.75.mlp.fc.per_channel_scale', 'transformer.layers.47.attention.dense.per_channel_scale', 'transformer.layers.29.mlp.fc.per_channel_scale', 'transformer.layers.69.attention.dense.per_channel_scale', 'transformer.layers.68.mlp.fc.per_channel_scale', 'transformer.layers.42.attention.dense.per_channel_scale', 'transformer.layers.13.mlp.proj.per_channel_scale', 'transformer.layers.26.mlp.fc.per_channel_scale', 'transformer.layers.66.attention.dense.per_channel_scale', 'transformer.layers.1.attention.dense.per_channel_scale', 'transformer.layers.63.mlp.proj.per_channel_scale', 
'transformer.layers.62.attention.dense.per_channel_scale', 'transformer.layers.8.attention.dense.per_channel_scale', 'transformer.layers.57.mlp.gate.per_channel_scale', 'transformer.layers.46.mlp.proj.per_channel_scale', 'transformer.layers.27.mlp.proj.per_channel_scale', 'transformer.layers.54.mlp.proj.per_channel_scale', 'transformer.layers.38.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.proj.per_channel_scale', 'transformer.layers.66.mlp.fc.per_channel_scale', 'transformer.layers.64.attention.dense.per_channel_scale', 'transformer.layers.23.attention.dense.per_channel_scale', 'transformer.layers.26.mlp.gate.per_channel_scale', 'transformer.layers.20.mlp.proj.per_channel_scale', 'transformer.layers.5.attention.qkv.per_channel_scale', 'transformer.layers.28.mlp.gate.per_channel_scale', 'transformer.layers.47.mlp.proj.per_channel_scale', 'transformer.layers.0.mlp.gate.per_channel_scale', 'transformer.layers.75.mlp.proj.per_channel_scale', 'transformer.layers.14.mlp.fc.per_channel_scale', 'transformer.layers.25.mlp.fc.per_channel_scale', 'transformer.layers.37.mlp.proj.per_channel_scale', 'transformer.layers.56.mlp.proj.per_channel_scale', 'transformer.layers.18.attention.dense.per_channel_scale', 'transformer.layers.38.attention.qkv.per_channel_scale', 'transformer.layers.35.attention.qkv.per_channel_scale', 'transformer.layers.32.attention.dense.per_channel_scale', 'transformer.layers.71.attention.qkv.per_channel_scale', 'transformer.layers.66.mlp.proj.per_channel_scale', 'transformer.layers.19.attention.qkv.per_channel_scale', 'transformer.layers.72.attention.qkv.per_channel_scale', 'transformer.layers.46.attention.dense.per_channel_scale', 'transformer.layers.35.mlp.fc.per_channel_scale', 'transformer.layers.77.mlp.fc.per_channel_scale', 'transformer.layers.56.attention.dense.per_channel_scale', 'transformer.layers.31.attention.qkv.per_channel_scale', 'transformer.layers.63.mlp.fc.per_channel_scale', 'transformer.layers.68.mlp.proj.per_channel_scale', 'transformer.layers.37.mlp.fc.per_channel_scale', 'transformer.layers.37.attention.dense.per_channel_scale', 'transformer.layers.61.attention.dense.per_channel_scale', 'transformer.layers.10.mlp.proj.per_channel_scale', 'transformer.layers.27.mlp.gate.per_channel_scale', 'transformer.layers.26.attention.qkv.per_channel_scale', 'transformer.layers.40.attention.qkv.per_channel_scale', 'transformer.layers.22.attention.dense.per_channel_scale', 'transformer.layers.70.attention.qkv.per_channel_scale', 'transformer.layers.16.mlp.fc.per_channel_scale', 'transformer.layers.7.attention.dense.per_channel_scale', 'transformer.layers.46.mlp.gate.per_channel_scale', 'transformer.layers.64.mlp.gate.per_channel_scale', 'transformer.layers.68.mlp.gate.per_channel_scale', 'transformer.layers.19.mlp.fc.per_channel_scale', 'transformer.layers.63.attention.dense.per_channel_scale', 'transformer.layers.44.attention.dense.per_channel_scale', 'transformer.layers.27.mlp.fc.per_channel_scale', 'transformer.layers.57.mlp.proj.per_channel_scale', 'transformer.layers.11.mlp.fc.per_channel_scale', 'transformer.layers.24.attention.qkv.per_channel_scale', 'transformer.layers.18.mlp.fc.per_channel_scale', 'transformer.layers.56.mlp.fc.per_channel_scale', 'transformer.layers.53.mlp.gate.per_channel_scale', 'transformer.layers.4.mlp.gate.per_channel_scale', 'transformer.layers.70.attention.dense.per_channel_scale', 'transformer.layers.54.mlp.gate.per_channel_scale', 'transformer.layers.14.attention.dense.per_channel_scale', 
'transformer.layers.13.mlp.gate.per_channel_scale', 'transformer.layers.79.mlp.fc.per_channel_scale', 'transformer.layers.50.mlp.proj.per_channel_scale', 'transformer.layers.73.mlp.fc.per_channel_scale', 'transformer.layers.62.mlp.gate.per_channel_scale', 'transformer.layers.57.mlp.fc.per_channel_scale', 'transformer.layers.45.attention.dense.per_channel_scale', 'transformer.layers.56.attention.qkv.per_channel_scale', 'transformer.layers.3.mlp.gate.per_channel_scale', 'transformer.layers.2.mlp.gate.per_channel_scale', 'transformer.layers.63.attention.qkv.per_channel_scale', 'transformer.layers.67.mlp.fc.per_channel_scale', 'transformer.layers.12.mlp.proj.per_channel_scale', 'transformer.layers.22.mlp.proj.per_channel_scale', 'transformer.layers.77.attention.qkv.per_channel_scale', 'transformer.layers.8.attention.qkv.per_channel_scale', 'transformer.layers.65.attention.dense.per_channel_scale', 'transformer.layers.38.mlp.fc.per_channel_scale', 'transformer.layers.72.mlp.gate.per_channel_scale', 'transformer.layers.46.attention.qkv.per_channel_scale', 'transformer.layers.57.attention.dense.per_channel_scale', 'transformer.layers.52.attention.dense.per_channel_scale', 'transformer.layers.17.attention.dense.per_channel_scale', 'transformer.layers.45.attention.qkv.per_channel_scale', 'transformer.layers.73.attention.dense.per_channel_scale', 'transformer.layers.62.attention.qkv.per_channel_scale', 'transformer.layers.41.mlp.proj.per_channel_scale', 'transformer.layers.14.mlp.gate.per_channel_scale', 'transformer.layers.11.mlp.gate.per_channel_scale', 'transformer.layers.28.mlp.fc.per_channel_scale', 'transformer.layers.52.attention.qkv.per_channel_scale', 'transformer.layers.34.attention.dense.per_channel_scale', 'transformer.layers.24.attention.dense.per_channel_scale', 'transformer.layers.6.mlp.proj.per_channel_scale', 'transformer.layers.70.mlp.fc.per_channel_scale', 'transformer.layers.15.attention.dense.per_channel_scale', 'transformer.layers.55.attention.qkv.per_channel_scale', 'transformer.layers.53.mlp.proj.per_channel_scale', 'transformer.layers.50.attention.qkv.per_channel_scale', 'transformer.layers.20.attention.qkv.per_channel_scale', 'transformer.layers.78.mlp.proj.per_channel_scale', 'transformer.layers.69.mlp.fc.per_channel_scale', 'transformer.layers.27.attention.qkv.per_channel_scale', 'transformer.layers.21.mlp.fc.per_channel_scale', 'transformer.layers.16.mlp.gate.per_channel_scale', 'transformer.layers.59.attention.dense.per_channel_scale', 'transformer.layers.73.attention.qkv.per_channel_scale', 'transformer.layers.27.attention.dense.per_channel_scale', 'transformer.layers.51.attention.qkv.per_channel_scale', 'transformer.layers.66.mlp.gate.per_channel_scale', 'transformer.layers.44.mlp.proj.per_channel_scale', 'transformer.layers.78.attention.dense.per_channel_scale', 'transformer.layers.68.attention.dense.per_channel_scale', 'transformer.layers.9.attention.qkv.per_channel_scale', 'transformer.layers.52.mlp.gate.per_channel_scale', 'transformer.layers.21.attention.qkv.per_channel_scale', 'transformer.layers.79.mlp.proj.per_channel_scale', 'transformer.layers.6.mlp.fc.per_channel_scale', 'transformer.layers.30.mlp.gate.per_channel_scale', 'transformer.layers.10.mlp.gate.per_channel_scale', 'transformer.layers.31.mlp.gate.per_channel_scale', 'transformer.layers.78.mlp.gate.per_channel_scale', 'transformer.layers.40.attention.dense.per_channel_scale', 'transformer.layers.33.mlp.gate.per_channel_scale', 'transformer.layers.3.mlp.proj.per_channel_scale', 
'transformer.layers.12.attention.dense.per_channel_scale', 'transformer.layers.1.attention.qkv.per_channel_scale', 'transformer.layers.49.mlp.proj.per_channel_scale', 'transformer.layers.73.mlp.proj.per_channel_scale', 'transformer.layers.36.mlp.gate.per_channel_scale', 'transformer.layers.57.attention.qkv.per_channel_scale', 'transformer.layers.42.mlp.proj.per_channel_scale', 'transformer.layers.5.mlp.fc.per_channel_scale'}
Exception ignored in: <function PretrainedModel.__del__ at 0x7f0efac415a0>
Traceback (most recent call last):
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 377, in __del__
    self.release()
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 374, in release
    release_gc()
  File "/app/tensorrt-llm/tensorrt_llm/_utils.py", line 443, in release_gc
    torch.cuda.ipc_collect()
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 813, in ipc_collect
    _lazy_init()
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 9, in <module>
    import tensorrt_llm
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/functional.py", line 28, in <module>
    from . import graph_rewriting as gw
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/network.py", line 26, in <module>
    from tensorrt_llm.module import Module
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/_common.py", line 26, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/venv_dev/lib/python3.10/site-packages/torch/__init__.py", line 1427, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1303, in <module>
    _lazy_call(_register_triton_kernels)
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

gloritygithub11 avatar May 22 '24 02:05 gloritygithub11

@byshiue This seems to be a different error, not related to XQA. Could you help triage and reroute the issue?

Thanks.

ming-wei avatar May 22 '24 06:05 ming-wei

I also met this before in https://github.com/NVIDIA/TensorRT-LLM/issues/1628

WDONG66 avatar May 23 '24 06:05 WDONG66

I don't have the 70B model right now, but I tried the 8B model and it works well with int4 weight-only:

python3 examples/llama/convert_checkpoint.py --model_dir llama-v3-8b-instruct-hf/ --output_dir /tmp/llama-v3/ --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu
[05/23/2024-09:12:44] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
0.11.0.dev2024052100
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.44it/s]
Weights loaded. Total time: 00:02:01
Total time of converting checkpoints: 00:02:14

Could you give it a try?

byshiue avatar May 23 '24 09:05 byshiue

@byshiue I had tried 8B before and got the same error. I noticed you are using the newer version 0.11.0.dev2024052100; I will try that version.

gloritygithub11 avatar May 23 '24 10:05 gloritygithub11

@byshiue

I synced the code to "Update TensorRT-LLM" (https://github.com/NVIDIA/TensorRT-LLM/pull/1639)

I still get the same error. The loader checks that the converted model contains quantized parameters like transformer.layers.0.attention.qkv.per_channel_scale, and they are missing.
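
For context, those per_channel_scale tensors are what symmetric per-channel weight-only quantization produces alongside the packed int4 weights. A conceptual sketch in PyTorch (not TensorRT-LLM's actual converter code):

import torch

def weight_only_int4(w: torch.Tensor):
    # w: [out_features, in_features] fp16 weight matrix.
    # One scale per output channel, mapping max |w| onto the int4 range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    # The checkpoint must store both q and scale,
    # e.g. 'transformer.layers.0.attention.qkv.per_channel_scale'.
    return q, scale.squeeze(1)

The RuntimeError below indicates the converter wrote the quantized weights without these scale tensors, so the loader's completeness check fails.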

python ../llama/convert_checkpoint.py --model_dir /mnt/memory/Meta-Llama-3-8B-Instruct --output_dir /mnt/memory/tmp/trt_models/Meta-Llama-3-8B-Instruct/w4a16/1-gpu-tp --dtype float16 --use_weight_only --weight_only_precision int4 --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
0.11.0.dev2024052100
Traceback (most recent call last):
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 471, in <module>
    main()
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 455, in main
    convert_and_save_hf(args)
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 391, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 414, in execute
    f(args, rank)
  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 378, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/app/tensorrt-llm/tensorrt_llm/models/llama/model.py", line 280, in from_hugging_face
    llama = convert.from_hugging_face(
  File "/app/tensorrt-llm/tensorrt_llm/models/llama/convert.py", line 1337, in from_hugging_face
    llama.load(weights)
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 421, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'transformer.layers.13.attention.dense.per_channel_scale', 'transformer.layers.0.attention.qkv.per_channel_scale', 'transformer.layers.27.attention.dense.per_channel_scale', 'transformer.layers.24.attention.dense.per_channel_scale', 'transformer.layers.0.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.fc.per_channel_scale', 'transformer.layers.9.mlp.proj.per_channel_scale', 'transformer.layers.22.attention.dense.per_channel_scale', 'transformer.layers.8.mlp.fc.per_channel_scale', 'transformer.layers.11.mlp.proj.per_channel_scale', 'transformer.layers.5.attention.qkv.per_channel_scale', 'transformer.layers.23.mlp.proj.per_channel_scale', 'transformer.layers.21.mlp.gate.per_channel_scale', 'transformer.layers.12.mlp.proj.per_channel_scale', 'transformer.layers.20.attention.dense.per_channel_scale', 'transformer.layers.8.mlp.gate.per_channel_scale', 'transformer.layers.24.attention.qkv.per_channel_scale', 'transformer.layers.29.mlp.proj.per_channel_scale', 'transformer.layers.14.mlp.proj.per_channel_scale', 'transformer.layers.19.mlp.proj.per_channel_scale', 'transformer.layers.18.mlp.proj.per_channel_scale', 'transformer.layers.28.mlp.fc.per_channel_scale', 'transformer.layers.20.mlp.fc.per_channel_scale', 'transformer.layers.27.mlp.fc.per_channel_scale', 'transformer.layers.23.mlp.gate.per_channel_scale', 'transformer.layers.4.attention.qkv.per_channel_scale', 'transformer.layers.9.mlp.fc.per_channel_scale', 'transformer.layers.1.mlp.fc.per_channel_scale', 'transformer.layers.14.mlp.gate.per_channel_scale', 'transformer.layers.29.mlp.fc.per_channel_scale', 'transformer.layers.4.mlp.fc.per_channel_scale', 'transformer.layers.13.attention.qkv.per_channel_scale', 'transformer.layers.6.mlp.gate.per_channel_scale', 'transformer.layers.13.mlp.proj.per_channel_scale', 'transformer.layers.23.attention.dense.per_channel_scale', 'transformer.layers.28.attention.qkv.per_channel_scale', 'transformer.layers.16.attention.dense.per_channel_scale', 'transformer.layers.12.mlp.gate.per_channel_scale', 'transformer.layers.14.mlp.fc.per_channel_scale', 'transformer.layers.27.mlp.proj.per_channel_scale', 'transformer.layers.21.attention.qkv.per_channel_scale', 'transformer.layers.5.mlp.fc.per_channel_scale', 'transformer.layers.4.attention.dense.per_channel_scale', 'transformer.layers.4.mlp.proj.per_channel_scale', 'transformer.layers.18.attention.qkv.per_channel_scale', 'transformer.layers.18.attention.dense.per_channel_scale', 'transformer.layers.31.mlp.proj.per_channel_scale', 'transformer.layers.2.mlp.fc.per_channel_scale', 'transformer.layers.3.mlp.proj.per_channel_scale', 'transformer.layers.6.mlp.proj.per_channel_scale', 'transformer.layers.7.attention.qkv.per_channel_scale', 'transformer.layers.30.mlp.fc.per_channel_scale', 'transformer.layers.15.attention.qkv.per_channel_scale', 'transformer.layers.22.attention.qkv.per_channel_scale', 'transformer.layers.0.mlp.proj.per_channel_scale', 'transformer.layers.12.mlp.fc.per_channel_scale', 'transformer.layers.16.attention.qkv.per_channel_scale', 'transformer.layers.31.attention.dense.per_channel_scale', 'transformer.layers.18.mlp.gate.per_channel_scale', 'transformer.layers.8.attention.qkv.per_channel_scale', 'transformer.layers.2.attention.qkv.per_channel_scale', 'transformer.layers.24.mlp.proj.per_channel_scale', 'transformer.layers.1.attention.dense.per_channel_scale', 'transformer.layers.22.mlp.proj.per_channel_scale', 'transformer.layers.29.attention.qkv.per_channel_scale', 
'transformer.layers.29.mlp.gate.per_channel_scale', 'transformer.layers.7.mlp.proj.per_channel_scale', 'transformer.layers.11.mlp.gate.per_channel_scale', 'transformer.layers.28.attention.dense.per_channel_scale', 'transformer.layers.31.attention.qkv.per_channel_scale', 'transformer.layers.10.mlp.fc.per_channel_scale', 'transformer.layers.28.mlp.proj.per_channel_scale', 'transformer.layers.28.mlp.gate.per_channel_scale', 'transformer.layers.6.mlp.fc.per_channel_scale', 'transformer.layers.14.attention.dense.per_channel_scale', 'transformer.layers.25.attention.qkv.per_channel_scale', 'transformer.layers.4.mlp.gate.per_channel_scale', 'transformer.layers.11.attention.qkv.per_channel_scale', 'transformer.layers.18.mlp.fc.per_channel_scale', 'transformer.layers.8.attention.dense.per_channel_scale', 'transformer.layers.6.attention.dense.per_channel_scale', 'transformer.layers.5.mlp.gate.per_channel_scale', 'transformer.layers.0.attention.dense.per_channel_scale', 'transformer.layers.23.attention.qkv.per_channel_scale', 'transformer.layers.19.mlp.fc.per_channel_scale', 'transformer.layers.2.mlp.proj.per_channel_scale', 'transformer.layers.16.mlp.fc.per_channel_scale', 'transformer.layers.10.attention.qkv.per_channel_scale', 'transformer.layers.3.attention.qkv.per_channel_scale', 'transformer.layers.7.mlp.gate.per_channel_scale', 'transformer.layers.30.attention.dense.per_channel_scale', 'transformer.layers.20.mlp.gate.per_channel_scale', 'transformer.layers.15.attention.dense.per_channel_scale', 'transformer.layers.14.attention.qkv.per_channel_scale', 'transformer.layers.15.mlp.fc.per_channel_scale', 'transformer.layers.30.mlp.proj.per_channel_scale', 'transformer.layers.24.mlp.gate.per_channel_scale', 'transformer.layers.8.mlp.proj.per_channel_scale', 'transformer.layers.12.attention.qkv.per_channel_scale', 'transformer.layers.19.attention.dense.per_channel_scale', 'transformer.layers.1.attention.qkv.per_channel_scale', 'transformer.layers.19.mlp.gate.per_channel_scale', 'transformer.layers.25.mlp.gate.per_channel_scale', 'transformer.layers.26.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.gate.per_channel_scale', 'transformer.layers.27.attention.qkv.per_channel_scale', 'transformer.layers.2.mlp.gate.per_channel_scale', 'transformer.layers.9.attention.qkv.per_channel_scale', 'transformer.layers.9.attention.dense.per_channel_scale', 'transformer.layers.5.mlp.proj.per_channel_scale', 'transformer.layers.26.mlp.fc.per_channel_scale', 'transformer.layers.25.attention.dense.per_channel_scale', 'transformer.layers.20.attention.qkv.per_channel_scale', 'transformer.layers.26.mlp.proj.per_channel_scale', 'transformer.layers.20.mlp.proj.per_channel_scale', 'transformer.layers.31.mlp.gate.per_channel_scale', 'transformer.layers.30.mlp.gate.per_channel_scale', 'transformer.layers.21.mlp.proj.per_channel_scale', 'transformer.layers.16.mlp.gate.per_channel_scale', 'transformer.layers.12.attention.dense.per_channel_scale', 'transformer.layers.17.mlp.fc.per_channel_scale', 'transformer.layers.29.attention.dense.per_channel_scale', 'transformer.layers.26.attention.qkv.per_channel_scale', 'transformer.layers.21.attention.dense.per_channel_scale', 'transformer.layers.5.attention.dense.per_channel_scale', 'transformer.layers.15.mlp.proj.per_channel_scale', 'transformer.layers.19.attention.qkv.per_channel_scale', 'transformer.layers.24.mlp.fc.per_channel_scale', 'transformer.layers.26.mlp.gate.per_channel_scale', 'transformer.layers.22.mlp.fc.per_channel_scale', 
'transformer.layers.7.mlp.fc.per_channel_scale', 'transformer.layers.13.mlp.gate.per_channel_scale', 'transformer.layers.1.mlp.proj.per_channel_scale', 'transformer.layers.10.mlp.proj.per_channel_scale', 'transformer.layers.3.mlp.gate.per_channel_scale', 'transformer.layers.17.mlp.proj.per_channel_scale', 'transformer.layers.21.mlp.fc.per_channel_scale', 'transformer.layers.1.mlp.gate.per_channel_scale', 'transformer.layers.30.attention.qkv.per_channel_scale', 'transformer.layers.13.mlp.fc.per_channel_scale', 'transformer.layers.2.attention.dense.per_channel_scale', 'transformer.layers.23.mlp.fc.per_channel_scale', 'transformer.layers.31.mlp.fc.per_channel_scale', 'transformer.layers.10.mlp.gate.per_channel_scale', 'transformer.layers.11.mlp.fc.per_channel_scale', 'transformer.layers.3.attention.dense.per_channel_scale', 'transformer.layers.25.mlp.proj.per_channel_scale', 'transformer.layers.7.attention.dense.per_channel_scale', 'transformer.layers.16.mlp.proj.per_channel_scale', 'transformer.layers.27.mlp.gate.per_channel_scale', 'transformer.layers.17.attention.dense.per_channel_scale', 'transformer.layers.10.attention.dense.per_channel_scale', 'transformer.layers.0.mlp.fc.per_channel_scale', 'transformer.layers.11.attention.dense.per_channel_scale', 'transformer.layers.9.mlp.gate.per_channel_scale', 'transformer.layers.17.attention.qkv.per_channel_scale', 'transformer.layers.22.mlp.gate.per_channel_scale', 'transformer.layers.17.mlp.gate.per_channel_scale', 'transformer.layers.6.attention.qkv.per_channel_scale', 'transformer.layers.3.mlp.fc.per_channel_scale'}
Exception ignored in: <function PretrainedModel.__del__ at 0x7f9b6817de10>
Traceback (most recent call last):
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 377, in __del__
    self.release()
  File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 374, in release
    release_gc()
  File "/app/tensorrt-llm/tensorrt_llm/_utils.py", line 443, in release_gc
    torch.cuda.ipc_collect()
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 813, in ipc_collect
    _lazy_init()
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/app/tensorrt-llm/examples/llama/../llama/convert_checkpoint.py", line 9, in <module>
    import tensorrt_llm
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/functional.py", line 28, in <module>
    from . import graph_rewriting as gw
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/network.py", line 26, in <module>
    from tensorrt_llm.module import Module
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/tensorrt-llm/tensorrt_llm/_common.py", line 26, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/venv_dev/lib/python3.10/site-packages/torch/__init__.py", line 1427, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1303, in <module>
    _lazy_call(_register_triton_kernels)
  File "/app/venv_dev/lib/python3.10/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

gloritygithub11 avatar May 24 '24 01:05 gloritygithub11

I also tried in a clean Docker environment and got the same error.

gloritygithub11 avatar May 24 '24 04:05 gloritygithub11

This issue will be fixed in the next main branch update. Please give it a try then.

byshiue avatar May 27 '24 09:05 byshiue

This issue is fixed in the latest main branch (commit id: f430a4b447ef4cba22698902d43eae0debf08594). Could you give it a try?
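
For anyone testing from a source checkout, picking up that commit looks roughly like this (the build_wheel.py invocation follows the repo's build instructions; adjust --trt_root to your TensorRT install):

git fetch origin
git checkout f430a4b447ef4cba22698902d43eae0debf08594
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
pip3 install --force-reinstall build/tensorrt_llm*.whl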

byshiue avatar May 29 '24 02:05 byshiue

It works now. Thank you very much.

gloritygithub11 avatar May 30 '24 04:05 gloritygithub11

I am now getting this error when trying to convert a fine-tuned Llama 3 8B GPTQ safetensors checkpoint. Does the patch f430a4b address GPTQ?

ashwin-js avatar Jul 15 '24 04:07 ashwin-js

Could you file a new issue sharing the error you encounter and the steps to reproduce it?

byshiue avatar Jul 17 '24 07:07 byshiue