
Running into free(): double free detected in tcache 2 when using trtllm-bench in a multi-node scenario

Open snl-nvda opened this issue 9 months ago • 16 comments

I am trying to run distributed inference benchmarks for a large model (across 4 nodes / 32 GPUs) using trtllm-bench. However, I run into the error below and am not sure what fixes it:

INFO - flashinfer.jit: Finished loading JIT ops: norm
free(): double free detected in tcache 2
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)

[4] init.cc:720 NCCL WARN Duplicate GPU detected : rank 28 and rank 4 both on CUDA device *****
[ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x15555520c320]
[ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x155555265b1c]
[ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x15555520c26e]
[ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x1555551ef8ff]
[ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x1555551f07b6]
[ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa8fe5)[0x15555526ffe5]
[ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xab54f)[0x15555527254f]
[ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_free+0x7e)[0x155555274d9e]
[ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6437e)[0x1552ff26337e]

Thanks!

snl-nvda · Mar 19 '25

I encountered the same issue while trying to deploy deepseek-r1 using trtllm-serve on two nodes with 16 GPUs (H20).

Nekofish-L · Mar 20 '25

@snl-nvda @Nekofish-L

Could you share the following information for investigation:

  1. The version/commit of the tensorrt_llm package you used (one way to print this is sketched below the list)
  2. The full logs of the execution
  3. The command/script for reproduction; please keep it minimal, with only the tensorrt_llm logic
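
For item 1, a quick way to capture the installed version (a minimal sketch; adjust if you built from source):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # installed version string
pip3 show tensorrt_llm                                              # full package metadata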

Superjomn · Mar 20 '25

@snl-nvda @Nekofish-L

Could you share the following information for investigation:

  1. The version/commit of the tensorrt_llm package you used
  2. The full logs of the execution
  3. The command/script for reproduction, please keep it neat with only tensorrt_llm logic

@Superjomn Hi, I ran into this bug too.

docker image: nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3

tensorrt_llm version: 0.19.0.dev2025031800 and the latest master

command:

mpirun -H x.x.x.x:8,x.x.x.x:8 -mca plm_rsh_args "-p 2234" --allow-run-as-root -n 16 \
    trtllm-llmapi-launch trtllm-serve /data/deepseek/DeepSeek-R1 \
    --tokenizer /data/deepseek/DeepSeek-R1 --max_seq_len 6000 --tp_size 16 --gpus_per_node 8 \
    --extra_llm_api_options /data/extra-llm-api-config.yml --backend pytorch \
    --max_batch_size 300 --max_num_tokens 200 --host 0.0.0.0

extra-llm-api-config.yml:

enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
  cuda_graph_max_batch_size: 4

logs: trtllm_deepseek.log

lishicheng1996 · Mar 24 '25

Same issue observed running the DeepSeek-V3 example trtllm-bench command on a single H200 node.

Command: trtllm-bench --model deepseek-ai/DeepSeek-V3 --model_path /workspace/dsv0324/ throughput --backend pytorch --max_batch_size 2 --max_num_tokens 1160 --dataset /workspace/dataset.txt --tp 8 --ep 4 --pp 1 --concurrency 2 --streaming --kv_cache_free_gpu_mem_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml 2>&1 | tee /workspace/trt_bench.log

Built from: a570578c7faea619244d57d6599e4afa58537119

laikhtewari · Mar 24 '25

Let me check if this issue is orthogonal to the node count.

Superjomn · Mar 25 '25

Hi, I got the same error when using the latest image nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3. Does that mean I can't use trtllm-serve?

python3 /app/examples/qwen/convert_checkpoint.py --model_dir /models/Qwen2.5-32B-Instruct \
    --tp_size 2 \
    --use_weight_only \
    --weight_only_precision int4 \
    --output_dir /c-model/qwen2.5-32b/int-4/2-gpu

build engines

trtllm-build --checkpoint_dir /c-model/qwen2.5-32b/int-4/2-gpu \
    --remove_input_padding enable \
    --kv_cache_type paged \
    --workers 1 \
    --output_dir /engines/qwen2.5-32b/int4-2gpu/qwen2.5-32b

serve model

trtllm-serve /engines/qwen2.5-32b/int4-2gpu/qwen2.5-32b \
    --tokenizer /models/Qwen2.5-7B-Instruct \
    --max_batch_size 128 --max_num_tokens 4096 --max_seq_len 4096 \
    --host "0.0.0.0" --port 4005 --tp_size 2

[TensorRT-LLM][INFO] Loaded engine size: 9709 MiB
[TensorRT-LLM][INFO] Loaded engine size: 9709 MiB
free(): double free detected in tcache 2
[gpu8:39122] *** Process received signal ***
[gpu8:39122] Signal: Aborted (6)
[gpu8:39122] Signal code:  (-6)
free(): double free detected in tcache 2
[gpu8:39123] *** Process received signal ***
[gpu8:39123] Signal: Aborted (6)
[gpu8:39123] Signal code:  (-6)

@Superjomn Could you please have a look? I'd appreciate it if you could help!

Justin-12138 · Mar 26 '25

@Justin-12138 did you use the built-in tensorrt_llm in the nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3 docker? That is an older version. The preferred way is to build tensorrt_llm from source in the nvcr.io/nvidia/pytorch:25.01-py3 docker, or just pip install it in an Ubuntu 24.04 docker. Let me check this docker with the models and try to get some recipes.
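
For the pip route, something along these lines should pull a recent wheel (a sketch; pin the exact version you want to test):

pip3 install --upgrade --pre tensorrt_llm --extra-index-url https://pypi.nvidia.com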

Superjomn · Mar 26 '25

@Superjomn, I installed the latest version in the container; the model is Qwen2.5-32B-Instruct.

root@gpu8:/opt/tritonserver# pip list | grep tensor
safetensors            0.5.2
tensorrt               10.8.0.43
tensorrt_llm           0.17.0.post1

I got the same error as follows:
[TensorRT-LLM][INFO] Loaded engine size: 9709 MiB
[TensorRT-LLM][INFO] Loaded engine size: 9709 MiB
double free or corruption (!prev)
[gpu8:40527] *** Process received signal ***
[gpu8:40527] Signal: Aborted (6)
[gpu8:40527] Signal code:  (-6)
free(): double free detected in tcache 2
[gpu8:40526] *** Process received signal ***
[gpu8:40526] Signal: Aborted (6)
[gpu8:40526] Signal code:  (-6)
malloc(): unsorted double linked list corrupted
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

Justin-12138 · Mar 26 '25

you can reset by

rm -rf /dev/shm/nccl-*
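
In a multi-node run the stale shared-memory segments need to be cleared on every node before relaunching. A minimal sketch, assuming passwordless ssh and placeholder hostnames:

for host in node1 node2 node3 node4; do
    # remove NCCL shared-memory segments left behind by the crashed run
    ssh "$host" 'rm -rf /dev/shm/nccl-*'
done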

aspctu · Mar 26 '25

Thanks @aspctu, that did help avoid the "double free" error.

I am now trying with the DeepSeek-V3 model and running into another size error (while estimating the number of kv_cache tokens): RuntimeError: N must be a multiple of 128, (N=2112). Which config param needs to be adapted to get past this?

Full trace:

[03/26/2025-14:56:42] [TRT-LLM] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 628, in worker_main
    worker: ExecutorBindingsWorker = worker_cls(
                                     ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 119, in __init__
    self.engine = _create_engine()
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 115, in _create_engine
    return create_executor(executor_config=executor_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 115, in create_py_executor
    kv_cache_max_tokens = estimate_max_kv_cache_tokens(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 126, in estimate_max_kv_cache_tokens
    model_engine.forward(req, resource_manager)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1373, in forward
    return self._forward_step(inputs, gather_ids)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1425, in _forward_step
    logits = self.model.forward(**inputs,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 728, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 670, in forward
    hidden_states, residual = decoder_layer(position_ids=position_ids,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 453, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/attention.py", line 392, in forward
    compressed_q, compressed_kv, k_pe = self.fused_a(
                                        ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 385, in forward
    output = self.apply_linear(input, self.weight, self.bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 315, in apply_linear
    output = torch.ops.trtllm.fp8_block_scaling_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: N must be a multiple of 128, (N=2112)

snl-nvda · Mar 26 '25

Running into the same issue. There's a 128 divisibility check on N on Blackwell but not on Hopper; on Hopper the requirement is only a multiple of 16, in fp8_block_scaling_gemm.

https://github.com/NVIDIA/TensorRT-LLM/blob/60d4dacc47ba18b3aed425dd4c5af8cbc8068169/cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp#L135

I wonder why the difference.
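
For the shape in the trace above, the arithmetic is quick to check per the constraints described here (just illustrating the failing check, not the kernel code):

echo $((2112 % 128))   # 64 -> 2112 is not a multiple of 128, so the stricter check rejects it
echo $((2112 % 16))    # 0  -> 2112 is a multiple of 16, so the looser Hopper-side check would pass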

pankajroark · Mar 26 '25

@Justin-12138 does Qwen2.5-32B-Instruct work now?

Superjomn · Mar 26 '25

@Justin-12138 does Qwen2.5-32B-Instruct work now?

I will try it later. But I hit a strange problem in a bare-metal environment: I ran trtllm-serve /engines/qwen2.5-32b/int4-2gpu/qwen2.5-32b --tokenizer /models/Qwen2.5-7B-Instruct --max_batch_size 128 --max_num_tokens 4096 --max_seq_len 4096 --host "0.0.0.0" --port 4005 --tp_size 2 and there was no error, but also no port being listened on and no GPU usage.

Justin-12138 · Mar 27 '25

@Superjomn I tried it again and got the same error.

# Convert weights from HF Transformers to TensorRT-LLM checkpoint
python3 /app/examples/qwen/convert_checkpoint.py --model_dir /models/Qwen2.5-32B-Instruct \
        --tp_size 2 \
        --use_weight_only \
        --weight_only_precision int4 \
        --output_dir /c-model/qwen2.5-32b/int-4/2-gpu
        
# build engines
trtllm-build --checkpoint_dir /c-model/qwen2.5-32b/int-4/2-gpu \
        --remove_input_padding enable \
        --kv_cache_type paged \
        --workers 1 \
        --output_dir /engines/qwen2.5-32b/int4-2gpu/qwen2.5-32b

# serve model
trtllm-serve /engines/qwen2.5-32b/int4-2gpu/qwen2.5-32b \
--tokenizer /models/Qwen2.5-7B-Instruct \
--max_batch_size 128 --max_num_tokens 4096 --max_seq_len 4096 \
--host "0.0.0.0" --port 4005 --tp_size 2

# the error:
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 9709 MiB
[TensorRT-LLM][INFO] Loaded engine size: 9709 MiB
double free or corruption (!prev)
[gpu8:41593] *** Process received signal ***
[gpu8:41593] Signal: Aborted (6)
[gpu8:41593] Signal code:  (-6)
free(): double free detected in tcache 2
[gpu8:41592] *** Process received signal ***
[gpu8:41592] Signal: Aborted (6)
[gpu8:41592] Signal code:  (-6)
malloc(): unsorted double linked list corrupted
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

Justin-12138 · Mar 27 '25

@Superjomn I tried it again and got the same error.


@Justin-12138 OK, let me check the public recipe.

Superjomn · Mar 27 '25

@Justin-12138 I also encountered some issues with pip install tensorrt_llm directly in docker ubuntu:24.04, which we will continue to investigate.

Currently, my runnable recipe is to build from source according to this guide; here are the commands:

# in tensorrt_llm root, build a docker from source
make -C docker release_build
# this will build docker.io/tensorrt_llm/release:latest with tensorrt_llm installed

Repeat your commands and they should run smoothly; tested on 2x H100 on the latest main branch at commit 82edd903.
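
Once the image is built, a typical way to enter it and rerun the repro (a sketch; the mounts are placeholders for your model and engine directories):

docker run --rm -it --gpus all --ipc host \
    -v /models:/models -v /engines:/engines \
    tensorrt_llm/release:latest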

Superjomn · Mar 27 '25

Thanks @aspctu , that did help avoid the "double free" error.

I am trying with the deepseek v3 model and running into another size error though (with the number of kv_cache tokens): RuntimeError: N must be a multiple of 128, (N=2112). Which config param needs to be adapted to get over this?


@snl-nvda this issue should be fixed in this PR, cc @chang-l for viz.

Superjomn · Mar 31 '25

Same double free error when running DeepSeek-V2 on a single H100 node with tp_size > 1.

EstherBear · Apr 05 '25

@EstherBear Can you try building docker from source as mentioned in this comment?

Superjomn · Apr 07 '25

I am getting the same issue using the latest tritonserver container nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3; it won't load any engines that I build with it. I am pretty much using this process: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md (create checkpoints -> build engine -> create model repository -> run launch_triton_server.py). This process has worked for me without any issues on 24.XX containers, so the problem seems new to at least 25.02 and 25.03.

roblen001 · Apr 08 '25

cc @Shixiaowei02 and @kaiyux for viz.

Superjomn · Apr 08 '25

I also see the double free() error message. I am running on GB200 using slurm and a custom trtllm installation within a sqsh file. Do you also have recipes for preventing this error on ARM64?

snl-nvda · Apr 08 '25

@Justin-12138 does Qwen2.5-32B-Instruct work now?

@Superjomn Hi, I still see the double free error when running DeepSeek-R1 on 2x H20 nodes with a make -C docker release_build image on 0.19.0.dev2025040800.

lishicheng1996 · Apr 11 '25

@lishicheng1996 We are investigating this and will post status updates here. @Shixiaowei02 for viz.

Superjomn · Apr 11 '25

@EstherBear Can you try building docker from source as mentioned in this comment?

It seems like this commit does not support deepseek-v2.

EstherBear · Apr 11 '25

[4] init.cc:720 NCCL WARN Duplicate GPU detected : rank 28 and rank 4 both on CUDA device *****

Thank you for helping to identify this issue. As shown in the error message, this error is caused by attempting to run different ranks on the same CUDA device. I am working on fixing this problem to provide a clearer error message. https://github.com/NVIDIA/TensorRT-LLM/pull/3525
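
Until that lands, one way to sanity-check the rank-to-GPU mapping before launching the real job is to print what each rank sees (a sketch using Open MPI's environment variables and the placeholder hosts from earlier in this thread):

mpirun -H x.x.x.x:8,x.x.x.x:8 -n 16 bash -c \
    'echo "host=$(hostname) rank=$OMPI_COMM_WORLD_RANK local_rank=$OMPI_COMM_WORLD_LOCAL_RANK cuda=$CUDA_VISIBLE_DEVICES"'
# each rank should report a unique (host, local_rank) pair; two ranks landing on the same
# host and GPU is the oversubscription that triggers the NCCL warning quoted above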

Shixiaowei02 · Apr 14 '25

run different ranks on the same CUDA device

Thanks for investigating! Is there any way to avoid this problem when running trtllm-bench or trtllm-serve?

lishicheng1996 · Apr 14 '25

I also hit this double free error, with version v0.19.0rc0.

bobbych94 · Apr 17 '25

@nanmi Could you set NCCL_DEBUG=INFO to obtain more debugging information before NCCL crashes and paste it here?
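
For example, exporting these before the launch captures the NCCL initialization logs (the second variable is optional; %h and %p are expanded by NCCL to the hostname and pid):

export NCCL_DEBUG=INFO                         # verbose NCCL logging to stdout
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log     # optional: write one log file per host/pid instead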

Shixiaowei02 · Apr 18 '25

@nanmi Could you set NCCL_DEBUG=INFO to obtain more debugging information before NCCL crashes and paste it here?

This is my running setup: I am using 8x H20 96G.

extra-llm-api-config-deepseek_h20.yml

pytorch_backend_config:
    use_cuda_graph: true
    cuda_graph_padding_enabled: true
    cuda_graph_batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
    - 384
    print_iter_log: false
    enable_overlap_scheduler: true
enable_attention_dp: false
trtllm-serve \
    /model/deepseek/DeepSeek-R1 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 4 \
    --max_num_tokens 1280 \
    --tp_size 8 \
    --pp_size 1 \
    --ep_size 4 \
    --kv_cache_free_gpu_memory_fraction 0.97 \
    --extra_llm_api_options extra-llm-api-config-deepseek_h20.yml
2025-04-18 07:46:18,739 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[04/18/2025-07:46:19] [TRT-LLM] [W] Overriding pytorch_backend_config 
[04/18/2025-07:46:19] [TRT-LLM] [I] Compute capability: (9, 0)
[04/18/2025-07:46:19] [TRT-LLM] [I] SM count: 78
[04/18/2025-07:46:19] [TRT-LLM] [I] SM clock: 1980 MHz
[04/18/2025-07:46:19] [TRT-LLM] [I] int4 TFLOPS: 0
[04/18/2025-07:46:19] [TRT-LLM] [I] int8 TFLOPS: 2530
[04/18/2025-07:46:19] [TRT-LLM] [I] fp8 TFLOPS: 2530
[04/18/2025-07:46:19] [TRT-LLM] [I] float16 TFLOPS: 1265
[04/18/2025-07:46:19] [TRT-LLM] [I] bfloat16 TFLOPS: 1265
[04/18/2025-07:46:19] [TRT-LLM] [I] float32 TFLOPS: 632
[04/18/2025-07:46:19] [TRT-LLM] [I] Total Memory: 95 GiB
[04/18/2025-07:46:19] [TRT-LLM] [I] Memory clock: 2619 MHz
[04/18/2025-07:46:19] [TRT-LLM] [I] Memory bus width: 6144
[04/18/2025-07:46:19] [TRT-LLM] [I] Memory bandwidth: 4022 GB/s
[04/18/2025-07:46:19] [TRT-LLM] [I] NVLink is active: True
[04/18/2025-07:46:19] [TRT-LLM] [I] NVLink version: 3
[04/18/2025-07:46:19] [TRT-LLM] [I] NVLink bandwidth: 300 GB/s
[04/18/2025-07:46:19] [TRT-LLM] [I] Set nccl_plugin to None.
[04/18/2025-07:46:19] [TRT-LLM] [I] start MpiSession with 8 workers
[04/18/2025-07:46:19] [TRT-LLM] [I] Found /model/deepseek/DeepSeek-R1/hf_quant_config.json, pre-quantized checkpoint is used.
[04/18/2025-07:46:19] [TRT-LLM] [I] Setting quant_algo=FP8_BLOCK_SCALES form HF quant config.
[04/18/2025-07:46:20] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 384], cuda_graph_max_batch_size=0, cuda_graph_padding_enabled=True, enable_overlap_scheduler=True, moe_max_num_tokens=None, attn_backend='TRTLLM', mixed_decoder=False, enable_trtllm_decoder=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=False, torch_compile_inductor_enabled=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
2025-04-18 07:46:37,738 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-04-18 07:46:37,834 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-04-18 07:46:37,881 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-04-18 07:46:37,881 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-04-18 07:46:37,912 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-04-18 07:46:37,912 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-04-18 07:46:37,912 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-04-18 07:46:37,914 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM] TensorRT-LLM version: 0.19.0rc0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[04/18/2025-07:46:44] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[04/18/2025-07:46:44] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[04/18/2025-07:46:46] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00001-of-000163.safetensors
[04/18/2025-07:46:48] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00002-of-000163.safetensors
[04/18/2025-07:46:50] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00003-of-000163.safetensors
[04/18/2025-07:46:52] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00004-of-000163.safetensors
[04/18/2025-07:46:53] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00005-of-000163.safetensors
[04/18/2025-07:46:55] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00006-of-000163.safetensors
[04/18/2025-07:46:57] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00007-of-000163.safetensors
[04/18/2025-07:46:59] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00008-of-000163.safetensors
[04/18/2025-07:47:01] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00009-of-000163.safetensors
[04/18/2025-07:47:03] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00010-of-000163.safetensors
[04/18/2025-07:47:05] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00011-of-000163.safetensors
[04/18/2025-07:47:07] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00012-of-000163.safetensors
[04/18/2025-07:47:08] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00013-of-000163.safetensors
[04/18/2025-07:47:10] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00014-of-000163.safetensors
[04/18/2025-07:47:12] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00015-of-000163.safetensors
[04/18/2025-07:47:14] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00016-of-000163.safetensors
[04/18/2025-07:47:15] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00017-of-000163.safetensors
[04/18/2025-07:47:17] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00018-of-000163.safetensors
[04/18/2025-07:47:19] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00019-of-000163.safetensors
[04/18/2025-07:47:21] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00020-of-000163.safetensors
[04/18/2025-07:47:23] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00021-of-000163.safetensors
[04/18/2025-07:47:25] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00022-of-000163.safetensors
[04/18/2025-07:47:27] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00023-of-000163.safetensors
[04/18/2025-07:47:29] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00024-of-000163.safetensors
[04/18/2025-07:47:31] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00025-of-000163.safetensors
[04/18/2025-07:47:33] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00026-of-000163.safetensors
[04/18/2025-07:47:35] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00027-of-000163.safetensors
[04/18/2025-07:47:37] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00028-of-000163.safetensors
[04/18/2025-07:47:39] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00029-of-000163.safetensors
[04/18/2025-07:47:41] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00030-of-000163.safetensors
[04/18/2025-07:47:43] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00031-of-000163.safetensors
[04/18/2025-07:47:44] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00032-of-000163.safetensors
[04/18/2025-07:47:47] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00033-of-000163.safetensors
[04/18/2025-07:47:49] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00034-of-000163.safetensors
[04/18/2025-07:47:49] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00035-of-000163.safetensors
[04/18/2025-07:47:51] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00036-of-000163.safetensors
[04/18/2025-07:47:53] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00037-of-000163.safetensors
[04/18/2025-07:47:55] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00038-of-000163.safetensors
[04/18/2025-07:47:57] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00039-of-000163.safetensors
[04/18/2025-07:47:59] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00040-of-000163.safetensors
[04/18/2025-07:48:01] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00041-of-000163.safetensors
[04/18/2025-07:48:03] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00042-of-000163.safetensors
[04/18/2025-07:48:05] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00043-of-000163.safetensors
[04/18/2025-07:48:07] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00044-of-000163.safetensors
[04/18/2025-07:48:09] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00045-of-000163.safetensors
[04/18/2025-07:48:11] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00046-of-000163.safetensors
[04/18/2025-07:48:13] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00047-of-000163.safetensors
[04/18/2025-07:48:15] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00048-of-000163.safetensors
[04/18/2025-07:48:17] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00049-of-000163.safetensors
[04/18/2025-07:48:19] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00050-of-000163.safetensors
[04/18/2025-07:48:21] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00051-of-000163.safetensors
[04/18/2025-07:48:23] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00052-of-000163.safetensors
[04/18/2025-07:48:25] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00053-of-000163.safetensors
[04/18/2025-07:48:27] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00054-of-000163.safetensors
[04/18/2025-07:48:29] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00055-of-000163.safetensors
[04/18/2025-07:48:31] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00056-of-000163.safetensors
[04/18/2025-07:48:32] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00057-of-000163.safetensors
[04/18/2025-07:48:34] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00058-of-000163.safetensors
[04/18/2025-07:48:36] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00059-of-000163.safetensors
[04/18/2025-07:48:38] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00060-of-000163.safetensors
[04/18/2025-07:48:40] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00061-of-000163.safetensors
[04/18/2025-07:48:42] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00062-of-000163.safetensors
[04/18/2025-07:48:43] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00063-of-000163.safetensors
[04/18/2025-07:48:45] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00064-of-000163.safetensors
[04/18/2025-07:48:47] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00065-of-000163.safetensors
[04/18/2025-07:48:49] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00066-of-000163.safetensors
[04/18/2025-07:48:51] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00067-of-000163.safetensors
[04/18/2025-07:48:53] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00068-of-000163.safetensors
[04/18/2025-07:48:55] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00069-of-000163.safetensors
[04/18/2025-07:48:57] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00070-of-000163.safetensors
[04/18/2025-07:48:59] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00071-of-000163.safetensors
[04/18/2025-07:49:01] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00072-of-000163.safetensors
[04/18/2025-07:49:03] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00073-of-000163.safetensors
[04/18/2025-07:49:05] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00074-of-000163.safetensors
[04/18/2025-07:49:07] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00075-of-000163.safetensors
[04/18/2025-07:49:09] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00076-of-000163.safetensors
[04/18/2025-07:49:11] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00077-of-000163.safetensors
[04/18/2025-07:49:13] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00078-of-000163.safetensors
[04/18/2025-07:49:14] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00079-of-000163.safetensors
[04/18/2025-07:49:16] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00080-of-000163.safetensors
[04/18/2025-07:49:18] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00081-of-000163.safetensors
[04/18/2025-07:49:20] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00082-of-000163.safetensors
[04/18/2025-07:49:22] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00083-of-000163.safetensors
[04/18/2025-07:49:24] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00084-of-000163.safetensors
[04/18/2025-07:49:26] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00085-of-000163.safetensors
[04/18/2025-07:49:28] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00086-of-000163.safetensors
[04/18/2025-07:49:30] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00087-of-000163.safetensors
[04/18/2025-07:49:32] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00088-of-000163.safetensors
[04/18/2025-07:49:34] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00089-of-000163.safetensors
[04/18/2025-07:49:36] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00090-of-000163.safetensors
[04/18/2025-07:49:38] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00091-of-000163.safetensors
[04/18/2025-07:49:39] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00092-of-000163.safetensors
[04/18/2025-07:49:41] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00093-of-000163.safetensors
[04/18/2025-07:49:44] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00094-of-000163.safetensors
[04/18/2025-07:49:46] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00095-of-000163.safetensors
[04/18/2025-07:49:48] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00096-of-000163.safetensors
[04/18/2025-07:49:49] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00097-of-000163.safetensors
[04/18/2025-07:49:51] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00098-of-000163.safetensors
[04/18/2025-07:49:54] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00099-of-000163.safetensors
[04/18/2025-07:49:56] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00100-of-000163.safetensors
[04/18/2025-07:49:56] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00101-of-000163.safetensors
[04/18/2025-07:49:58] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00102-of-000163.safetensors
[04/18/2025-07:50:00] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00103-of-000163.safetensors
[04/18/2025-07:50:02] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00104-of-000163.safetensors
[04/18/2025-07:50:04] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00105-of-000163.safetensors
[04/18/2025-07:50:06] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00106-of-000163.safetensors
[04/18/2025-07:50:08] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00107-of-000163.safetensors
[04/18/2025-07:50:11] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00108-of-000163.safetensors
[04/18/2025-07:50:13] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00109-of-000163.safetensors
[04/18/2025-07:50:15] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00110-of-000163.safetensors
[04/18/2025-07:50:16] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00111-of-000163.safetensors
[04/18/2025-07:50:18] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00112-of-000163.safetensors
[04/18/2025-07:50:20] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00113-of-000163.safetensors
[04/18/2025-07:50:22] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00114-of-000163.safetensors
[04/18/2025-07:50:24] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00115-of-000163.safetensors
[04/18/2025-07:50:26] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00116-of-000163.safetensors
[04/18/2025-07:50:28] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00117-of-000163.safetensors
[04/18/2025-07:50:30] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00118-of-000163.safetensors
[04/18/2025-07:50:32] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00119-of-000163.safetensors
[04/18/2025-07:50:35] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00120-of-000163.safetensors
[04/18/2025-07:50:37] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00121-of-000163.safetensors
[04/18/2025-07:50:39] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00122-of-000163.safetensors
[04/18/2025-07:50:39] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00123-of-000163.safetensors
[04/18/2025-07:50:41] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00124-of-000163.safetensors
[04/18/2025-07:50:43] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00125-of-000163.safetensors
[04/18/2025-07:50:45] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00126-of-000163.safetensors
[04/18/2025-07:50:47] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00127-of-000163.safetensors
[04/18/2025-07:50:49] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00128-of-000163.safetensors
[04/18/2025-07:50:51] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00129-of-000163.safetensors
[04/18/2025-07:50:53] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00130-of-000163.safetensors
[04/18/2025-07:50:55] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00131-of-000163.safetensors
[04/18/2025-07:50:57] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00132-of-000163.safetensors
[04/18/2025-07:50:59] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00133-of-000163.safetensors
[04/18/2025-07:51:01] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00134-of-000163.safetensors
[04/18/2025-07:51:03] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00135-of-000163.safetensors
[04/18/2025-07:51:05] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00136-of-000163.safetensors
[04/18/2025-07:51:07] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00137-of-000163.safetensors
[04/18/2025-07:51:09] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00138-of-000163.safetensors
[04/18/2025-07:51:11] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00139-of-000163.safetensors
[04/18/2025-07:51:13] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00140-of-000163.safetensors
[04/18/2025-07:51:15] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00141-of-000163.safetensors
[04/18/2025-07:51:17] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00142-of-000163.safetensors
[04/18/2025-07:51:19] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00143-of-000163.safetensors
[04/18/2025-07:51:21] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00144-of-000163.safetensors
[04/18/2025-07:51:23] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00145-of-000163.safetensors
[04/18/2025-07:51:25] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00146-of-000163.safetensors
[04/18/2025-07:51:27] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00147-of-000163.safetensors
[04/18/2025-07:51:29] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00148-of-000163.safetensors
[04/18/2025-07:51:31] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00149-of-000163.safetensors
[04/18/2025-07:51:34] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00150-of-000163.safetensors
[04/18/2025-07:51:36] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00151-of-000163.safetensors
[04/18/2025-07:51:38] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00152-of-000163.safetensors
[04/18/2025-07:51:40] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00153-of-000163.safetensors
[04/18/2025-07:51:42] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00154-of-000163.safetensors
[04/18/2025-07:51:44] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00155-of-000163.safetensors
[04/18/2025-07:51:46] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00156-of-000163.safetensors
[04/18/2025-07:51:48] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00157-of-000163.safetensors
[04/18/2025-07:51:50] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00158-of-000163.safetensors
[04/18/2025-07:51:52] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00159-of-000163.safetensors
[04/18/2025-07:51:54] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00160-of-000163.safetensors
[04/18/2025-07:51:56] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00161-of-000163.safetensors
[04/18/2025-07:51:57] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00162-of-000163.safetensors
[04/18/2025-07:51:58] [TRT-LLM] [I] Loading /model/deepseek/DeepSeek-R1/model-00163-of-000163.safetensors
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Model init total -- 591.96s
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Model init total -- 591.96s
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Model init total -- 591.99s
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Model init total -- 591.99s
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Model init total -- 591.99s
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Model init total -- 591.99s
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Loading weights: 100%|██████████| 1707/1707 [04:37<00:00,  6.14it/s]
Model init total -- 592.00s
Model init total -- 592.01s
[04/18/2025-07:56:43] [TRT-LLM] [I] max_seq_len is not specified, using inferred value 163840
[04/18/2025-07:56:43] [TRT-LLM] [I] Change tokens_per_block to: 64 for using FlashMLA
[04/18/2025-07:56:43] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.9700000286102295 and 136335, respectively). The smaller value will be used.
[04/18/2025-07:56:43] [TRT-LLM] [W] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. They are reduced to 136384
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
[04/18/2025-07:56:43] [TRT-LLM] [I] max_seq_len=136384, max_num_requests=4, max_num_tokens=1280
[04/18/2025-07:56:43] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
[04/18/2025-07:56:43] [TRT-LLM] [I] Run autotuning warmup for batch size=1
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 8.93 GiB for max tokens in paged KV cache (136384).
2025-04-18 07:56:43,946 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,947 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,947 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,947 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,957 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,958 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,958 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,959 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-04-18 07:56:43,964 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-04-18 07:56:44,016 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-04-18 07:56:44,063 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-04-18 07:56:44,115 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-04-18 07:56:44,176 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-04-18 07:56:44,228 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-04-18 07:56:44,281 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-04-18 07:56:44,334 - INFO - flashinfer.jit: Finished loading JIT ops: norm
free(): double free detected in tcache 2
[A15-R45-I1-133-CB29022:10730] *** Process received signal ***
[A15-R45-I1-133-CB29022:10730] Signal: Aborted (6)
[A15-R45-I1-133-CB29022:10730] Signal code:  (-6)
[A15-R45-I1-133-CB29022:10730] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f7f081b6330]
[A15-R45-I1-133-CB29022:10730] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7f7f0820fb2c]
[A15-R45-I1-133-CB29022:10730] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7f7f081b627e]
[A15-R45-I1-133-CB29022:10730] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7f7f081998ff]
[A15-R45-I1-133-CB29022:10730] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7f7f0819a7b6]
[A15-R45-I1-133-CB29022:10730] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa8ff5)[0x7f7f08219ff5]
[A15-R45-I1-133-CB29022:10730] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xab55f)[0x7f7f0821c55f]
[A15-R45-I1-133-CB29022:10730] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_free+0x7e)[0x7f7f0821edae]
[A15-R45-I1-133-CB29022:10730] [ 8] /lib/x86_64-linux-gnu/libnccl.so.2(+0x6437e)[0x7f7c8a46337e]
[A15-R45-I1-133-CB29022:10730] [ 9] /lib/x86_64-linux-gnu/libnccl.so.2(pncclCommInitRank+0x153)[0x7f7c8a465513]
[A15-R45-I1-133-CB29022:10730] [10] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_Z7getCommRKSt3setIiSt4lessIiESaIiEE+0x5f9)[0x7f7a7e6d1af9]
[A15-R45-I1-133-CB29022:10730] [11] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(_ZN9torch_ext9allreduceEN2at6TensorESt8optionalIS1_EN3c108ArrayRefIS1_EENS4_4ListIlEEllldbbb+0x404)[0x7f73113a9874]
[A15-R45-I1-133-CB29022:10730] [12] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libth_common.so(_ZN3c104impl31make_boxed_from_unboxed_functorINS0_6detail31WrapFunctionIntoRuntimeFunctor_IPFSt6vectorIN2at6TensorESaIS6_EES6_St8optionalIS6_ENS_8ArrayRefIS6_EENS_4ListIlEEllldbbbES8_NS_4guts8typelist8typelistIJS6_SA_SC_SE_llldbbbEEEEELb0EE4callEPNS_14OperatorKernelERKNS_14OperatorHandleENS_14DispatchKeySetEPS4_INS_6IValueESaIST_EE+0x79c)[0x7f73113adc4c]
[A15-R45-I1-133-CB29022:10730] [13] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so(+0x55a6277)[0x7f7b6687a277]
[A15-R45-I1-133-CB29022:10730] [14] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(_ZN5torch3jit24invokeOperatorFromPythonERKSt6vectorISt10shared_ptrINS0_8OperatorEESaIS4_EERKN8pybind114argsERKNS9_6kwargsESt8optionalIN3c1011DispatchKeyEE+0xef)[0x7f7b6eb4741f]
[A15-R45-I1-133-CB29022:10730] [15] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(_ZN5torch3jit37_get_operation_for_overload_or_packetERKSt6vectorISt10shared_ptrINS0_8OperatorEESaIS4_EEN3c106SymbolERKN8pybind114argsERKNSB_6kwargsEbSt8optionalINS9_11DispatchKeyEE+0x229)[0x7f7b6eb476c9]
[A15-R45-I1-133-CB29022:10730] [16] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x8b5fd1)[0x7f7b6ea57fd1]
[A15-R45-I1-133-CB29022:10730] [17] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x42238d)[0x7f7b6e5c438d]
[A15-R45-I1-133-CB29022:10730] [18] /usr/bin/python[0x5820ff]
[A15-R45-I1-133-CB29022:10730] [19] /usr/bin/python(PyObject_Call+0x9c)[0x54b07c]
[A15-R45-I1-133-CB29022:10730] [20] /usr/bin/python(_PyEval_EvalFrameDefault+0x4c3a)[0x5db68a]
[A15-R45-I1-133-CB29022:10730] [21] /usr/bin/python(_PyObject_Call_Prepend+0xc2)[0x54a712]
[A15-R45-I1-133-CB29022:10730] [22] /usr/bin/python[0x5a3698]
[A15-R45-I1-133-CB29022:10730] [23] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x548ec5]
[A15-R45-I1-133-CB29022:10730] [24] /usr/bin/python(_PyEval_EvalFrameDefault+0xa89)[0x5d74d9]
[A15-R45-I1-133-CB29022:10730] [25] /usr/bin/python[0x54cae4]
[A15-R45-I1-133-CB29022:10730] [26] /usr/bin/python(PyObject_Call+0x119)[0x54b0f9]
[A15-R45-I1-133-CB29022:10730] [27] /usr/bin/python(_PyEval_EvalFrameDefault+0x4c3a)[0x5db68a]
[A15-R45-I1-133-CB29022:10730] [28] /usr/bin/python[0x54cae4]
[A15-R45-I1-133-CB29022:10730] [29] /usr/bin/python(PyObject_Call+0x119)[0x54b0f9]
[A15-R45-I1-133-CB29022:10730] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

bobbych94 · Apr 18 '25