"Trying to remove block n by 0 that is not in hash map" spam in release 0.17
System Info
- 8x RTX 4090 on a dual-EPYC server running Debian testing
- CUDA toolkit version 12.8, driver version 570.86
- Release container built from the release 0.17 tag
Who can help?
Maybe @kaiyux or @byshiue
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
1. Convert a checkpoint:
python examples/llama/convert_checkpoint.py --model_dir /raid/models/raw/Mistral-Small-24B-Instruct-2501/ --tp_size 4 --dtype bfloat16 --output_dir /raid/models/checkpoint/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4 --workers 4
2. Build an engine:
trtllm-build --checkpoint_dir /raid/models/checkpoint/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4/ --output_dir /raid/models/engine/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4 --max_batch_size 4 --workers 4 --gemm_plugin bfloat16 --use_paged_context_fmha enable
3. Run the engine:
NCCL_P2P_LEVEL=SYS mpirun -n 4 --allow-run-as-root python examples/run.py --max_output_len 256 --engine_dir /raid/models/engine/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4/ --tokenizer_dir /raid/models/raw/Mistral-Small-24B-Instruct-2501/
Expected behavior
The engine runs normally.
Actual behavior
The engine runs, but the log is spammed with warnings:
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.27 GiB for max tokens in paged KV cache (243072).
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808323860168457 sec
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808316230773926 sec
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808536767959595 sec
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808875560760498 sec
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
Input [Text 0]: "<s>Born in north-east France, Soyer trained as a"
Additional notes
This is a regression; it did not happen in 0.16 or in the previous main snapshot.
When using lookahead_decoding, the log is spammed so heavily that the logging overhead outweighs the gain from speculative decoding.
This also occurs with trtllm-serve, not just run.py, which is much worse!
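Until a fixed build is available, one possible stopgap is to raise the TensorRT-LLM logger threshold so that WARNING-level messages are suppressed. This is only a sketch, it hides all other warnings as well, and it assumes the TLLM_LOG_LEVEL environment variable read by the runtime logger:
TLLM_LOG_LEVEL=ERROR NCCL_P2P_LEVEL=SYS mpirun -n 4 --allow-run-as-root python examples/run.py --max_output_len 256 --engine_dir /raid/models/engine/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4/ --tokenizer_dir /raid/models/raw/Mistral-Small-24B-Instruct-2501/
The same variable should also apply to trtllm-serve, assuming it is picked up by the same runtime logger.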
Model config
{
"version": "0.17.0.post1",
"pretrained_config": {
"mlp_bias": false,
"attn_bias": false,
"rotary_base": 500000.0,
"rotary_scaling": {
"factor": 16.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"residual_mlp": false,
"disable_weight_only_quant_plugin": false,
"moe": {
"num_experts": 0,
"shared_expert_intermediate_size": 0,
"top_k": 0,
"normalization_mode": null,
"sparse_mixer_epsilon": 0.01,
"tp_mode": 0,
"device_limited_n_group": 0,
"device_limited_topk_group": 0,
"device_limited_routed_scaling_factor": 1.0
},
"remove_duplicated_kv_heads": false,
"fc_after_embed": false,
"use_input_layernorm_in_first_layer": true,
"use_last_layernorm": true,
"layer_idx_offset": 0,
"embedding_multiplier": 1.0,
"attention_multiplier": 1.0,
"residual_multiplier": 1.0,
"output_multiplier_scale": 1.0,
"architecture": "LlamaForCausalLM",
"dtype": "bfloat16",
"vocab_size": 128256,
"hidden_size": 8192,
"num_hidden_layers": 92,
"num_attention_heads": 64,
"hidden_act": "silu",
"logits_dtype": "float32",
"norm_epsilon": 1e-05,
"runtime_defaults": null,
"position_embedding_type": "rope_gpt_neox",
"num_key_value_heads": 8,
"intermediate_size": 32768,
"max_position_embeddings": 131072,
"mapping": {
"world_size": 4,
"gpus_per_node": 8,
"cp_size": 1,
"tp_size": 4,
"pp_size": 1,
"moe_tp_size": 4,
"moe_ep_size": 1,
"auto_parallel": false
},
"quantization": {
"quant_algo": null,
"kv_cache_quant_algo": null,
"group_size": 128,
"smoothquant_val": 0.5,
"clamp_val": null,
"use_meta_recipe": false,
"has_zero_point": false,
"pre_quant_scale": false,
"exclude_modules": null
},
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"head_size": 128,
"qk_layernorm": false,
"rotary_embedding_dim": 128,
"tie_word_embeddings": false
},
"build_config": {
"max_input_len": 130048,
"max_seq_len": 131072,
"opt_batch_size": 8,
"max_batch_size": 256,
"max_beam_width": 1,
"max_num_tokens": 4096,
"opt_num_tokens": 256,
"max_prompt_embedding_table_size": 0,
"kv_cache_type": "PAGED",
"gather_context_logits": false,
"gather_generation_logits": false,
"strongly_typed": true,
"force_num_profiles": null,
"profiling_verbosity": "layer_names_only",
"enable_debug_output": false,
"max_draft_len": 0,
"speculative_decoding_mode": 1,
"use_refit": false,
"input_timing_cache": null,
"output_timing_cache": "model.cache",
"lora_config": {
"lora_dir": [],
"lora_ckpt_source": "hf",
"max_lora_rank": 64,
"lora_target_modules": [],
"trtllm_modules_to_hf_modules": {}
},
"auto_parallel_config": {
"world_size": 1,
"gpus_per_node": 8,
"cluster_key": "H100-PCIe",
"cluster_info": null,
"sharding_cost_model": "alpha_beta",
"comm_cost_model": "alpha_beta",
"enable_pipeline_parallelism": false,
"enable_shard_unbalanced_shape": false,
"enable_shard_dynamic_shape": false,
"enable_reduce_scatter": true,
"builder_flags": null,
"debug_mode": false,
"infer_shape": true,
"validation_mode": false,
"same_buffer_io": {
"past_key_value_(\\d+)": "present_key_value_\\1"
},
"same_spec_io": {},
"sharded_io_allowlist": [
"past_key_value_\\d+",
"present_key_value_\\d*"
],
"fill_weights": false,
"parallel_config_cache": null,
"profile_cache": null,
"dump_path": null,
"debug_outputs": []
},
"weight_sparsity": false,
"weight_streaming": false,
"plugin_config": {
"dtype": "bfloat16",
"bert_attention_plugin": "auto",
"gpt_attention_plugin": "auto",
"gemm_plugin": "bfloat16",
"explicitly_disable_gemm_plugin": false,
"gemm_swiglu_plugin": null,
"fp8_rowwise_gemm_plugin": null,
"qserve_gemm_plugin": null,
"identity_plugin": null,
"nccl_plugin": "bfloat16",
"lora_plugin": null,
"weight_only_groupwise_quant_matmul_plugin": null,
"weight_only_quant_matmul_plugin": null,
"smooth_quant_plugins": true,
"smooth_quant_gemm_plugin": null,
"layernorm_quantization_plugin": null,
"rmsnorm_quantization_plugin": null,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"moe_plugin": "auto",
"mamba_conv1d_plugin": "auto",
"low_latency_gemm_plugin": null,
"low_latency_gemm_swiglu_plugin": null,
"context_fmha": true,
"bert_context_fmha_fp32_acc": false,
"paged_kv_cache": true,
"remove_input_padding": true,
"reduce_fusion": false,
"user_buffer": false,
"tokens_per_block": 64,
"use_paged_context_fmha": true,
"use_fp8_context_fmha": false,
"multiple_profiles": false,
"paged_state": false,
"streamingllm": false,
"manage_weights": false,
"use_fused_mlp": true,
"pp_reduce_scatter": false
},
"use_strip_plan": false,
"max_encoder_input_len": 1024,
"monitor_memory": false,
"use_mrope": false
}
}
Run benchmark
CUDA_VISIBLE_DEVICES=4,5,6,7 mpirun -n 4 ./gptManagerBenchmark --engine_dir /data/models/4-gpu --dataset /datasets/cnn_dailymail.json --request_rate 5
Spam
....
[TensorRT-LLM][WARNING] Trying to remove block 1558 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1598 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1628 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1647 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1657 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1230 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1460 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1558 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1598 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1628 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1647 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1657 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1230 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1460 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1558 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1598 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1628 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1647 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1657 by 0 that is not in hash map
....
This warning can be ignored; it will be fixed in the next release.
You can assign me. A fix is on the way.
Got the same warnings when serving with the Triton backend.
This should be fixed now. Please verify and confirm if you can.
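Before re-testing, it may help to confirm which TensorRT-LLM build is actually loaded inside the container; a minimal check (not from the thread):
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"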
I had this issue with my Qwen2.5 14B model in 0.17. Does that mean I need to upgrade to a newer version?
Yes, this has been fixed in releases > 0.17.
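For reference, a hedged sketch of the wheel-based upgrade (the extra index URL is taken from the TensorRT-LLM install docs; container users would instead rebuild from a newer release tag):
pip3 install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com
Note that engines built with 0.17 generally need to be rebuilt with the matching trtllm-build after upgrading.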