"Trying to remove block n by 0 that is not in hash map" spam in release 0.17
System Info
- 8x RTX 4090 on a dual-EPYC server running Debian testing
- CUDA toolkit version 12.8, driver version 570.86
- Release container built from the release 0.17 tag
Who can help?
Maybe @kaiyux or @byshiue
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
1. Convert a checkpoint:
python examples/llama/convert_checkpoint.py --model_dir /raid/models/raw/Mistral-Small-24B-Instruct-2501/ --tp_size 4 --dtype bfloat16 --output_dir /raid/models/checkpoint/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4 --workers 4
2. Build an engine:
trtllm-build --checkpoint_dir /raid/models/checkpoint/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4/ --output_dir /raid/models/engine/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4 --max_batch_size 4 --workers 4 --gemm_plugin bfloat16 --use_paged_context_fmha enable
3. Run the engine:
NCCL_P2P_LEVEL=SYS mpirun -n 4 --allow-run-as-root python examples/run.py --max_output_len 256 --engine_dir /raid/models/engine/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4/ --tokenizer_dir /raid/models/raw/Mistral-Small-24B-Instruct-2501/
Expected behavior
The engine runs normally.
Actual behavior
The engine runs, but the log is spammed with warnings:
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.27 GiB for max tokens in paged KV cache (243072).
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808323860168457 sec
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808316230773926 sec
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808536767959595 sec
[01/31/2025-05:22:09] [TRT-LLM] [I] Load engine takes: 4.808875560760498 sec
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 2 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 3 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 4 by 0 that is not in hash map
Input [Text 0]: "<s>Born in north-east France, Soyer trained as a"
Additional notes
This is a regression; it did not happen in 0.16 or in the previous main snapshot.
When using lookahead_decoding, the log is spammed so heavily that the logging overhead outweighs the gain from speculative decoding.
This also occurs with trtllm-serve, not just run.py, which is much worse!
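Until a fixed build is available, one possible stopgap is to raise the TensorRT-LLM logger threshold so that WARNING-level messages are suppressed. This is only a sketch, it hides all other warnings as well, and it assumes the TLLM_LOG_LEVEL environment variable read by the runtime logger:
TLLM_LOG_LEVEL=ERROR NCCL_P2P_LEVEL=SYS mpirun -n 4 --allow-run-as-root python examples/run.py --max_output_len 256 --engine_dir /raid/models/engine/Mistral-Small-24B-Instruct-2501_sm89_bf16_tp4/ --tokenizer_dir /raid/models/raw/Mistral-Small-24B-Instruct-2501/
The same variable should also apply to trtllm-serve, assuming it is picked up by the same runtime logger.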
Model config
{
"version": "0.17.0.post1",
"pretrained_config": {
"mlp_bias": false,
"attn_bias": false,
"rotary_base": 500000.0,
"rotary_scaling": {
"factor": 16.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"residual_mlp": false,
"disable_weight_only_quant_plugin": false,
"moe": {
"num_experts": 0,
"shared_expert_intermediate_size": 0,
"top_k": 0,
"normalization_mode": null,
"sparse_mixer_epsilon": 0.01,
"tp_mode": 0,
"device_limited_n_group": 0,
"device_limited_topk_group": 0,
"device_limited_routed_scaling_factor": 1.0
},
"remove_duplicated_kv_heads": false,
"fc_after_embed": false,
"use_input_layernorm_in_first_layer": true,
"use_last_layernorm": true,
"layer_idx_offset": 0,
"embedding_multiplier": 1.0,
"attention_multiplier": 1.0,
"residual_multiplier": 1.0,
"output_multiplier_scale": 1.0,
"architecture": "LlamaForCausalLM",
"dtype": "bfloat16",
"vocab_size": 128256,
"hidden_size": 8192,
"num_hidden_layers": 92,
"num_attention_heads": 64,
"hidden_act": "silu",
"logits_dtype": "float32",
"norm_epsilon": 1e-05,
"runtime_defaults": null,
"position_embedding_type": "rope_gpt_neox",
"num_key_value_heads": 8,
"intermediate_size": 32768,
"max_position_embeddings": 131072,
"mapping": {
"world_size": 4,
"gpus_per_node": 8,
"cp_size": 1,
"tp_size": 4,
"pp_size": 1,
"moe_tp_size": 4,
"moe_ep_size": 1,
"auto_parallel": false
},
"quantization": {
"quant_algo": null,
"kv_cache_quant_algo": null,
"group_size": 128,
"smoothquant_val": 0.5,
"clamp_val": null,
"use_meta_recipe": false,
"has_zero_point": false,
"pre_quant_scale": false,
"exclude_modules": null
},
"use_parallel_embedding": false,
"embedding_sharding_dim": 0,
"head_size": 128,
"qk_layernorm": false,
"rotary_embedding_dim": 128,
"tie_word_embeddings": false
},
"build_config": {
"max_input_len": 130048,
"max_seq_len": 131072,
"opt_batch_size": 8,
"max_batch_size": 256,
"max_beam_width": 1,
"max_num_tokens": 4096,
"opt_num_tokens": 256,
"max_prompt_embedding_table_size": 0,
"kv_cache_type": "PAGED",
"gather_context_logits": false,
"gather_generation_logits": false,
"strongly_typed": true,
"force_num_profiles": null,
"profiling_verbosity": "layer_names_only",
"enable_debug_output": false,
"max_draft_len": 0,
"speculative_decoding_mode": 1,
"use_refit": false,
"input_timing_cache": null,
"output_timing_cache": "model.cache",
"lora_config": {
"lora_dir": [],
"lora_ckpt_source": "hf",
"max_lora_rank": 64,
"lora_target_modules": [],
"trtllm_modules_to_hf_modules": {}
},
"auto_parallel_config": {
"world_size": 1,
"gpus_per_node": 8,
"cluster_key": "H100-PCIe",
"cluster_info": null,
"sharding_cost_model": "alpha_beta",
"comm_cost_model": "alpha_beta",
"enable_pipeline_parallelism": false,
"enable_shard_unbalanced_shape": false,
"enable_shard_dynamic_shape": false,
"enable_reduce_scatter": true,
"builder_flags": null,
"debug_mode": false,
"infer_shape": true,
"validation_mode": false,
"same_buffer_io": {
"past_key_value_(\\d+)": "present_key_value_\\1"
},
"same_spec_io": {},
"sharded_io_allowlist": [
"past_key_value_\\d+",
"present_key_value_\\d*"
],
"fill_weights": false,
"parallel_config_cache": null,
"profile_cache": null,
"dump_path": null,
"debug_outputs": []
},
"weight_sparsity": false,
"weight_streaming": false,
"plugin_config": {
"dtype": "bfloat16",
"bert_attention_plugin": "auto",
"gpt_attention_plugin": "auto",
"gemm_plugin": "bfloat16",
"explicitly_disable_gemm_plugin": false,
"gemm_swiglu_plugin": null,
"fp8_rowwise_gemm_plugin": null,
"qserve_gemm_plugin": null,
"identity_plugin": null,
"nccl_plugin": "bfloat16",
"lora_plugin": null,
"weight_only_groupwise_quant_matmul_plugin": null,
"weight_only_quant_matmul_plugin": null,
"smooth_quant_plugins": true,
"smooth_quant_gemm_plugin": null,
"layernorm_quantization_plugin": null,
"rmsnorm_quantization_plugin": null,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"moe_plugin": "auto",
"mamba_conv1d_plugin": "auto",
"low_latency_gemm_plugin": null,
"low_latency_gemm_swiglu_plugin": null,
"context_fmha": true,
"bert_context_fmha_fp32_acc": false,
"paged_kv_cache": true,
"remove_input_padding": true,
"reduce_fusion": false,
"user_buffer": false,
"tokens_per_block": 64,
"use_paged_context_fmha": true,
"use_fp8_context_fmha": false,
"multiple_profiles": false,
"paged_state": false,
"streamingllm": false,
"manage_weights": false,
"use_fused_mlp": true,
"pp_reduce_scatter": false
},
"use_strip_plan": false,
"max_encoder_input_len": 1024,
"monitor_memory": false,
"use_mrope": false
}
}
Run benchmark
CUDA_VISIBLE_DEVICES=4,5,6,7 mpirun -n 4 ./gptManagerBenchmark --engine_dir /data/models/4-gpu --dataset /datasets/cnn_dailymail.json --request_rate 5
Spam
....
[TensorRT-LLM][WARNING] Trying to remove block 1558 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1598 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1628 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1647 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1657 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1230 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1460 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1558 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1598 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1628 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1647 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1657 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1230 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1460 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1558 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1598 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1628 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1647 by 0 that is not in hash map
[TensorRT-LLM][WARNING] Trying to remove block 1657 by 0 that is not in hash map
....
This warning can be ignored; it will be fixed in the next release.
You can assign me. A fix is on the way.
Got the same warnings when serving with the Triton backend.
This should be fixed now. Please verify and confirm if you can.
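Before re-testing, it may help to confirm which TensorRT-LLM build is actually loaded inside the container; a minimal check (not from the thread):
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"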
I had this issue with my Qwen2.5 14B model in 0.17. Does that mean I need to upgrade to a newer version?
Yes, this has been fixed in releases > 0.17.
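For reference, a hedged sketch of the wheel-based upgrade (the extra index URL is taken from the TensorRT-LLM install docs; container users would instead rebuild from a newer release tag):
pip3 install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com
Note that engines built with 0.17 generally need to be rebuilt with the matching trtllm-build after upgrading.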