IRCOT reproduction fails
The input text length is greater than the maximum length (8376 > 8192) and has been truncated!
The input text length is greater than the maximum length (8607 > 8192) and has been truncated!
The input text length is greater than the maximum length (8295 > 8192) and has been truncated!
The input text length is greater than the maximum length (8733 > 8192) and has been truncated!
The input text length is greater than the maximum length (8592 > 8192) and has been truncated!
The input text length is greater than the maximum length (9489 > 8192) and has been truncated!
The input text length is greater than the maximum length (8283 > 8192) and has been truncated!
The input text length is greater than the maximum length (8330 > 8192) and has been truncated!
The input text length is greater than the maximum length (9055 > 8192) and has been truncated!
Traceback (most recent call last):
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 650, in <module>
func(args)
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 456, in ircot
result = pipeline.run(test_data)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 1040, in run
self.run_batch(dataset)
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 986, in run_batch
new_thoughts_batch = self.generator.generate(input_prompts, stop=['.', '\n'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/generator/generator.py", line 258, in generate
outputs = self.model.generate(input_list, sampling_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/
ValueError: The decoder prompt (length 8192) is longer than the maximum model length of 8192. Make sure that `max_model_len` is no smaller than the number of text tokens.
Changing max_len to 1024 does not work either.
I tested this and the vllm framework does indeed raise this error; you can try switching the framework to hf.
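If it helps, here is a minimal sketch of that switch as a FlashRAG config dict. The `framework` key comes from the suggestion above; the config file name and the other keys are placeholders and may differ across FlashRAG versions.

```python
from flashrag.config import Config

# Hypothetical sketch: run the generator with the hf backend instead of vllm.
config_dict = {
    "framework": "hf",  # instead of "vllm"
    "gpu_id": "0",      # placeholder, same meaning as the --gpu_id CLI flag
}
config = Config(config_file_path="my_config.yaml", config_dict=config_dict)
```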
After further testing, we found the problem was introduced by a transformers version update; it has been fixed.
After updating, the error still occurs:
(flashrag) (base) root@219408eb5eb7:~/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods# python run_exp.py --method_name 'ircot' --split 'dev' --dataset_name 'hotpotqa' --gpu_id '0'
Loading dataset from: /root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/FlashRAG_Dataset/hotpotqa...
Loading dev dataset from: /root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/FlashRAG_Dataset/hotpotqa/dev.jsonl...
/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct
INFO 06-13 11:23:07 [__init__.py:239] Automatically detected platform cuda.
INFO 06-13 11:23:21 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 06-13 11:23:21 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-13 11:23:31 [__init__.py:239] Automatically detected platform cuda.
INFO 06-13 11:23:35 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 06-13 11:23:35 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9cab72d710>
[rank0]:[W613 11:23:36.675951062 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 06-13 11:23:36 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-13 11:23:36 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 06-13 11:23:36 [gpu_model_runner.py:1276] Starting to load model /root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct...
WARNING 06-13 11:23:36 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.33s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:08, 4.31s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:13<00:05, 5.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:19<00:00, 5.38s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:19<00:00, 4.89s/it]
INFO 06-13 11:23:56 [loader.py:458] Loading weights took 19.68 seconds
INFO 06-13 11:23:56 [gpu_model_runner.py:1291] Model loading took 14.9596 GiB and 19.970492 seconds
INFO 06-13 11:24:19 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f3343ce5b8/rank_0_0 for vLLM's torch.compile
INFO 06-13 11:24:19 [backends.py:426] Dynamo bytecode transform time: 22.81 s
INFO 06-13 11:24:20 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 06-13 11:24:33 [monitor.py:33] torch.compile takes 22.81 s in total
INFO 06-13 11:24:34 [kv_cache_utils.py:634] GPU KV cache size: 414,944 tokens
INFO 06-13 11:24:34 [kv_cache_utils.py:637] Maximum concurrency for 4,096 tokens per request: 101.30x
INFO 06-13 11:25:00 [gpu_model_runner.py:1626] Graph capturing finished in 26 secs, took 0.52 GiB
INFO 06-13 11:25:00 [core.py:163] init engine (profile, create kv cache, warmup model) took 63.73 seconds
INFO 06-13 11:25:00 [core_client.py:435] Core engine process 0 ready.
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [01:02<00:00, 2.25s/it]
Encoding process: 0%| | 0/2 [00:00<?, ?it/s]Use `query: ` as retreival instruction
Encoding process: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.76s/it]
Processed prompts: 100%|██████████████████████████████████████████████████| 500/500 [00:27<00:00, 18.07it/s, est. speed input: 22076.92 toks/s, output: 342.04 toks/s]
Encoding process: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8.29it/s]
Processed prompts: 100%|██████████████████████████████████████████████████| 499/499 [00:42<00:00, 11.73it/s, est. speed input: 19748.72 toks/s, output: 146.20 toks/s]
Encoding process: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 12.03it/s]
The input text length is greater than the maximum length (4503 > 4096) and has been truncated!
The input text length is greater than the maximum length (5569 > 4096) and has been truncated!
The input text length is greater than the maximum length (4434 > 4096) and has been truncated!
Traceback (most recent call last):
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 652, in <module>
func(args)
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 458, in ircot
result = pipeline.run(test_data)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 1040, in run
self.run_batch(dataset)
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 986, in run_batch
new_thoughts_batch = self.generator.generate(input_prompts, stop=['.', '\n'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/generator/generator.py", line 258, in generate
outputs = self.model.generate(input_list, sampling_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/utils.py", line 1134, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 462, in generate
self._validate_and_add_requests(
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1342, in _validate_and_add_requests
self._add_request(
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1360, in _add_request
self.llm_engine.add_request(
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 186, in add_request
request = self.processor.process_inputs(request_id, prompt, params,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/processor.py", line 236, in process_inputs
self._validate_model_inputs(processed_inputs, lora_request)
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/processor.py", line 328, in _validate_model_inputs
self._validate_model_input(decoder_inputs,
File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/processor.py", line 377, in _validate_model_input
raise ValueError(
ValueError: The decoder prompt (length 4096) is longer than the maximum model length of 4096. Make sure that `max_model_len` is no smaller than the number of text tokens.
[rank0]:[W613 11:37:52.204596080 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
For this bug we have also updated the toolkit code; please pull our latest code.
I have re-pulled the latest code; the same error as above still occurs.
I'll add you on WeChat through the group chat to look at the details.
https://github.com/vllm-project/vllm/pull/20750 should have resolved this issue, but due to server constraints I cannot upgrade vllm to a newer version without breaking many other Python libraries. How can I resolve this issue on 0.8.5.post1?
We have now fixed this issue, which has nothing to do with the vllm version. It was actually caused by truncation.
How did this bug come about?
Take BAAI/bge-base-en as an example:
In sentence_bert_config.json, do_lower_case is set to true.
In tokenizer_config.json, the special_tokens are uppercase, e.g. [UNK].
This creates a conflict. For example, take an input of "[UNK]" * 10
and run it through the do_lower_case logic, which is applied here:
vllm:
https://github.com/vllm-project/vllm/blob/7ba34b1241ada58f8212f350a8b17382cb412cf2/vllm/inputs/preprocess.py#L215-L216
sentence-transformers:
https://github.com/UKPLab/sentence-transformers/blob/d5c8f5181daa468c63041fde18c659c4e4267e77/sentence_transformers/models/Transformer.py#L490-L492
The special_tokens get lowercased along with the rest of the text, but the tokenizer only recognizes the uppercase special_tokens, so they no longer take effect.
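A minimal sketch of the mismatch (assuming BAAI/bge-base-en can be downloaded; the exact word pieces you get may vary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en")
text = "[UNK] " * 10

# Uppercase marker: recognized as the registered special token,
# so the output is ten [UNK] tokens.
print(tokenizer.tokenize(text))

# Lowercased input (what the do_lower_case logic produces): "[unk]" no
# longer matches the registered special token, so it is split into
# ordinary word pieces instead.
print(tokenizer.tokenize(text.lower()))
```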
When are special_tokens used?
Normally you rarely feed special_tokens in directly, but the truncation path produces [UNK]:
```python
from transformers import AutoTokenizer

model_name = "BAAI/bge-base-en"
max_length = 100

tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "数据hello数据"
# Truncation via the text -> tokens -> text round trip.
tokens = tokenizer.tokenize(text)
tokens = tokens[:max_length]
truncate_text = tokenizer.convert_tokens_to_string(tokens)
print(truncate_text)
# '[UNK] [UNK] hello [UNK] [UNK]'
```
BAAI/bge-base-en does not support Chinese, so in the text -> tokens -> text round trip the unsupported characters become [UNK].
How to fix it
- Modify tokenizer_config.json and change the special_tokens from uppercase to lowercase, e.g. [UNK] -> [unk], so they no longer conflict with do_lower_case.
- When do_lower_case is enabled, automatically lowercase the special_tokens as well; this is the fix adopted in https://github.com/vllm-project/vllm/pull/20750.
- Avoid the text -> tokens -> text round trip altogether: vllm supports truncate_prompt_tokens to truncate the input automatically, and when manual truncation needs to be more flexible, vllm also accepts prompt_token_ids as input (see the sketch after this list).
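A sketch of that third option under assumed names and lengths (the model path and token budget below are placeholders, not FlashRAG's actual code): truncate at the token-id level and hand the ids to vllm directly, so no [UNK] strings are ever re-materialized.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Placeholder path and budget; adjust to your setup.
model_path = "models/Meta-Llama-3-8B-Instruct"
max_prompt_tokens = 4000  # leave headroom below max_model_len

tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path, max_model_len=4096)

prompt = "..."  # a long IRCoT prompt
token_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"][:max_prompt_tokens]

# Pass token ids directly instead of re-decoded text. Alternatively,
# SamplingParams(truncate_prompt_tokens=k) lets vllm keep only the last k tokens.
outputs = llm.generate(
    {"prompt_token_ids": token_ids},
    SamplingParams(max_tokens=128, stop=[".", "\n"]),
)
print(outputs[0].outputs[0].text)
```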
@noooop @ignorejjj thanks!