
IRCOT reproduction failure

Open thunderbolt-fire opened this issue 6 months ago • 10 comments


The input text length is greater than the maximum length (8376 > 8192) and has been truncated!
The input text length is greater than the maximum length (8607 > 8192) and has been truncated!
The input text length is greater than the maximum length (8295 > 8192) and has been truncated!
The input text length is greater than the maximum length (8733 > 8192) and has been truncated!
The input text length is greater than the maximum length (8592 > 8192) and has been truncated!
The input text length is greater than the maximum length (9489 > 8192) and has been truncated!
The input text length is greater than the maximum length (8283 > 8192) and has been truncated!
The input text length is greater than the maximum length (8330 > 8192) and has been truncated!
The input text length is greater than the maximum length (9055 > 8192) and has been truncated!
Traceback (most recent call last):
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 650, in <module>
    func(args)
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 456, in ircot
    result = pipeline.run(test_data)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 1040, in run
    self.run_batch(dataset)
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 986, in run_batch
    new_thoughts_batch = self.generator.generate(input_prompts, stop=['.', '\n'])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/generator/generator.py", line 258, in generate
    outputs = self.model.generate(input_list, sampling_params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/


ValueError: The decoder prompt (length 8192) is longer than the maximum model length of 8192. Make sure that `max_model_len` is no smaller than the number of text tokens.

Changing max_len to 1024 does not help either.

thunderbolt-fire avatar Jun 09 '25 02:06 thunderbolt-fire

I tested it, and the vllm framework does indeed trigger this error. You can try switching framework to hf.
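
For reference, a minimal sketch of that workaround, assuming the standard flashrag.config.Config / get_generator interface; the YAML file name here is a placeholder:

from flashrag.config import Config
from flashrag.utils import get_generator

# Override the generator backend without editing the YAML file;
# "framework" is the config key mentioned above (vllm -> hf).
config = Config(
    config_file_path="my_config.yaml",   # placeholder config path
    config_dict={"framework": "hf"},
)
generator = get_generator(config)        # builds the HF-based generator instead of the vllm one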

lihai-zhao avatar Jun 10 '25 02:06 lihai-zhao


(Quoting the original report above: the repeated truncation warnings and the ValueError about the decoder prompt exceeding max_model_len; changing max_len to 1024 did not help.)

After testing, this turned out to be a problem introduced by a transformers version update; it has been fixed.

ignorejjj avatar Jun 10 '25 03:06 ignorejjj


(Quoting the original report and the reply above: "After testing, this turned out to be a problem introduced by a transformers version update; it has been fixed.")

It still errors out after updating.

(flashrag) (base) root@219408eb5eb7:~/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods# python run_exp.py --method_name 'ircot'                   --split 'dev'                   --dataset_name 'hotpotqa'                   --gpu_id '0'
Loading dataset from: /root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/FlashRAG_Dataset/hotpotqa...
Loading dev dataset from: /root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/FlashRAG_Dataset/hotpotqa/dev.jsonl...
/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct
INFO 06-13 11:23:07 [__init__.py:239] Automatically detected platform cuda.
INFO 06-13 11:23:21 [config.py:689] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 06-13 11:23:21 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-13 11:23:31 [__init__.py:239] Automatically detected platform cuda.
INFO 06-13 11:23:35 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 06-13 11:23:35 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9cab72d710>
[rank0]:[W613 11:23:36.675951062 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 06-13 11:23:36 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-13 11:23:36 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 06-13 11:23:36 [gpu_model_runner.py:1276] Starting to load model /root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/models/Meta-Llama-3-8B-Instruct...
WARNING 06-13 11:23:36 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.33s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:07<00:08,  4.31s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:13<00:05,  5.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:19<00:00,  5.38s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:19<00:00,  4.89s/it]

INFO 06-13 11:23:56 [loader.py:458] Loading weights took 19.68 seconds
INFO 06-13 11:23:56 [gpu_model_runner.py:1291] Model loading took 14.9596 GiB and 19.970492 seconds
INFO 06-13 11:24:19 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f3343ce5b8/rank_0_0 for vLLM's torch.compile
INFO 06-13 11:24:19 [backends.py:426] Dynamo bytecode transform time: 22.81 s
INFO 06-13 11:24:20 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 06-13 11:24:33 [monitor.py:33] torch.compile takes 22.81 s in total
INFO 06-13 11:24:34 [kv_cache_utils.py:634] GPU KV cache size: 414,944 tokens
INFO 06-13 11:24:34 [kv_cache_utils.py:637] Maximum concurrency for 4,096 tokens per request: 101.30x
INFO 06-13 11:25:00 [gpu_model_runner.py:1626] Graph capturing finished in 26 secs, took 0.52 GiB
INFO 06-13 11:25:00 [core.py:163] init engine (profile, create kv cache, warmup model) took 63.73 seconds
INFO 06-13 11:25:00 [core_client.py:435] Core engine process 0 ready.
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [01:02<00:00,  2.25s/it]
Encoding process:   0%|                                                                                                                         | 0/2 [00:00<?, ?it/s]Use `query: ` as retreival instruction
Encoding process: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.76s/it]
Processed prompts: 100%|██████████████████████████████████████████████████| 500/500 [00:27<00:00, 18.07it/s, est. speed input: 22076.92 toks/s, output: 342.04 toks/s]
Encoding process: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.29it/s]
Processed prompts: 100%|██████████████████████████████████████████████████| 499/499 [00:42<00:00, 11.73it/s, est. speed input: 19748.72 toks/s, output: 146.20 toks/s]
Encoding process: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 12.03it/s]
The input text length is greater than the maximum length (4503 > 4096) and has been truncated!
The input text length is greater than the maximum length (5569 > 4096) and has been truncated!
The input text length is greater than the maximum length (4434 > 4096) and has been truncated!
Traceback (most recent call last):
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 652, in <module>
    func(args)
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/examples/methods/run_exp.py", line 458, in ircot
    result = pipeline.run(test_data)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 1040, in run
    self.run_batch(dataset)
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/pipeline/active_pipeline.py", line 986, in run_batch
    new_thoughts_batch = self.generator.generate(input_prompts, stop=['.', '\n'])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/siton-data-0553377b2d664236bad5b5d0ba8aa419/workspace/FlashRAG/flashrag/generator/generator.py", line 258, in generate
    outputs = self.model.generate(input_list, sampling_params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/utils.py", line 1134, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 462, in generate
    self._validate_and_add_requests(
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1342, in _validate_and_add_requests
    self._add_request(
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1360, in _add_request
    self.llm_engine.add_request(
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/llm_engine.py", line 186, in add_request
    request = self.processor.process_inputs(request_id, prompt, params,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/processor.py", line 236, in process_inputs
    self._validate_model_inputs(processed_inputs, lora_request)
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/processor.py", line 328, in _validate_model_inputs
    self._validate_model_input(decoder_inputs,
  File "/opt/conda/envs/flashrag/lib/python3.11/site-packages/vllm/v1/engine/processor.py", line 377, in _validate_model_input
    raise ValueError(
ValueError: The decoder prompt (length 4096) is longer than the maximum model length of 4096. Make sure that `max_model_len` is no smaller than the number of text tokens.
[rank0]:[W613 11:37:52.204596080 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

thunderbolt-fire avatar Jun 13 '25 03:06 thunderbolt-fire

It still errors out after updating.

For this bug we also updated the toolkit code; please pull our latest code.

lihai-zhao avatar Jun 13 '25 13:06 lihai-zhao

It still errors out after updating.

For this bug we also updated the toolkit code; please pull our latest code.

I have re-pulled the latest code, but I still get the same error as above.

thunderbolt-fire avatar Jun 15 '25 06:06 thunderbolt-fire

It still errors out after updating.

For this bug we also updated the toolkit code; please pull our latest code.

I have re-pulled the latest code, but I still get the same error as above.

Let me add you on WeChat through the group to look into the details.

lihai-zhao avatar Jun 16 '25 02:06 lihai-zhao

https://github.com/vllm-project/vllm/pull/20750 should have resolved this issue, but due to server constraints I cannot upgrade vllm to a newer version without it conflicting with many other Python libraries. How can I resolve this on 0.8.5.post1?

wannanfeng avatar Jul 18 '25 14:07 wannanfeng

vllm-project/vllm#20750 should have resolved this issue, but due to server constraints I cannot upgrade vllm to a newer version without it conflicting with many other Python libraries. How can I resolve this on 0.8.5.post1?

We have now fixed this issue, which has nothing to do with the vllm version. It was actually caused by truncation.

ignorejjj avatar Jul 20 '25 06:07 ignorejjj

How this bug arises

Take BAAI/bge-base-en as an example:

sentence_bert_config.json sets do_lower_case to true.

tokenizer_config.json defines the special_tokens in uppercase, e.g. [UNK].

This causes a conflict. For example, take the input "[UNK]" * 10.

The do_lower_case logic is applied here:

vllm:

https://github.com/vllm-project/vllm/blob/7ba34b1241ada58f8212f350a8b17382cb412cf2/vllm/inputs/preprocess.py#L215-L216

sentence-transformers:

https://github.com/UKPLab/sentence-transformers/blob/d5c8f5181daa468c63041fde18c659c4e4267e77/sentence_transformers/models/Transformer.py#L490-L492

The special_tokens get lowercased as well, but the tokenizer only recognizes the uppercase special_tokens, so they no longer have any effect.
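
For illustration, a minimal sketch (not from the thread) of why the lowercased token is no longer treated as special; the exact wordpiece split depends on the vocab:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en")

# The uppercase form is matched as a special token and stays a single token.
print(tokenizer.tokenize("[UNK]"))          # ['[UNK]']

# After do_lower_case it no longer matches the special token and is split
# into ordinary word pieces, so the token count grows.
print(tokenizer.tokenize("[UNK]".lower()))  # e.g. ['[', 'un', '##k', ']']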

When are special_tokens used?

Normally, special_tokens rarely appear in input text, but the truncation path produces [UNK]:

from transformers import AutoTokenizer
model_name = "BAAI/bge-base-en"
max_length = 100
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "数据hello数据"
tokens = tokenizer.tokenize(text)
tokens = tokens[:max_length]
truncate_text = tokenizer.convert_tokens_to_string(tokens)
print(truncate_text)
# '[UNK] [UNK] hello [UNK] [UNK]'

BAAI/bge-base-en does not support Chinese, so during the text -> tokens -> text round trip, the unsupported characters become [UNK].
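
Continuing the sketch above (an illustration, not code from the thread): once the truncated text is lowercased downstream, re-tokenizing it yields more tokens than the truncation limit, which is how a "truncated" prompt can still exceed max_model_len:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en")
truncated_text = "[UNK] [UNK] hello [UNK] [UNK]"   # output of the snippet above

# The special tokens are still intact here: 5 tokens.
print(len(tokenizer.tokenize(truncated_text)))

# After lowercasing, each [unk] expands into several word pieces,
# so the re-tokenized length exceeds the original truncation limit.
print(len(tokenizer.tokenize(truncated_text.lower())))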

How to fix it

  1. Edit tokenizer_config.json and change the special_tokens from uppercase to lowercase ([UNK] -> [unk]), so they no longer conflict with do_lower_case.
  2. When do_lower_case is enabled, automatically lowercase the special_tokens as well; this is the fix adopted in https://github.com/vllm-project/vllm/pull/20750.
  3. Avoid the text -> tokens -> text round trip. For example, vllm supports truncate_prompt_tokens to truncate the input automatically; when manual truncation needs to be more flexible, vllm also accepts prompt_token_ids as input (see the sketch after this list).
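
For illustration, a minimal sketch of option 3 (not FlashRAG code; assumes a recent vllm such as 0.8.x, where SamplingParams accepts truncate_prompt_tokens and generate accepts token-id prompts; the model path and lengths are placeholders):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Meta-Llama-3-8B-Instruct"   # placeholder model path
max_model_len = 4096
max_new_tokens = 32

llm = LLM(model=model_path, max_model_len=max_model_len)
tokenizer = AutoTokenizer.from_pretrained(model_path)
long_prompt = "..."  # an over-long IRCOT prompt

# (a) Let vllm truncate at the token level (keeps the last N prompt tokens),
#     avoiding the lossy text -> tokens -> text round trip entirely.
params = SamplingParams(
    max_tokens=max_new_tokens,
    truncate_prompt_tokens=max_model_len - max_new_tokens,
)
outputs = llm.generate([long_prompt], params)

# (b) Truncate manually and pass token ids directly as the prompt.
ids = tokenizer(long_prompt, add_special_tokens=False)["input_ids"]
ids = ids[: max_model_len - max_new_tokens]
outputs = llm.generate([{"prompt_token_ids": ids}], SamplingParams(max_tokens=max_new_tokens))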

noooop avatar Jul 21 '25 03:07 noooop

@noooop @ignorejjj thanks!

wannanfeng avatar Jul 22 '25 08:07 wannanfeng