CUDA 12.1 vllm==0.2.3 Double Free
I tried this with FastChat, which uses the vLLM backend, with both of the following inputs (a self-contained reproduction is sketched after the two snippets):
1. Non-streaming request:

```python
openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "user", "content": prompt}
    ],
    stream=False,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    max_tokens=max_tokens,
    best_of=best_of,
    n=n,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    use_beam_search=True,
)
```
2. Streaming request:

```python
openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "user", "content": prompt}
    ],
    stream=True,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    max_tokens=max_tokens,
    best_of=best_of,
    n=n,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    use_beam_search=True,
)
```
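For reference, here is a minimal self-contained sketch of the non-streaming variant against a FastChat OpenAI-compatible server, using the legacy openai 0.x client. The base URL, API key, model name, prompt, and sampling values are placeholders, not the exact values from my run:

```python
import openai

# Point the legacy openai client at a FastChat OpenAI-compatible server.
# URL, key, and model name below are placeholders for illustration.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"

model = "openhermes-2.5"
prompt = "Explain beam search decoding in one paragraph."

response = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=False,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    max_tokens=512,     # placeholder values; beam search in vLLM typically
    best_of=4,          # expects temperature=0, top_p=1, top_k=-1, best_of > 1
    n=2,
    temperature=0.0,
    top_p=1.0,
    top_k=-1,
    use_beam_search=True,
)
print(response["choices"][0]["message"]["content"])
```

The extra parameters (`best_of`, `top_k`, `use_beam_search`) are vLLM/FastChat extensions rather than standard OpenAI fields; the legacy client forwards them in the request body.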
Both requests raise the following error:
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
2023-12-05 08:52:19 | ERROR | stderr | task.result()
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
2023-12-05 08:52:19 | ERROR | stderr | has_requests_in_progress = await self.engine_step()
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 338, in engine_step
2023-12-05 08:52:19 | ERROR | stderr | request_outputs = await self.engine.step_async()
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 199, in step_async
2023-12-05 08:52:19 | ERROR | stderr | return self._process_model_outputs(output, scheduler_outputs) + ignored
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/llm_engine.py", line 545, in _process_model_outputs
2023-12-05 08:52:19 | ERROR | stderr | self._process_sequence_group_outputs(seq_group, outputs)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/llm_engine.py", line 537, in _process_sequence_group_outputs
2023-12-05 08:52:19 | ERROR | stderr | self.scheduler.free_seq(seq)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/scheduler.py", line 310, in free_seq
2023-12-05 08:52:19 | ERROR | stderr | self.block_manager.free(seq)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 277, in free
2023-12-05 08:52:19 | ERROR | stderr | self._free_block_table(block_table)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 268, in _free_block_table
2023-12-05 08:52:19 | ERROR | stderr | self.gpu_allocator.free(block)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 48, in free
2023-12-05 08:52:19 | ERROR | stderr | raise ValueError(f"Double free! {block} is already freed.")
2023-12-05 08:52:19 | ERROR | stderr | ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=11634, ref_count=0) is already freed.
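For anyone skimming the traceback: the ValueError comes from a reference-counting guard in the GPU block allocator. A block whose ref_count has already dropped to zero is being freed a second time when _process_sequence_group_outputs releases a finished sequence. A simplified sketch of that kind of guard (illustrative only, not vLLM's actual block_manager code):

```python
from dataclasses import dataclass

@dataclass
class PhysicalTokenBlock:
    device: str
    block_number: int
    ref_count: int = 0

class BlockAllocator:
    """Simplified reference-counted block allocator, for illustration only."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = [
            PhysicalTokenBlock(device="GPU", block_number=i)
            for i in range(num_blocks)
        ]

    def allocate(self) -> PhysicalTokenBlock:
        block = self.free_blocks.pop()
        block.ref_count = 1
        return block

    def free(self, block: PhysicalTokenBlock) -> None:
        # Freeing a block whose ref_count is already zero means it was
        # returned twice -- the "Double free!" condition in the traceback.
        if block.ref_count == 0:
            raise ValueError(f"Double free! {block} is already freed.")
        block.ref_count -= 1
        if block.ref_count == 0:
            self.free_blocks.append(block)

allocator = BlockAllocator(num_blocks=4)
block = allocator.allocate()
allocator.free(block)   # ok: ref_count 1 -> 0, block returned to the pool
allocator.free(block)   # raises ValueError: Double free! ...
```

In the traceback above, the scheduler ends up calling free_seq on a sequence whose blocks were already released, which trips exactly this check.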
Hi @tjtanaa, thanks for reporting the bug! Which model are you using? Is it Mistral?
Yes, I am using OpenHermes-2.5, which is based on Mistral.
It is happening for me as well: CUDA 12.1, vllm==0.2.6, with Mixtral 8x7B, on long prompts.
@WoosukKwon any tips on this?
+1
+1, same issue here using CUDA/12.1.1, Python/3.10.4-GCCcore-11.3.0, vllm==0.2.3. It happened after 5-10 inferences with a LoRA fine-tuned Mistral 7B model: vllm_double_free_bug.log
EDIT: In our case, the fine-tuned model was trained with 1024 input tokens; exceeding this caused the double free error.
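If the crash correlates with prompts exceeding the length the model was tuned for, a client-side length check before sending the request can at least confirm the pattern. A minimal sketch, assuming the matching tokenizer is available via Hugging Face transformers; the model id and limit below are placeholders:

```python
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 1024  # length the fine-tuned model was trained with (per the report above)

# Placeholder model id; use the tokenizer that matches your fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def check_prompt_length(prompt: str) -> int:
    """Return the prompt's token count, raising if it exceeds the trained context."""
    num_tokens = len(tokenizer.encode(prompt))
    if num_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Prompt is {num_tokens} tokens, exceeding the {MAX_INPUT_TOKENS}-token "
            "limit that appears to trigger the double free."
        )
    return num_tokens
```

This only sidesteps the symptom on the client side; the underlying block-manager issue still needs a fix in vLLM.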