CUDA 12.1 vllm==0.2.3 Double Free
I tried this with FastChat, which uses the vLLM backend, with both of the following inputs (a self-contained reproduction is sketched after the two snippets):
1. Non-streaming request:

```python
openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "user", "content": prompt}
    ],
    stream=False,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    max_tokens=max_tokens,
    best_of=best_of,
    n=n,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    use_beam_search=True,
)
```
2. Streaming request:

```python
openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "user", "content": prompt}
    ],
    stream=True,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    max_tokens=max_tokens,
    best_of=best_of,
    n=n,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    use_beam_search=True,
)
```
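For reference, here is a minimal self-contained sketch of the non-streaming variant against a FastChat OpenAI-compatible server, using the legacy openai 0.x client. The base URL, API key, model name, prompt, and sampling values are placeholders, not the exact values from my run:

```python
import openai

# Point the legacy openai client at a FastChat OpenAI-compatible server.
# URL, key, and model name below are placeholders for illustration.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"

model = "openhermes-2.5"
prompt = "Explain beam search decoding in one paragraph."

response = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=False,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    max_tokens=512,     # placeholder values; beam search in vLLM typically
    best_of=4,          # expects temperature=0, top_p=1, top_k=-1, best_of > 1
    n=2,
    temperature=0.0,
    top_p=1.0,
    top_k=-1,
    use_beam_search=True,
)
print(response["choices"][0]["message"]["content"])
```

The extra parameters (`best_of`, `top_k`, `use_beam_search`) are vLLM/FastChat extensions rather than standard OpenAI fields; the legacy client forwards them in the request body.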
Both requests raise the following error:
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
2023-12-05 08:52:19 | ERROR | stderr | task.result()
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
2023-12-05 08:52:19 | ERROR | stderr | has_requests_in_progress = await self.engine_step()
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 338, in engine_step
2023-12-05 08:52:19 | ERROR | stderr | request_outputs = await self.engine.step_async()
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 199, in step_async
2023-12-05 08:52:19 | ERROR | stderr | return self._process_model_outputs(output, scheduler_outputs) + ignored
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/llm_engine.py", line 545, in _process_model_outputs
2023-12-05 08:52:19 | ERROR | stderr | self._process_sequence_group_outputs(seq_group, outputs)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/engine/llm_engine.py", line 537, in _process_sequence_group_outputs
2023-12-05 08:52:19 | ERROR | stderr | self.scheduler.free_seq(seq)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/scheduler.py", line 310, in free_seq
2023-12-05 08:52:19 | ERROR | stderr | self.block_manager.free(seq)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 277, in free
2023-12-05 08:52:19 | ERROR | stderr | self._free_block_table(block_table)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 268, in _free_block_table
2023-12-05 08:52:19 | ERROR | stderr | self.gpu_allocator.free(block)
2023-12-05 08:52:19 | ERROR | stderr | File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 48, in free
2023-12-05 08:52:19 | ERROR | stderr | raise ValueError(f"Double free! {block} is already freed.")
2023-12-05 08:52:19 | ERROR | stderr | ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=11634, ref_count=0) is already freed.
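For anyone skimming the traceback: the ValueError comes from a reference-counting guard in the GPU block allocator. A block whose ref_count has already dropped to zero is being freed a second time when _process_sequence_group_outputs releases a finished sequence. A simplified sketch of that kind of guard (illustrative only, not vLLM's actual block_manager code):

```python
from dataclasses import dataclass

@dataclass
class PhysicalTokenBlock:
    device: str
    block_number: int
    ref_count: int = 0

class BlockAllocator:
    """Simplified reference-counted block allocator, for illustration only."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = [
            PhysicalTokenBlock(device="GPU", block_number=i)
            for i in range(num_blocks)
        ]

    def allocate(self) -> PhysicalTokenBlock:
        block = self.free_blocks.pop()
        block.ref_count = 1
        return block

    def free(self, block: PhysicalTokenBlock) -> None:
        # Freeing a block whose ref_count is already zero means it was
        # returned twice -- the "Double free!" condition in the traceback.
        if block.ref_count == 0:
            raise ValueError(f"Double free! {block} is already freed.")
        block.ref_count -= 1
        if block.ref_count == 0:
            self.free_blocks.append(block)

allocator = BlockAllocator(num_blocks=4)
block = allocator.allocate()
allocator.free(block)   # ok: ref_count 1 -> 0, block returned to the pool
allocator.free(block)   # raises ValueError: Double free! ...
```

In the traceback above, the scheduler ends up calling free_seq on a sequence whose blocks were already released, which trips exactly this check.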
Hi @tjtanaa, thanks for reporting the bug! Which model are you using? Is it Mistral?
Yes, I am using OpenHermes-2.5, which is based on Mistral.
It is happening for me as well: CUDA 12.1, vllm==0.2.6, with Mixtral 8x7B, on long prompts.
@WoosukKwon any tips on this?
+1
+1, same issue here using CUDA/12.1.1, Python/3.10.4-GCCcore-11.3.0, vllm==0.2.3. It happened after 5-10 inferences with a LoRA fine-tuned Mistral 7B model: vllm_double_free_bug.log
EDIT: In our case, the fine-tuned model was trained with 1024 input tokens; exceeding this caused the double free error.
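If the crash correlates with prompts exceeding the length the model was tuned for, a client-side length check before sending the request can at least confirm the pattern. A minimal sketch, assuming the matching tokenizer is available via Hugging Face transformers; the model id and limit below are placeholders:

```python
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 1024  # length the fine-tuned model was trained with (per the report above)

# Placeholder model id; use the tokenizer that matches your fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def check_prompt_length(prompt: str) -> int:
    """Return the prompt's token count, raising if it exceeds the trained context."""
    num_tokens = len(tokenizer.encode(prompt))
    if num_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Prompt is {num_tokens} tokens, exceeding the {MAX_INPUT_TOKENS}-token "
            "limit that appears to trigger the double free."
        )
    return num_tokens
```

This only sidesteps the symptom on the client side; the underlying block-manager issue still needs a fix in vLLM.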