Error during inference with Mixtral 8x7B GPTQ
Traceback (most recent call last):
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
task.result()
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 338, in engine_step
request_outputs = await self.engine.step_async()
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 199, in step_async
return self._process_model_outputs(output, scheduler_outputs) + ignored
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 562, in _process_model_outputs
self._process_sequence_group_outputs(seq_group, outputs)
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 554, in _process_sequence_group_outputs
self.scheduler.free_seq(seq)
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/scheduler.py", line 312, in free_seq
self.block_manager.free(seq)
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 277, in free
self._free_block_table(block_table)
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 268, in _free_block_table
self.gpu_allocator.free(block)
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 48, in free
raise ValueError(f"Double free! {block} is already freed.")
ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=2611, ref_count=0) is already freed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
result = await app(  # type: ignore[func-returns-value]
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
await super().__call__(scope, receive, send)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 108, in __call__
response = await self.dispatch_func(request, call_next)
File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 63, in add_cors_header
response = await call_next(request)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 84, in call_next
raise app_exc
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 70, in coro
await self.app(scope, receive_or_disconnect, send_no_error)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
raw_response = await run_endpoint_function(
File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 137, in generate
async for request_output in results_generator:
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 445, in generate
raise e
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 439, in generate
async for request_output in stream:
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 70, in __anext__
raise result
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
raise exc
File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Got the same error with the original model.
Got the same error with a fine-tuned Mixtral 8x7B.
I tried to load a GPTQ version of Mixtral 8x7B and got an error, but a different one from the one posted here.
I got:
config.py gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
config.py gptq does not support CUDA graph yet. Disabling CUDA graph.
worker.py -- Started a local Ray instance.
llm_engine.py Initializing an LLM engine with config: model='model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=gptq, enforce_eager=True, seed=0)
Traceback (most recent call last):
File "local_path/mixtral_vllm.py", line 3, in <module>
llm = LLM("model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ", quantization="GPTQ", tensor_parallel_size=4)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 250, in from_engine_args
engine = cls(*engine_configs,
^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 108, in __init__
self._init_workers_ray(placement_group)
File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
self._run_workers(
File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 755, in _run_workers
self._run_workers_in_batch(workers, method, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
all_outputs = ray.get(all_outputs)
^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerVllm.execute_method() (pid=X, ip=X.X.X.X, actor_id=X, repr=<vllm.engine.ray_utils.RayWorkerVllm object at X>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/ray_utils.py", line 31, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/vllm/worker/worker.py", line 79, in load_model
self.model_runner.load_model()
File "anaconda_env_path/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 57, in load_model
self.model = get_model(self.model_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "anaconda_env_path/lib/python3.11/site-packages/vllm/model_executor/model_loader.py", line 55, in get_model
raise ValueError(
ValueError: torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16]
I tried changing the dtype in the config.json to torch.float16 to fix it, but instead I got the same error as in https://github.com/vllm-project/vllm/issues/2251. Maybe these two errors are actually the same, both related to vLLM not supporting torch.bfloat16 here? @casper-hansen
You need to use float16 or half for quantization.
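Concretely, the dtype does not have to be edited in config.json; it can be overridden when constructing the engine. A minimal sketch against the ~0.2.x offline API, reusing the GPTQ path from the traceback above (the prompt and sampling settings are just placeholders):

```python
from vllm import LLM, SamplingParams

# dtype="float16" (or "half") overrides the bfloat16 torch_dtype coming from
# the model's config.json, which vLLM's GPTQ path does not support.
llm = LLM(
    model="model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="float16",
    tensor_parallel_size=4,
)

# Placeholder prompt/settings to show the call shape.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```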
@casper-hansen
You need to use float16 or half for quantization.
I switched it to torch.float16 in the config.json and my error changed to the one in https://github.com/vllm-project/vllm/issues/2251.
Did you try upgrading to the latest vLLM?
I'll try doing that now
Yep! It seems like the latest vLLM has fixed this bug. Both GPTQ and AWQ are working for me now. Thanks for the help :)
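For anyone hitting this later: the resolution in this thread was upgrading vLLM (e.g. pip install -U vllm) while keeping the dtype at float16. A short sketch of the equivalent AWQ load; the AWQ repo id below is an assumption, swap in whichever quantized checkpoint you actually use:

```python
from vllm import LLM

# Same constructor as the GPTQ example above; only the quantization backend
# and checkpoint change. The model id here is an assumption.
llm_awq = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="float16",
    tensor_parallel_size=4,
)
```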