
Error during inference with Mixtral 8x7B GPTQ

Open mlinmg opened this issue 1 year ago • 8 comments

Traceback (most recent call last):
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 338, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 199, in step_async
    return self._process_model_outputs(output, scheduler_outputs) + ignored
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 562, in _process_model_outputs
    self._process_sequence_group_outputs(seq_group, outputs)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 554, in _process_sequence_group_outputs
    self.scheduler.free_seq(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/scheduler.py", line 312, in free_seq
    self.block_manager.free(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 277, in free
    self._free_block_table(block_table)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 268, in _free_block_table
    self.gpu_allocator.free(block)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 48, in free
    raise ValueError(f"Double free! {block} is already freed.")
ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=2611, ref_count=0) is already freed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 63, in add_cors_header
    response = await call_next(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 84, in call_next
    raise app_exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 137, in generate
    async for request_output in results_generator:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 445, in generate
    raise e
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 439, in generate
    async for request_output in stream:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

mlinmg avatar Dec 26 '23 14:12 mlinmg

Got the same error with the original model

oushu1zhangxiangxuan1 avatar Dec 27 '23 07:12 oushu1zhangxiangxuan1

Got the same error with a fine-tuned Mixtral 8x7B

adamlin120 avatar Dec 30 '23 02:12 adamlin120

I tried to load a GPTQ version of Mixtral 8x7b and got an error, but a different one than posted here.

I got:

config.py gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
config.py gptq does not support CUDA graph yet. Disabling CUDA graph.
worker.py -- Started a local Ray instance.
llm_engine.py Initializing an LLM engine with config: model='model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=gptq, enforce_eager=True, seed=0)
Traceback (most recent call last):
  File "local_path/mixtral_vllm.py", line 3, in <module>
    llm = LLM("model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ", quantization="GPTQ", tensor_parallel_size=4)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 250, in from_engine_args
    engine = cls(*engine_configs,
             ^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
    self._run_workers(
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
                  ^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerVllm.execute_method() (pid=X, ip=X.X.X.X, actor_id=X, repr=<vllm.engine.ray_utils.RayWorkerVllm object at X>)
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/model_executor/model_loader.py", line 55, in get_model
    raise ValueError(
ValueError: torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16]

I tried changing the dtype in config.json to torch.float16 to fix it, but then I got the same error as in https://github.com/vllm-project/vllm/issues/2251. Maybe these two errors are actually the same and come from vLLM's GPTQ path not supporting torch.bfloat16? @casper-hansen
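For reference, forcing the dtype at engine construction (rather than editing config.json) looks roughly like this; the local path and tensor-parallel size are placeholders matching the log above:

```python
from vllm import LLM, SamplingParams

# Placeholder path to the GPTQ checkpoint; adjust to your setup.
MODEL_PATH = "model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ"

# vLLM's GPTQ kernels only support float16, so pass the dtype explicitly
# instead of relying on the bfloat16 value stored in the model's config.json.
llm = LLM(
    model=MODEL_PATH,
    quantization="gptq",
    dtype="float16",         # "half" works too
    tensor_parallel_size=4,  # matches the 4-GPU run in the log above
)

print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```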

iibw avatar Jan 03 '24 05:01 iibw

You need to use float16 or half for quantization.

casper-hansen avatar Jan 04 '24 11:01 casper-hansen

@casper-hansen

You need to use float16 or half for quantization.

I switched it to torch.float16 in the config.json and my error changed to the one in https://github.com/vllm-project/vllm/issues/2251

iibw avatar Jan 04 '24 15:01 iibw

Did you try upgrading to the latest vLLM?

casper-hansen avatar Jan 04 '24 15:01 casper-hansen

I'll try doing that now

iibw avatar Jan 04 '24 15:01 iibw

Yep! It seems like the latest vLLM has fixed this bug. Both GPTQ and AWQ are working for me now. Thanks for the help :)
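In case it helps anyone landing here later, a quick smoke test of the upgraded install (the model path is a placeholder; swap quantization to "awq" for the AWQ checkpoint):

```python
# Confirm the upgraded vLLM version, then load the quantized model end to end.
import vllm
print(vllm.__version__)

from vllm import LLM, SamplingParams

llm = LLM(model="model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ",
          quantization="gptq",   # or "awq" for the AWQ checkpoint
          dtype="half",
          tensor_parallel_size=4)
print(llm.generate(["Sanity check:"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```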

iibw avatar Jan 04 '24 20:01 iibw