FP8 quantization support
Could support for FP8-quantized models be added? vLLM 0.4.3 added support for FP8-quantized models. I tried registering the Qwen2-7B-Instruct-FP8 model in Xinference, but it fails at startup; see the attached error log (xin报错日志.txt). In the same environment, launching directly via the vLLM command line works:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8001 --gpu-memory-utilization 0.9 --served-model-name Qwen2-7B-Instruct-FP8 --model /data2/Qwen2-7B-Instruct-FP8

The vLLM startup output is in the attached log (vllm日志.txt). Comparing the two logs, the parameters reported at llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: are identical whether the engine is started from Xinference or from the vLLM command line, and the environment is the same, so I don't understand why the launch from Xinference fails. Could Xinference support this in the future, or is there a workaround available now? Thanks.
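For reference, the same FP8 checkpoint can also be loaded through vLLM's Python API rather than the OpenAI-compatible server. A minimal sketch, assuming vLLM >= 0.4.3 and the local model path from the report above:

```python
# Minimal sketch: load the FP8 checkpoint directly with vLLM's Python API.
# Assumes vLLM >= 0.4.3 and the local model path from the report above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data2/Qwen2-7B-Instruct-FP8",
    quantization="fp8",            # optional: vLLM can also detect FP8 from the checkpoint config
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

If this loads but the Xinference launch still fails with identical engine config, the difference is likely in how Xinference constructs the engine rather than in vLLM itself.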
Upgrading vLLM to 0.5.3 gives the same error.
@xbl916 We plan to support FP8. Which format did you choose when loading the model in Xinference?
I registered it as a custom model in Xinference:

{
  "version": 1,
  "context_length": 32768,
  "model_name": "qwen2-fp8",
  "model_lang": ["en", "zh"],
  "model_ability": ["generate", "chat"],
  "model_description": "This is a custom model description.",
  "model_family": "qwen2-instruct",
  "model_specs": [
    {
      "model_format": "pytorch",
      "model_size_in_billions": 7,
      "quantizations": ["none"],
      "model_id": null,
      "model_hub": "huggingface",
      "model_uri": "/data2/Qwen2-7B-Instruct-FP8",
      "model_revision": null
    }
  ],
  "prompt_style": {
    "style_name": "QWEN",
    "system_prompt": "You are a helpful assistant.",
    "roles": ["user", "assistant"],
    "intra_message_sep": "\n",
    "inter_message_sep": "",
    "stop": ["<|endoftext|>", "<|im_start|>", "<|im_end|>"],
    "stop_token_ids": [151643, 151644, 151645]
  },
  "is_builtin": false
}

I chose vLLM at launch; from the log, vLLM did detect the FP8 quantization.
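For completeness, the same definition can also be registered and launched programmatically through the Xinference Python client instead of the web UI. A rough sketch, assuming a running server at http://localhost:9997 and the JSON above saved as qwen2-fp8.json (both names are placeholders):

```python
# Rough sketch: register the custom model definition above and launch it
# with the Xinference Python client. Endpoint and file name are
# placeholders for this example.
from xinference.client import Client

client = Client("http://localhost:9997")

# Register the custom model from the JSON definition shown above.
with open("qwen2-fp8.json") as f:
    client.register_model(model_type="LLM", model=f.read(), persist=False)

# Launch it; "pytorch" format with quantization "none" matches the spec above.
model_uid = client.launch_model(
    model_name="qwen2-fp8",
    model_size_in_billions=7,
    model_format="pytorch",
    quantization="none",
)
print(model_uid)
```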
Hmm, I tried it: on the older version it seems to load fine using the pytorch format with no quantization.