FP8 quantization support
Could support for FP8-quantized models be added? vLLM 0.4.3 added support for FP8-quantized models. I tried registering the Qwen2-7B-Instruct-FP8 model in Xinference, but it fails at startup; see the attached error log (xin报错日志.txt). In the same environment, launching directly via the vLLM command line works:

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8001 --gpu-memory-utilization 0.9 --served-model-name Qwen2-7B-Instruct-FP8 --model /data2/Qwen2-7B-Instruct-FP8

The vLLM startup output is in the attached log (vllm日志.txt). Comparing the two logs, the parameters reported at llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: are identical whether the engine is started from Xinference or from the vLLM command line, and the environment is the same, so I don't understand why the launch from Xinference fails. Could Xinference support this in the future, or is there a workaround available now? Thanks.
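For reference, the same FP8 checkpoint can also be loaded through vLLM's Python API rather than the OpenAI-compatible server. A minimal sketch, assuming vLLM >= 0.4.3 and the local model path from the report above:

```python
# Minimal sketch: load the FP8 checkpoint directly with vLLM's Python API.
# Assumes vLLM >= 0.4.3 and the local model path from the report above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data2/Qwen2-7B-Instruct-FP8",
    quantization="fp8",            # optional: vLLM can also detect FP8 from the checkpoint config
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

If this loads but the Xinference launch still fails with identical engine config, the difference is likely in how Xinference constructs the engine rather than in vLLM itself.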
Upgrading vLLM to 0.5.3 gives the same error.
@xbl916 We plan to support FP8. Which format did you choose when loading the model in Xinference?
I registered it as a custom model in Xinference:

{
  "version": 1,
  "context_length": 32768,
  "model_name": "qwen2-fp8",
  "model_lang": ["en", "zh"],
  "model_ability": ["generate", "chat"],
  "model_description": "This is a custom model description.",
  "model_family": "qwen2-instruct",
  "model_specs": [
    {
      "model_format": "pytorch",
      "model_size_in_billions": 7,
      "quantizations": ["none"],
      "model_id": null,
      "model_hub": "huggingface",
      "model_uri": "/data2/Qwen2-7B-Instruct-FP8",
      "model_revision": null
    }
  ],
  "prompt_style": {
    "style_name": "QWEN",
    "system_prompt": "You are a helpful assistant.",
    "roles": ["user", "assistant"],
    "intra_message_sep": "\n",
    "inter_message_sep": "",
    "stop": ["<|endoftext|>", "<|im_start|>", "<|im_end|>"],
    "stop_token_ids": [151643, 151644, 151645]
  },
  "is_builtin": false
}

I chose vLLM at launch; from the log, vLLM did detect the FP8 quantization.
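For completeness, the same definition can also be registered and launched programmatically through the Xinference Python client instead of the web UI. A rough sketch, assuming a running server at http://localhost:9997 and the JSON above saved as qwen2-fp8.json (both names are placeholders):

```python
# Rough sketch: register the custom model definition above and launch it
# with the Xinference Python client. Endpoint and file name are
# placeholders for this example.
from xinference.client import Client

client = Client("http://localhost:9997")

# Register the custom model from the JSON definition shown above.
with open("qwen2-fp8.json") as f:
    client.register_model(model_type="LLM", model=f.read(), persist=False)

# Launch it; "pytorch" format with quantization "none" matches the spec above.
model_uid = client.launch_model(
    model_name="qwen2-fp8",
    model_size_in_billions=7,
    model_format="pytorch",
    quantization="none",
)
print(model_uid)
```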
Hmm, I tried it: on the older version it seems to load fine using the pytorch format with no quantization.