arkohut
Great question. I have a similar issue.
https://github.com/vllm-project/vllm/issues/2413 This may be helpful.
> The more code you have in the repo that isn't the main purpose of the project, the harder it is to maintain high quality code and quickly deliver features....
> Can you move it to `examples/gradio_openai_chatbot_webserver.py`?

It's done.
Do I need to make any other updates? @zhuohan123
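For anyone landing here, a minimal sketch of what a Gradio chatbot talking to a vLLM OpenAI-compatible server could look like; the `base_url`, API key, and model name below are assumptions for illustration, not the contents of the actual example file:

```python
import gradio as gr
from openai import OpenAI

# Assumed address of a locally running vLLM OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real key by default
)

def chat(message, history):
    # Rebuild the conversation in OpenAI chat format from Gradio's
    # (user, assistant) history pairs.
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed model name
        messages=messages,
    )
    return response.choices[0].message.content

gr.ChatInterface(chat).launch()
```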
Tried GPTQ and AWQ quantization of Mixtral-8x7B-Instruct-v0.1 and got quite different performance.

## GPU

A40 48G VRAM

## vLLM version

`0.2.6`. The latest version `0.2.7` will run out of memory for...
Sorry for the wrong info; during my test, AWQ was much faster than GPTQ. I have already updated the message.
So maybe the MoE model behaves quite differently under these quantization schemes?
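For reference, a rough sketch of how the two quantized checkpoints can be loaded through vLLM's offline `LLM` API; the TheBloke repository names are assumptions for illustration, not necessarily the checkpoints used in the test above:

```python
from vllm import LLM, SamplingParams

# AWQ variant (the faster one in my test); model name is an assumption.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
)

# The GPTQ variant would be loaded the same way:
# llm = LLM(model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", quantization="gptq")

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```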
1. The particular feature is that it includes a free FRP server from Gradio, which is maintained by Hugging Face, so it looks quite stable (a minimal sketch of the share feature follows below); more info can be found in...
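For context, using that FRP tunnel amounts to a single flag in a Gradio app; this is a minimal sketch, not the code under discussion:

```python
import gradio as gr

# share=True tunnels the local app through Gradio's free FRP servers
# and prints a public *.gradio.live URL.
demo = gr.Interface(fn=lambda name: f"Hello, {name}!", inputs="text", outputs="text")
demo.launch(share=True)
```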
@anderspitman