
Provide support for model serving using FastAPI with DeepSpeed + ipex-llm

Open nazneenn opened this issue 1 year ago • 3 comments

Hi, could you please provide a guide on using the DeepSpeed approach for multi-GPU model inference on Intel Flex 140 GPUs, served via FastAPI with uvicorn? Model id: 'meta-llama/Llama-2-7b-chat-hf'. Thanks!

nazneenn avatar Apr 08 '24 05:04 nazneenn
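
[Editor's note] For orientation, here is a minimal single-GPU sketch of the serving shape the question asks about: a Llama-2 model loaded with ipex-llm low-bit optimization and exposed through FastAPI/uvicorn. This is not the official example; the endpoint name and request schema are illustrative, and the multi-GPU DeepSpeed part is addressed further down the thread.

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # ipex-llm drop-in wrapper

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"

# Load with 4-bit low-bit optimization, then move to an Intel GPU ("xpu").
model = AutoModelForCausalLM.from_pretrained(MODEL_ID,
                                             load_in_4bit=True,
                                             optimize_model=True)
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

app = FastAPI()

class GenerateRequest(BaseModel):  # request schema is illustrative
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to("xpu")
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run with, e.g.: uvicorn server:app --host 0.0.0.0 --port 8000
```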

Hi @nazneenn, we are developing a PoC of FastAPI serving using multi-GPU; we will keep you updated.

glorysdj avatar Apr 09 '24 02:04 glorysdj

Watching this one - I'll be aiming to run Mixtral 8x7b AWQ on a pair of Arc A770s (I'll be buying the second GPU as soon as I know it's supported).

digitalscream avatar Apr 09 '24 12:04 digitalscream

Hi @nazneenn @digitalscream, FastAPI serving with multi-GPU is now supported in ipex-llm. Please refer to this example: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Deepspeed-AutoTP-FastAPI

glorysdj avatar Apr 17 '24 01:04 glorysdj
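
[Editor's note] Below is a condensed, hedged sketch of the multi-GPU pattern used by the linked Deepspeed-AutoTP-FastAPI example: load the model on CPU, let DeepSpeed AutoTP shard it across ranks, apply ipex-llm low-bit optimization to each shard, then move it to the rank's Intel GPU. It assumes a launcher (e.g. mpirun or torchrun) that sets LOCAL_RANK/WORLD_SIZE; exact arguments may differ from the example, so treat it as orientation only and defer to the linked code.

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"

# Load the full fp16 model on CPU first; DeepSpeed AutoTP then shards the
# linear layers across ranks (one rank per GPU / GPU tile).
model = AutoModelForCausalLM.from_pretrained(MODEL_ID,
                                             torch_dtype=torch.float16,
                                             low_cpu_mem_usage=True)
model = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": world_size},
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=False)

# Apply ipex-llm low-bit optimization to this rank's shard, then move it
# to the corresponding Intel GPU.
model = optimize_model(model.module.to("cpu"), low_bit="sym_int4")
model = model.to(f"xpu:{local_rank}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# In the linked example, only rank 0 runs uvicorn/FastAPI; it broadcasts each
# incoming prompt to the other ranks, all ranks call generate() together, and
# rank 0 decodes the output and returns it to the HTTP client.
```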