Ascend support
Is your feature request related to a problem? Please describe
No
Describe the solution you'd like
Ascend from Huawei is getting more and more attention in the Chinese market. Please support Ascend in xInference.
Describe alternatives you've considered
Maybe integrating FastChat as a backend is a shortcut to implementing this feature. FastChat has announced that it supports the Ascend NPU, and according to our tests, FastChat DOES support Ascend. As a plus, FastChat also supports ExLlamaV2 on the CUDA architecture.
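For anyone evaluating this route, here is a minimal sanity check (a sketch, assuming Ascend's `torch_npu` PyTorch adapter and the CANN toolkit are installed) that the NPU is visible to PyTorch before pointing a FastChat worker at it:

```python
# Check that PyTorch can see the Ascend NPU. Assumes the torch_npu
# plugin (Ascend's PyTorch adapter) is installed alongside torch.
import torch
import torch_npu  # registers the torch.npu backend

if torch.npu.is_available():
    print(f"Found {torch.npu.device_count()} Ascend NPU device(s)")
    x = torch.ones(2, 2).to("npu:0")  # move a tensor onto the first NPU
    print(x * 2)
else:
    print("No Ascend NPU visible; check the CANN toolkit and driver install")
```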
Additional context
SOEs in China have been asked to adopt local AI hardware vendors such as Ascend, Cambricon (寒武纪), etc.
Which models did you launch with FastChat on Ascend?
Baichuan2 and Qwen1.5. It looks like Qwen has a concurrency issue on Ascend; Baichuan2 works fine.
Do you mean Qwen1.5 has a concurrency issue?
According to our tests of FastChat on the Ascend 310B, the output sometimes gets garbled under concurrent requests. Tested with Qwen1.5-14B.
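A rough way to reproduce this kind of problem (a sketch, assuming an OpenAI-compatible server such as FastChat's is listening at `http://localhost:8000/v1` and the model is registered as `qwen1.5-14b`; both names are assumptions, adjust to your deployment):

```python
# Send identical concurrent requests to an OpenAI-compatible endpoint
# and compare the outputs; divergent or garbled replies under load
# suggest a concurrency bug. BASE_URL and MODEL are assumptions.
import concurrent.futures

import requests

BASE_URL = "http://localhost:8000/v1"  # assumed FastChat OpenAI API server
MODEL = "qwen1.5-14b"                  # assumed registered model name

def ask(i: int) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Count from 1 to 10."}],
            "temperature": 0,  # greedy decoding, so replies should match
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    replies = list(pool.map(ask, range(8)))

for i, reply in enumerate(replies):
    print(f"--- reply {i} ---\n{reply}\n")
print("all identical:", len(set(replies)) == 1)
```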
Ascend support was introduced in #1408; I tested Baichuan-2 and Qwen.
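To try it out, something like the following should work through the Python client (a sketch only: the endpoint, model name, and launch parameters are assumptions, and the client API may differ between versions; adjust to your setup):

```python
# Exercise the new Ascend support via the Xinference Python client.
# Endpoint and model parameters below are assumptions.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")  # assumed local xinference endpoint

# Launch a model; device selection on an Ascend host is assumed to be
# handled by the backend per #1408.
model_uid = client.launch_model(
    model_name="qwen1.5-chat",   # assumed model name
    model_format="pytorch",
    model_size_in_billions=14,
)

model = client.get_model(model_uid)
print(model.chat("Hello, who are you?"))
```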