llm-inference
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, a RESTful API, auto-scaling, computing resource...
Enhance the inference API to support OpenAI-style requests
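As a rough illustration, an OpenAI-compatible endpoint would accept requests like the following sketch; the base_url, api_key, and model name are placeholder assumptions, not values from this repository:

```python
# Minimal sketch of an OpenAI-style chat completion request against a
# locally served endpoint. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```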
```
initialization:
  runtime_env:
    env_vars:
      HF_ENDPOINT: https://hub.opencsg.com/hf
  initializer:
    type: Vllm
    from_pretrained_kwargs:
      trust_remote_code: true
  pipeline: vllm
```
This configuration does not work as expected.
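For comparison, the vLLM engine itself does accept trust_remote_code directly; a minimal standalone sketch, assuming from_pretrained_kwargs is meant to be forwarded to this constructor (the model name is an example, not taken from the issue):

```python
# Standalone vLLM usage for comparison; the model name is illustrative.
# The from_pretrained_kwargs in the config above would need to reach
# this constructor for trust_remote_code to take effect.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-7B-Chat", trust_remote_code=True)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```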
Running Qwen1.5-72B-Chat-GPTQ-Int4 is much slower than running Qwen1.5-72B-Chat with the transformers package. Quantized models need to be loaded with auto_gptq. https://github.com/QwenLM/Qwen/blob/main/README_CN.md#%E6%8E%A8%E7%90%86%E6%80%A7%E8%83%BD
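A minimal sketch of loading the GPTQ checkpoint through auto_gptq rather than plain transformers, assuming auto_gptq is installed and CUDA devices are available:

```python
# Sketch: load a GPTQ-quantized checkpoint with auto_gptq instead of
# transformers' AutoModelForCausalLM. Assumes auto_gptq is installed.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```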
```
(PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF pid=42050) [INFO 2024-04-16 09:34:13,880] llamacpp_pipeline.py: 212 generate_kwargs: {'max_tokens': 1024, 'echo': False, 'stop': [''], 'logits_processor': [], 'stopping_criteria': []}
(ServeController pid=41618) ERROR 2024-04-16 09:34:14,246 controller 41618 deployment_state.py:658 - Exception in replica...
```
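To reproduce the call outside Ray Serve, the logged generate_kwargs map onto llama-cpp-python's completion API roughly as follows; the model path and prompt are placeholders:

```python
# Sketch: replay the logged generate_kwargs directly against
# llama-cpp-python, outside Ray Serve. Model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="/path/to/qwen1_5-72b-chat.gguf")
out = llm(
    "Hello!",
    max_tokens=1024,
    echo=False,
    stop=[""],  # the empty-string stop token from the log looks suspicious
)
print(out["choices"][0]["text"])
```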
```
Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Building wheels for collected packages: deepspeed, llama-cpp-python, llm-serve, ffmpy
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.14.0-py3-none-any.whl size=1400347 sha256=db3cabb92e930a4d76b2adf48e2bae802dc28c333d54d790ab2b4256efe03fe0
  Stored...
```
Model inference across multiple nodes
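Since the project already runs on Ray (see the ServeController and PredictionWorker logs above), one common way to span nodes is vLLM's tensor parallelism on top of a Ray cluster; a minimal sketch, assuming the cluster is already up and the GPU count is illustrative:

```python
# Sketch: cross-node inference via vLLM tensor parallelism on a Ray
# cluster. Assumes `ray start` has already joined the worker nodes and
# that 8 GPUs are available across them; all numbers are illustrative.
import ray
from vllm import LLM, SamplingParams

ray.init(address="auto")  # attach to the existing Ray cluster
llm = LLM(model="Qwen/Qwen1.5-72B-Chat", tensor_parallel_size=8)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```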
Support quantized models, for example:
https://huggingface.co/THUDM/chatglm2-6b-int4
https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GPTQ-Int4
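A minimal sketch of loading the int4 ChatGLM2 checkpoint, following the pattern from its model card; assumes a CUDA device, and note that chat() is provided by the model's remote code rather than by transformers itself:

```python
# Sketch: load the int4-quantized ChatGLM2 checkpoint via transformers'
# remote-code path, per the model card's pattern. Assumes a CUDA GPU.
from transformers import AutoModel, AutoTokenizer

model_id = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "Hello", history=[])
print(response)
```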