llm-inference
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, a RESTful API, auto-scaling, computing resource...
Enhance the inference API to support OpenAI-style requests
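As a rough illustration, an OpenAI-compatible endpoint would accept requests like the following sketch; the base_url, api_key, and model name are placeholder assumptions, not values from this repository:

```python
# Minimal sketch of an OpenAI-style chat completion request against a
# locally served endpoint. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```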
```
initialization:
  runtime_env:
    env_vars:
      HF_ENDPOINT: https://hub.opencsg.com/hf
  initializer:
    type: Vllm
    from_pretrained_kwargs:
      trust_remote_code: true
  pipeline: vllm
```
This configuration does not work as expected.
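For comparison, the vLLM engine itself does accept trust_remote_code directly; a minimal standalone sketch, assuming from_pretrained_kwargs is meant to be forwarded to this constructor (the model name is an example, not taken from the issue):

```python
# Standalone vLLM usage for comparison; the model name is illustrative.
# The from_pretrained_kwargs in the config above would need to reach
# this constructor for trust_remote_code to take effect.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-7B-Chat", trust_remote_code=True)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```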
Running Qwen1.5-72B-Chat-GPTQ-Int4 is much slower than running Qwen1.5-72B-Chat with the transformers package. Quantized models need to be loaded with auto_gptq. https://github.com/QwenLM/Qwen/blob/main/README_CN.md#%E6%8E%A8%E7%90%86%E6%80%A7%E8%83%BD
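A minimal sketch of loading the GPTQ checkpoint through auto_gptq rather than plain transformers, assuming auto_gptq is installed and CUDA devices are available:

```python
# Sketch: load a GPTQ-quantized checkpoint with auto_gptq instead of
# transformers' AutoModelForCausalLM. Assumes auto_gptq is installed.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```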
```
(PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF pid=42050) [INFO 2024-04-16 09:34:13,880] llamacpp_pipeline.py: 212 generate_kwargs: {'max_tokens': 1024, 'echo': False, 'stop': [''], 'logits_processor': [], 'stopping_criteria': []}
(ServeController pid=41618) ERROR 2024-04-16 09:34:14,246 controller 41618 deployment_state.py:658 - Exception in replica...
```
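To reproduce the call outside Ray Serve, the logged generate_kwargs map onto llama-cpp-python's completion API roughly as follows; the model path and prompt are placeholders:

```python
# Sketch: replay the logged generate_kwargs directly against
# llama-cpp-python, outside Ray Serve. Model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="/path/to/qwen1_5-72b-chat.gguf")
out = llm(
    "Hello!",
    max_tokens=1024,
    echo=False,
    stop=[""],  # the empty-string stop token from the log looks suspicious
)
print(out["choices"][0]["text"])
```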
```
Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Building wheels for collected packages: deepspeed, llama-cpp-python, llm-serve, ffmpy
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.14.0-py3-none-any.whl size=1400347 sha256=db3cabb92e930a4d76b2adf48e2bae802dc28c333d54d790ab2b4256efe03fe0
  Stored...
```
Model inference across multiple nodes
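Since the project already runs on Ray (see the ServeController and PredictionWorker logs above), one common way to span nodes is vLLM's tensor parallelism on top of a Ray cluster; a minimal sketch, assuming the cluster is already up and the GPU count is illustrative:

```python
# Sketch: cross-node inference via vLLM tensor parallelism on a Ray
# cluster. Assumes `ray start` has already joined the worker nodes and
# that 8 GPUs are available across them; all numbers are illustrative.
import ray
from vllm import LLM, SamplingParams

ray.init(address="auto")  # attach to the existing Ray cluster
llm = LLM(model="Qwen/Qwen1.5-72B-Chat", tensor_parallel_size=8)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```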
Support quantized models, for example:
https://huggingface.co/THUDM/chatglm2-6b-int4
https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GPTQ-Int4
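A minimal sketch of loading the int4 ChatGLM2 checkpoint, following the pattern from its model card; assumes a CUDA device, and note that chat() is provided by the model's remote code rather than by transformers itself:

```python
# Sketch: load the int4-quantized ChatGLM2 checkpoint via transformers'
# remote-code path, per the model card's pattern. Assumes a CUDA GPU.
from transformers import AutoModel, AutoTokenizer

model_id = "THUDM/chatglm2-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "Hello", history=[])
print(response)
```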