llm-vscode-inference-server
An endpoint server for efficiently serving quantized open-source LLMs for code.
I saw the README references running on CPU as a goal. Is the project there right now, or is there still work to be done to achieve that? Currently I'm...
- Already posted on https://github.com/vllm-project/vllm/issues/1479
- My GPU is an RTX 3060 with 12GB VRAM
- My target model is [CodeLlama-7B-AWQ](https://huggingface.co/TheBloke/CodeLlama-7B-AWQ), whose size is
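For reference, a minimal sketch of loading an AWQ-quantized CodeLlama model with vLLM on a 12GB card might look like the following. The specific values (`gpu_memory_utilization`, `max_model_len`, the sampling settings) are assumptions chosen to keep the weights plus KV cache within VRAM, not settings taken from the issue.

```python
# Hypothetical sketch: serving CodeLlama-7B-AWQ with vLLM on a 12GB GPU.
# The flag values below are illustrative assumptions; tune max_model_len and
# gpu_memory_utilization to fit the available VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/CodeLlama-7B-AWQ",
    quantization="awq",            # load the AWQ-quantized weights
    dtype="half",                  # fp16 activations
    gpu_memory_utilization=0.90,   # leave a little headroom on a 12GB card
    max_model_len=4096,            # cap the context to bound the KV cache
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["def fibonacci(n):"], params)
print(outputs[0].outputs[0].text)
```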
I keep getting FIM tokens when it responds back. Am I supposed to scrub these directly in the code, or is there some setting that has to be used in...
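If no server-side setting covers it, one hedged approach is to scrub the FIM control tokens from the completion before returning it to the editor. The token list below is an assumption mixing CodeLlama-style (`<PRE>`, `<SUF>`, `<MID>`, `<EOT>`) and StarCoder-style (`<fim_*>`) markers; the actual set depends on the deployed model's tokenizer, and vLLM's `skip_special_tokens` sampling option may already remove any tokens the tokenizer registers as special.

```python
# Hypothetical post-processing sketch: strip FIM special tokens from a
# completion before sending it back to the editor. Trim FIM_TOKENS down to
# whatever markers the deployed model's tokenizer actually emits.
import re

FIM_TOKENS = [
    "<PRE>", "<SUF>", "<MID>", "<EOT>",                      # CodeLlama-style
    "<fim_prefix>", "<fim_suffix>", "<fim_middle>",          # StarCoder-style
    "<|endoftext|>",
]

_FIM_PATTERN = re.compile("|".join(re.escape(tok) for tok in FIM_TOKENS))

def scrub_fim_tokens(completion: str) -> str:
    """Remove FIM control tokens that leaked into the generated text."""
    return _FIM_PATTERN.sub("", completion)

print(scrub_fim_tokens("<MID>return a + b<EOT>"))  # -> "return a + b"
```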
While building a service with Docker, the following error is raised. **output**
```
291.9 RuntimeError:
291.9 The detected CUDA version (12.1) mismatches the version that was used to compile
291.9 PyTorch...
```
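A quick way to confirm the mismatch inside the image is to compare the CUDA version PyTorch was compiled against with the toolkit version `nvcc` reports. The helper below is a diagnostic sketch (it assumes `nvcc` is on the PATH inside the container); the usual remedy is to base the image on a CUDA toolkit matching the installed PyTorch wheel, or to install a wheel built for the image's CUDA.

```python
# Hypothetical diagnostic sketch: compare PyTorch's compile-time CUDA version
# with the CUDA toolkit found in the Docker image. A mismatch here is what
# produces the RuntimeError above when CUDA extensions are compiled.
import re
import subprocess

import torch

def toolkit_cuda_version() -> str:
    """Parse the CUDA toolkit version from `nvcc --version` output."""
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else "unknown"

torch_cuda = torch.version.cuda     # CUDA version the PyTorch wheel was built for
nvcc_cuda = toolkit_cuda_version()  # CUDA toolkit present in the image, e.g. "12.1"

if torch_cuda != nvcc_cuda:
    print(f"Mismatch: PyTorch built for CUDA {torch_cuda}, toolkit is {nvcc_cuda}")
else:
    print(f"OK: both report CUDA {torch_cuda}")
```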