[Model] Deepseek GGUF support
This adds support for the quantized DeepSeek GGUF checkpoints from Unsloth.
Hugging Face currently does not support DeepSeek GGUF, so I added an option to pass an override path from which the correct config can be read.
To run it at the moment, one needs to:
- Download the tokenizer, configuration, and modeling files from the original DeepSeek repo and the config.json from the Unsloth GGUF repo (a download-and-patch sketch follows this list).
- Change `torch_dtype` in the config to `float16`.
- Merge the GGUF weights as instructed in the vLLM docs.
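A minimal sketch of the first two steps, assuming `deepseek-ai/DeepSeek-R1` as the original repo and `unsloth/DeepSeek-R1-GGUF` as the Unsloth GGUF repo (the repo IDs, file patterns, and local path are assumptions; adjust them to your setup). The weight-merging step still follows the vLLM docs and is not shown here:

```python
import json
from huggingface_hub import hf_hub_download, snapshot_download

local_dir = "/YOUR_PATH/DeepSeek_Unsloth"

# Tokenizer, configuration, and modeling files from the original DeepSeek repo
# (assumed repo ID and file patterns).
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1",
    allow_patterns=["tokenizer*", "*.py", "generation_config.json"],
    local_dir=local_dir,
)

# config.json from the Unsloth GGUF repo (assumed repo ID).
config_path = hf_hub_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    filename="config.json",
    local_dir=local_dir,
)

# Change torch_dtype in the config to float16.
with open(config_path) as f:
    config = json.load(f)
config["torch_dtype"] = "float16"
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```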
When initializing the DeepSeek model, we need to pass the paths to our Hugging Face config and tokenizer:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf",
          tokenizer="/YOUR_PATH/DeepSeek_Unsloth",
          hf_config_path="/YOUR_PATH/DeepSeek_Unsloth",
          enforce_eager=True, tensor_parallel_size=8, trust_remote_code=True,
          max_model_len=10000)

sampling_params = SamplingParams(temperature=0.5, max_tokens=2)

def print_outputs(outputs):
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text:\n{generated_text}")
    print("-" * 80)

conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Why did the Roman Empire fall?",
    },
]

outputs = llm.chat(conversation,
                   sampling_params=sampling_params,
                   use_tqdm=False)
print_outputs(outputs)
```
Current issues:
- Model loading is very slow, as the experts are loaded one by one.
- The GGUF MoE path is a very naive implementation and is very slow.
I plan to continue working on the issues above, either in this PR or in follow-ups; I am sharing it already because there seems to be demand for running this.