[Model] Deepseek GGUF support


This adds support for quantized DeepSeek GGUF versions from Unsloth.

Currently Hugging Face does not support DeepSeek, so I added an option to specify an override path from which the correct config can be read.

To run it at the moment, one needs to pass the paths to the Hugging Face config and tokenizer when initializing the DeepSeek model:

    from vllm import LLM, SamplingParams
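    # tokenizer and hf_config_path point at the original Hugging Face repo files,
    # since the correct config cannot currently be read from the GGUF for DeepSeek.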
    llm = LLM(model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf",
              tokenizer="/YOUR_PATH/DeepSeek_Unsloth",
              hf_config_path="/YOUR_PATH/DeepSeek_Unsloth",
              enforce_eager=True, tensor_parallel_size=8, trust_remote_code=True,
              max_model_len=10000)
    sampling_params = SamplingParams(temperature=0.5, max_tokens=2)


    def print_outputs(outputs):
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print(f"Prompt: {prompt!r}, Generated text\n: {generated_text}")
        print("-" * 80)
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Why did the Roman Empire fall?",
        },
    ]
    outputs = llm.chat(conversation,
                       sampling_params=sampling_params,
                       use_tqdm=False)
    print_outputs(outputs)
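
For serving instead of offline inference, the same arguments should carry over to the OpenAI-compatible server. A rough sketch, assuming the new option is exposed on the CLI as `--hf-config-path` like the other engine arguments (the flag name is an assumption on my part):

    vllm serve /YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf \
        --tokenizer /YOUR_PATH/DeepSeek_Unsloth \
        --hf-config-path /YOUR_PATH/DeepSeek_Unsloth \
        --enforce-eager \
        --tensor-parallel-size 8 \
        --trust-remote-code \
        --max-model-len 10000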

Current issues:

- Model loading is very slow, as we load experts one by one
- GGUF MoE is a very naive implementation and is very slow (see the sketch below)
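For context on the MoE point: the current path essentially visits experts one at a time rather than using a fused grouped kernel. A rough, self-contained sketch of that naive pattern (illustrative only, not the actual vLLM code; names, shapes, and the simplified expert MLP are made up for the example):

    import torch

    def naive_moe_forward(hidden, gate, w1, w2, top_k=2):
        # hidden: [num_tokens, hidden_size]
        # gate:   [hidden_size, num_experts] router weights
        # w1[e]:  [hidden_size, ffn_size], w2[e]: [ffn_size, hidden_size]
        probs = torch.softmax(hidden @ gate, dim=-1)
        weights, experts = torch.topk(probs, top_k, dim=-1)
        out = torch.zeros_like(hidden)
        # Naive path: loop over every expert and run its tokens separately,
        # instead of one fused kernel over all experts at once.
        for e in range(gate.shape[1]):
            token_idx, slot = (experts == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            x = hidden[token_idx]
            y = torch.relu(x @ w1[e]) @ w2[e]  # simplified expert MLP
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * y
        return out

A fused implementation would instead group tokens by expert and issue batched GEMMs, which is what is needed to make this path fast.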

I plan to continue working on solving the aforementioned issues, either in this PR or in future ones; sharing already because there seems to be demand for running this.

SzymonOzog, Feb 12 '25 16:02