[Model] Deepseek GGUF support
This adds support for the quantized DeepSeek GGUF checkpoints from Unsloth.
Hugging Face currently does not support DeepSeek GGUF, so I added an option to pass an override path from which the correct config can be read.
To run it at the moment, one needs to:
- Download the tokenizer, configuration, and modeling files from the original DeepSeek repo and the config.json from the Unsloth GGUF repo (a download-and-patch sketch follows this list).
- Change `torch_dtype` in the config to `float16`.
- Merge the GGUF weights as instructed in the vLLM docs.
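A minimal sketch of the first two steps, assuming `deepseek-ai/DeepSeek-R1` as the original repo and `unsloth/DeepSeek-R1-GGUF` as the Unsloth GGUF repo (the repo IDs, file patterns, and local path are assumptions; adjust them to your setup). The weight-merging step still follows the vLLM docs and is not shown here:

```python
import json
from huggingface_hub import hf_hub_download, snapshot_download

local_dir = "/YOUR_PATH/DeepSeek_Unsloth"

# Tokenizer, configuration, and modeling files from the original DeepSeek repo
# (assumed repo ID and file patterns).
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1",
    allow_patterns=["tokenizer*", "*.py", "generation_config.json"],
    local_dir=local_dir,
)

# config.json from the Unsloth GGUF repo (assumed repo ID).
config_path = hf_hub_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    filename="config.json",
    local_dir=local_dir,
)

# Change torch_dtype in the config to float16.
with open(config_path) as f:
    config = json.load(f)
config["torch_dtype"] = "float16"
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```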
When initializing the DeepSeek model, we need to pass the paths to our Hugging Face config and tokenizer:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf",
          tokenizer="/YOUR_PATH/DeepSeek_Unsloth",
          hf_config_path="/YOUR_PATH/DeepSeek_Unsloth",
          enforce_eager=True, tensor_parallel_size=8, trust_remote_code=True,
          max_model_len=10000)

sampling_params = SamplingParams(temperature=0.5, max_tokens=2)

def print_outputs(outputs):
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text:\n{generated_text}")
    print("-" * 80)

conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Why did the Roman Empire fall?",
    },
]

outputs = llm.chat(conversation,
                   sampling_params=sampling_params,
                   use_tqdm=False)
print_outputs(outputs)
```
Current issues:
- Model loading is very slow, as the experts are loaded one by one.
- The GGUF MoE path is a very naive implementation and is very slow.
I plan to continue working on the issues above, either in this PR or in follow-ups; I am sharing it already because there seems to be demand for running this.