
[Model] 1.58bits BitNet Model Support

Open LeiWang1999 opened this issue 1 year ago • 2 comments

This pull request is a follow-up to PR #6036. In this PR, we introduce the BitNet model and provide an efficient inference kernel with the BitBLAS backend. Here are the performance benchmarks:

| Model | Framework | BS16IN32OUT128 | BS1IN512OUT1024 | BS32IN32OUT128 |
| --- | --- | --- | --- | --- |
| BitNet-3B-1.58bits | PyTorch | 106.83 | 49.34 | 209.03 |
| BitNet-3B-1.58bits | PyTorch-BitBLAS | 240.33 | 103.09 | 493.31 |
| BitNet-3B-1.58bits | vLLM-BitBLAS | 379.25 | 117.43 | 752.55 |
| BitNet-3B-1.58bits | vLLM-BitBLAS-CUDA-Graph | 2543.58 | 1621.08 | 2731.79 |

To answer the question raised by @mgoin in PR #6036: I believe a new BitNet model definition is necessary because the open-source BitNet implementation provides its own tokenizer and model architecture, which includes an additional RMS layer compared to LLaMA. Likewise, the BitNet integration with llama.cpp introduces a new model architecture (see llama.cpp PR #7931).
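For context, here is a minimal sketch of where that extra RMS layer sits, based on the public BitNet b1.58 description rather than the exact code in this PR: each linear layer ternarizes its weights to {-1, 0, +1} with an absmean scale and applies an extra RMSNorm to the activations before the matmul.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Plain RMSNorm without a learned scale, for illustration only."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


class BitLinearSketch(nn.Module):
    """Illustrative BitNet-style linear layer; not the PR's implementation."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The additional RMS layer that a plain LLaMA nn.Linear does not have.
        x = rms_norm(x)
        # Absmean ternarization: weights become {-1, 0, +1} times one scalar scale.
        scale = self.weight.abs().mean().clamp(min=1e-5)
        w_ternary = (self.weight / scale).round().clamp(-1, 1)
        return F.linear(x, w_ternary) * scale


layer = BitLinearSketch(16, 8)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 8])
```

In the PR itself, the ternary weights are pre-packed and the matmul is dispatched to the BitBLAS kernels rather than computed with `F.linear`.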

Example Usage:

```python
from conftest import VllmRunner  # test helper from vLLM's tests/ directory

# Test the BitNet model with BitBLAS quantization
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B",
    dtype="half",
    quantization="bitnet_bitblas",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitnet_bitblas:")
    print(bitnet_outputs[0][0])  # generated token IDs
    print(bitnet_outputs[0][1])  # generated text

# Test the pre-quantized BitBLAS checkpoint of the same model
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B_bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=True,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitblas:")
    print(bitnet_outputs[0][0])  # generated token IDs
    print(bitnet_outputs[0][1])  # generated text
```
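For completeness, the same model should also be runnable through vLLM's regular offline-inference API once the PR is applied; a sketch (the "bitnet_bitblas" quantization method only exists with this PR):

```python
from vllm import LLM, SamplingParams

# Sketch of offline inference via vLLM's public API; assumes this PR is applied,
# since it is what registers the "bitnet_bitblas" quantization method.
llm = LLM(
    model="hxbgsyxh/bitnet_b1_58-3B",
    quantization="bitnet_bitblas",
    dtype="half",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Hi, tell me about Microsoft?"], sampling_params)
print(outputs[0].outputs[0].text)
```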
  • [ ] PR #6036 should be merged

LeiWang1999 avatar Aug 21 '24 08:08 LeiWang1999

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

github-actions[bot] avatar Aug 21 '24 08:08 github-actions[bot]

Sorry for the long delay. @mgoin, can you follow up on this and the previous PR?

A quick heads-up: the locations of the model tests were adjusted in #7820, so please merge from main.

DarkLight1337 avatar Sep 13 '24 17:09 DarkLight1337

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LeiWang1999.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 12 '24 21:11 mergify[bot]

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions[bot] avatar Feb 25 '25 02:02 github-actions[bot]

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

github-actions[bot] avatar Mar 27 '25 02:03 github-actions[bot]

Hi, any progress?

I noticed that there is a new model: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T

Model Variants
Several versions of the model weights are available on Hugging Face:

[microsoft/bitnet-b1.58-2B-4T](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T) (This repository): Contains the packed 1.58-bit weights optimized for efficient inference. Use this for deployment.

[microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16): Contains the master weights in BF16 format. Use this only for training or fine-tuning purposes.

[microsoft/bitnet-b1.58-2B-4T-gguf](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): Contains the model weights in GGUF format, compatible with the bitnet.cpp library for CPU inference.

Is it possible to support b1.58 models?
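For reference on the naming: b1.58 weights are ternary ({-1, 0, +1}), i.e. log2(3) ≈ 1.58 bits of information per weight, and the "packed" repository stores several such weights per byte. A toy packing sketch (not the actual layout used by BitBLAS or the Hugging Face checkpoint), using 2 bits per weight so four weights fit in one byte:

```python
import numpy as np


def pack_ternary(w: np.ndarray) -> np.ndarray:
    """Pack a flat array of {-1, 0, +1} int8 weights, four per uint8."""
    assert w.size % 4 == 0
    u = (w + 1).astype(np.uint8).reshape(-1, 4)  # map {-1, 0, +1} -> {0, 1, 2}
    return (u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)).astype(np.uint8)


def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_ternary: recover the {-1, 0, +1} weights."""
    cols = [(packed >> shift) & 0b11 for shift in (0, 2, 4, 6)]
    return np.stack(cols, axis=1).reshape(-1).astype(np.int8) - 1


weights = np.random.randint(-1, 2, size=16, dtype=np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(weights)), weights)
print(pack_ternary(weights))
```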

hbj52152 avatar Apr 19 '25 19:04 hbj52152