[Model] 1.58-bit BitNet Model Support
This pull request is a follow-up to PR #6036. In this PR, we introduce the BitNet model and provide an efficient inference kernel with the BitBLAS backend. Here are the performance benchmarks (BS = batch size, IN = input length, OUT = output length):
| Model | Framework | BS16 IN32 OUT128 | BS1 IN512 OUT1024 | BS32 IN32 OUT128 |
|---|---|---|---|---|
| BitNet-3B-1.58bits | PyTorch | 106.83 | 49.34 | 209.03 |
| BitNet-3B-1.58bits | PyTorch-BitBLAS | 240.33 | 103.09 | 493.31 |
| BitNet-3B-1.58bits | vLLM-BitBLAS | 379.25 | 117.43 | 752.55 |
| BitNet-3B-1.58bits | vLLM-BitBLAS-CUDA-Graph | 2543.58 | 1621.08 | 2731.79 |
To answer the question raised by @mgoin in PR #6036: I believe a new BitNet model definition is necessary because the open-source BitNet implementation uses its own tokenizer and model architecture, which includes an additional RMSNorm layer compared to LLaMA. Additionally, the BitNet integration with llama.cpp also introduces a new model architecture (see llama.cpp PR #7931).
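For readers unfamiliar with the architectural difference, here is a minimal sketch of where the extra RMSNorm sits in the FFN block relative to LLaMA. This is illustrative only: the module names (`BitNetMLPSketch`, `ffn_sub_norm`) are made up for the example, and the exact placement in the real checkpoint may differ, so please refer to the open-source BitNet implementation for the authoritative definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Plain RMSNorm, same formulation LLaMA uses."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


class BitNetMLPSketch(nn.Module):
    """LLaMA-style gated MLP plus the extra normalization BitNet inserts.

    In LLaMA the return line would simply be ``down_proj(h)``; here the
    intermediate activations are normalized first. Module names are
    illustrative, not the real ones from the BitNet implementation.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.ffn_sub_norm = RMSNorm(intermediate_size)  # the additional layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.gate_proj(x)) * self.up_proj(x)
        return self.down_proj(self.ffn_sub_norm(h))
```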
Example Usage:
```python
from conftest import VllmRunner

# Test the BitNet model with BitBLAS quantization.
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B",
    dtype="half",
    quantization="bitnet_bitblas",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitnet_bitblas:")
    print(bitnet_outputs[0][0])
    print(bitnet_outputs[0][1])

# Test a pre-quantized BitBLAS checkpoint.
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B_bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=True,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitblas:")
    print(bitnet_outputs[0][0])
    print(bitnet_outputs[0][1])
```
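Outside of the test harness, the same checkpoint should also be usable through the regular `vllm.LLM` entry point once this PR lands. A minimal sketch follows; the model path and `quantization` value are copied from the test above, and the rest is standard vLLM API usage rather than anything specific to this PR:

```python
from vllm import LLM, SamplingParams

# Same checkpoint and quantization method as in the test above.
llm = LLM(
    model="hxbgsyxh/bitnet_b1_58-3B",
    quantization="bitnet_bitblas",
    dtype="half",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(
    ["Hi, tell me about Microsoft?"],
    SamplingParams(temperature=0.0, max_tokens=128),  # greedy decoding
)
print(outputs[0].outputs[0].text)
```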
- [ ] PR #6036 should be merged first
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
Sorry for the long delay. @mgoin, can you follow up on this and the previous PR?
A quick heads-up: the locations of the model tests were adjusted in #7820, so please merge from main.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LeiWang1999.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
Hi, any progress?
I noticed that there is a new model: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T
Model Variants
Several versions of the model weights are available on Hugging Face:
- [microsoft/bitnet-b1.58-2B-4T](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T) (this repository): contains the packed 1.58-bit weights optimized for efficient inference. Use this for deployment.
- [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16): contains the master weights in BF16 format. Use this only for training or fine-tuning purposes.
- [microsoft/bitnet-b1.58-2B-4T-gguf](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): contains the model weights in GGUF format, compatible with the bitnet.cpp library for CPU inference.
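For what it's worth, fetching the packed deployment variant locally is just a standard `huggingface_hub` call; whether vLLM can then load it is exactly the open question in this thread, so this sketch only covers the download step:

```python
from huggingface_hub import snapshot_download

# Fetch the packed 1.58-bit weights (the deployment variant listed above).
local_dir = snapshot_download(repo_id="microsoft/bitnet-b1.58-2B-4T")
print("Downloaded to:", local_dir)
```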
Is it possible to support b1.58 models?