[Model] 1.58-bit BitNet Model Support
This pull request is a follow-up to PR #6036. In this PR, we introduce the BitNet model and provide an efficient inference kernel with the BitBLAS backend. Here are the performance benchmarks (BS = batch size, IN = input length, OUT = output length):
| Model | Framework | BS16 IN32 OUT128 | BS1 IN512 OUT1024 | BS32 IN32 OUT128 |
|---|---|---|---|---|
| BitNet-3B-1.58bits | PyTorch | 106.83 | 49.34 | 209.03 |
| BitNet-3B-1.58bits | PyTorch-BitBLAS | 240.33 | 103.09 | 493.31 |
| BitNet-3B-1.58bits | vLLM-BitBLAS | 379.25 | 117.43 | 752.55 |
| BitNet-3B-1.58bits | vLLM-BitBLAS-CUDA-Graph | 2543.58 | 1621.08 | 2731.79 |
To answer the question raised by @mgoin in PR #6036: I believe a new BitNet model definition is necessary because the open-source BitNet implementation uses its own tokenizer and model architecture, which includes an additional RMSNorm layer compared to LLaMA. Additionally, the BitNet integration with llama.cpp also introduces a new model architecture (see llama.cpp PR #7931).
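For readers unfamiliar with the architectural difference, here is a minimal sketch of where the extra RMSNorm sits in the FFN block relative to LLaMA. This is illustrative only: the module names (`BitNetMLPSketch`, `ffn_sub_norm`) are made up for the example, and the exact placement in the real checkpoint may differ, so please refer to the open-source BitNet implementation for the authoritative definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Plain RMSNorm, same formulation LLaMA uses."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


class BitNetMLPSketch(nn.Module):
    """LLaMA-style gated MLP plus the extra normalization BitNet inserts.

    In LLaMA the return line would simply be ``down_proj(h)``; here the
    intermediate activations are normalized first. Module names are
    illustrative, not the real ones from the BitNet implementation.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.ffn_sub_norm = RMSNorm(intermediate_size)  # the additional layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.gate_proj(x)) * self.up_proj(x)
        return self.down_proj(self.ffn_sub_norm(h))
```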
Example Usage:
```python
from conftest import VllmRunner

# Test the BitNet model with BitBLAS quantization.
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B",
    dtype="half",
    quantization="bitnet_bitblas",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitnet_bitblas:")
    print(bitnet_outputs[0][0])
    print(bitnet_outputs[0][1])

# Test a pre-quantized BitBLAS checkpoint.
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B_bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=True,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitblas:")
    print(bitnet_outputs[0][0])
    print(bitnet_outputs[0][1])
```
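Outside of the test harness, the same checkpoint should also be usable through the regular `vllm.LLM` entry point once this PR lands. A minimal sketch follows; the model path and `quantization` value are copied from the test above, and the rest is standard vLLM API usage rather than anything specific to this PR:

```python
from vllm import LLM, SamplingParams

# Same checkpoint and quantization method as in the test above.
llm = LLM(
    model="hxbgsyxh/bitnet_b1_58-3B",
    quantization="bitnet_bitblas",
    dtype="half",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(
    ["Hi, tell me about Microsoft?"],
    SamplingParams(temperature=0.0, max_tokens=128),  # greedy decoding
)
print(outputs[0].outputs[0].text)
```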
- [ ] PR #6036 should be merged first
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
Sorry for the long delay. @mgoin, can you follow up on this and the previous PR?
A quick heads-up: the locations of the model tests were adjusted in #7820, so please merge from main.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LeiWang1999.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
Hi, any progress?
I noticed that there is a new model: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T
Model Variants
Several versions of the model weights are available on Hugging Face:
- [microsoft/bitnet-b1.58-2B-4T](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T) (this repository): contains the packed 1.58-bit weights optimized for efficient inference. Use this for deployment.
- [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16): contains the master weights in BF16 format. Use this only for training or fine-tuning purposes.
- [microsoft/bitnet-b1.58-2B-4T-gguf](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): contains the model weights in GGUF format, compatible with the bitnet.cpp library for CPU inference.
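For what it's worth, fetching the packed deployment variant locally is just a standard `huggingface_hub` call; whether vLLM can then load it is exactly the open question in this thread, so this sketch only covers the download step:

```python
from huggingface_hub import snapshot_download

# Fetch the packed 1.58-bit weights (the deployment variant listed above).
local_dir = snapshot_download(repo_id="microsoft/bitnet-b1.58-2B-4T")
print("Downloaded to:", local_dir)
```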
Is it possible to support b1.58 models?