Add Qwen3 model support
Add Qwen3 Model Family Support
Summary
This PR adds comprehensive support for the Qwen3 model family from Alibaba Cloud, including text generation, embedding, and reranking models. Qwen3 is a state-of-the-art multilingual language model family with features such as QK normalization and context lengths of up to 262K tokens.
What's New
- Qwen3 Text Generation Models
Architectures:
- :base - Base Qwen3 model
- :for_causal_language_modeling - Text generation
- :for_sequence_classification - Classification tasks
- :for_embedding - Text embeddings (new)
Key Features:
- QK Normalization: RMS normalization on query and key projections for improved training stability (Qwen3-specific innovation)
- Grouped Query Attention (GQA) for efficient inference (e.g. 32 query heads with 8 key-value heads in Qwen3-4B)
- Extended Context: Supports up to 262,144 tokens
- High RoPE Theta: 5,000,000 base frequency (vs typical 10,000) for better long-context performance
- Large Vocabulary: 151,936 tokens for multilingual support
- Gated FFN: SwiGLU activation
- Qwen3-Embedding Support
- Last Token Pooling: Added :last_token_pooling option to Bumblebee.Text.text_embedding/3 (see the sketch after this list)
- Instruction-Aware: Supports custom task instructions (improves performance by 1-5% per Qwen team)
- Multilingual: Over 100 languages supported
- Flexible Dimensions: 1024-dim (0.6B), 2560-dim (4B), 4096-dim (8B)
- Qwen3-Reranker Support
- Document Reranking: Score query-document pairs for relevance (0-1 range)
- Custom Instructions: Task-specific prompts for better performance
- High Accuracy: Relevant docs score 0.99+, irrelevant docs score near 0.0
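A minimal usage sketch of the new pooling option, assuming :last_token_pooling is passed as a value of the serving's existing :output_pool option on top of the :base architecture's :hidden_state output; the exact option names are worth double-checking against the updated text_embedding/3 docs:

```elixir
# Hedged sketch: last-token pooling on top of the :base architecture output.
# The option combination below is an assumption, not the PR's documented API.
{:ok, model_info} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Embedding-0.6B"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})

serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    output_attribute: :hidden_state,
    output_pool: :last_token_pooling,
    embedding_processor: :l2_norm
  )

Nx.Serving.run(serving, "hello!")
```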
Files Changed
Core Implementation:
- lib/bumblebee/text/qwen3.ex (730 lines) - Full Qwen3 model implementation
- lib/bumblebee.ex - Model and tokenizer registrations
- lib/bumblebee/text/text_embedding.ex - Added last token pooling
Examples:
- examples/README.md - Example documentation
- examples/qwen3.exs - Text generation example
- examples/qwen3_embedding.exs - Embedding generation
- examples/qwen3_embedding_prompts.exs - Instruction-aware embeddings
- examples/qwen3_reranker.exs - Document reranking
Documentation:
- QWEN3_IEX_GUIDE.md - Interactive IEx usage guide
- .gitignore - Added .lexical/
Testing
Text Generation (Qwen3-4B-Instruct)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-4B-Instruct-2507"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-4B-Instruct-2507"})
{:ok, config} = Bumblebee.load_generation_config({:hf, "Qwen/Qwen3-4B-Instruct-2507"})

serving = Bumblebee.Text.generation(model, tokenizer, config)
Nx.Serving.run(serving, "The future of AI")
Results: Generates coherent English text, answers questions correctly, creates stories and code.
Text Embeddings (Qwen3-Embedding-0.6B)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Embedding-0.6B"}, architecture: :for_embedding)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})

serving = Bumblebee.Text.text_embedding(model, tokenizer, output_attribute: :embedding, embedding_processor: :l2_norm)

e1 = Nx.Serving.run(serving, "The cat sat on the mat")
e2 = Nx.Serving.run(serving, "A feline rested on the rug")
Nx.dot(e1.embedding, e2.embedding) |> Nx.to_number() # 0.73 (similar)
Results:
- Generates 1024-dim normalized vectors
- Semantic similarity: Similar texts = 0.72, different texts = 0.34
- Instruction prompts improve relevance by ~5%
Reranking (Qwen3-Reranker-0.6B)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Reranker-0.6B"})
Scores query-document pairs for relevance: relevant documents score 0.99+, irrelevant ones near 0.0.
Results: Correctly ranks documents by relevance to queries.
Compatible Models
Text Generation:
- Qwen/Qwen3-0.6B through Qwen/Qwen3-32B
- Qwen/Qwen3-4B-Instruct-2507 (recommended)
Embeddings:
- Qwen/Qwen3-Embedding-0.6B (1024-dim)
- Qwen/Qwen3-Embedding-4B (2560-dim)
- Qwen/Qwen3-Embedding-8B (4096-dim)
Reranking:
- Qwen/Qwen3-Reranker-0.6B
- Qwen/Qwen3-Reranker-4B
- Qwen/Qwen3-Reranker-8B
Technical Implementation
QK Normalization
Unlike standard transformers, Qwen3 applies RMS normalization to query and key states: hidden -> dense -> split_heads -> rms_norm -> rotary -> attention
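For illustration only (this is not the PR's Axon layer code), the QK normalization step amounts to an RMS norm over the head dimension of the query and key states; a minimal Nx sketch, with the epsilon value and tensor shapes as assumptions:

```elixir
defmodule QKNormSketch do
  import Nx.Defn

  # RMS-normalize over the head dimension (last axis), as applied to the
  # query/key states after splitting heads and before rotary embedding.
  # `weight` is the learned per-head-dim scale; eps is an assumed default.
  defn rms_norm(states, weight, opts \\ []) do
    opts = keyword!(opts, eps: 1.0e-6)
    variance = states |> Nx.pow(2) |> Nx.mean(axes: [-1], keep_axes: true)
    states * Nx.rsqrt(variance + opts[:eps]) * weight
  end
end

# states: {batch, num_heads, seq_len, head_dim}, weight: {head_dim}
# query = QKNormSketch.rms_norm(query, query_norm_weight)
# key = QKNormSketch.rms_norm(key, key_norm_weight)
```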
Architecture Support
Custom decoder blocks implement QK normalization while maintaining compatibility with Bumblebee's transformer patterns.
Embedding Architecture
New :for_embedding architecture automatically pools the last non-padding token for text embedding tasks.
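Conceptually, the pooling selects, for each sequence, the hidden state at the index of its last attended token. A minimal Nx sketch, assuming hidden_state with shape {batch, seq_len, hidden_size} and an integer attention_mask with shape {batch, seq_len} are already available (variable names are illustrative, not the actual implementation):

```elixir
batch = Nx.axis_size(hidden_state, 0)
hidden_size = Nx.axis_size(hidden_state, 2)

last_indices =
  attention_mask
  |> Nx.sum(axes: [1])              # number of non-padding tokens per sequence
  |> Nx.subtract(1)                 # index of the last non-padding token
  |> Nx.reshape({batch, 1, 1})
  |> Nx.broadcast({batch, 1, hidden_size})

embedding =
  hidden_state
  |> Nx.take_along_axis(last_indices, axis: 1)
  |> Nx.squeeze(axes: [1])          # {batch, hidden_size}
```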
Reranking
Uses the causal LM architecture with yes/no token logit extraction and softmax scoring.
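A hedged end-to-end sketch of that scoring path is below; the prompt template follows the Qwen3-Reranker model card, and the "yes"/"no" token lookup is an assumption worth verifying against examples/qwen3_reranker.exs:

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Reranker-0.6B"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Reranker-0.6B"})

instruction = "Given a web search query, retrieve relevant passages that answer the query"
query = "What is the capital of China?"
document = "The capital of China is Beijing."

# Prompt format as shown on the Qwen3-Reranker model card (verify against the
# example script); the model is asked to answer "yes" or "no".
prompt = """
<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: #{instruction}
<Query>: #{query}
<Document>: #{document}<|im_end|>
<|im_start|>assistant
<think>

</think>

"""

inputs = Bumblebee.apply_tokenizer(tokenizer, prompt)
outputs = Axon.predict(model_info.model, model_info.params, inputs)

# Logits at the last position, restricted to the "yes"/"no" tokens
seq_len = Nx.axis_size(outputs.logits, 1)
last_logits =
  outputs.logits
  |> Nx.slice_along_axis(seq_len - 1, 1, axis: 1)
  |> Nx.squeeze(axes: [1])

yes_id = Bumblebee.Tokenizer.token_to_id(tokenizer, "yes")
no_id = Bumblebee.Tokenizer.token_to_id(tokenizer, "no")

# Softmax over {no, yes}; the probability of "yes" is the relevance score
pair = Nx.stack([last_logits[[0, no_id]], last_logits[[0, yes_id]]])
score = pair |> Nx.exp() |> then(&Nx.divide(&1, Nx.sum(&1))) |> Nx.to_flat_list() |> List.last()
```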
Breaking Changes
None. This is purely additive.
References
- https://qwenlm.github.io/blog/qwen3/
- https://qwenlm.github.io/blog/qwen3-embedding/
- https://huggingface.co/collections/Qwen/qwen3-66850ac008e23f2e87b68084
- https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/qwen3/modeling_qwen3.py
I will test it tomorrow with my H200 to be sure that everything is working. On my MacBook the answers seem OK, but generation is slow. My end goal is to add support for the embeddings and rerankers from Qwen. Comments are really welcome; I generated most of it with Sonnet 4.5.
I was interested in getting a Qwen3 vision model working, like https://huggingface.co/huihui-ai/Huihui-MiniCPM-V-4_5-abliterated
Generation looking good!
iex(16)> prompt = """
...(16)> <|im_start|>system
...(16)> You are a helpful assistant.<|im_end|>
...(16)> <|im_start|>user
...(16)> What is the capital of France?<|im_end|>
...(16)> <|im_start|>assistant
...(16)> """
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"
iex(17)>
nil
iex(18)> result = Nx.Serving.run(serving, prompt)
%{
results: [
%{
text: "The capital of France is Paris.",
token_summary: %{input: 26, output: 8, padding: 0}
}
]
}
Still more tests to do and write!
@jonatanklosko I used a light model with the Qwen3 architecture to write some basic tests, similar to the other PR. Let me know if this is enough.
Sorry, work caught up with me; I will continue the PR this weekend.
@jonatanklosko I managed to get some time. I compared the Elixir implementation with Python's transformers:
Test Environment
- Python: transformers 4.57.1, torch 2.9.0 (bf16)
- Elixir: Bumblebee (local), Nx 0.10.0, EXLA 0.10.0 (bf16)
- Model: Qwen/Qwen3-Embedding-0.6B
- Platform: macOS ARM64
Test 1: Basic Text ("hello!")
| Metric | Python (transformers) | Elixir (Bumblebee) | Difference |
|---|---|---|---|
| Norm | 0.9961 | 0.9998 | 0.0037 |
| Cosine Similarity | – | – | 0.9998 |
| Mean Abs Diff | – | – | 0.00053 |
| Max Abs Diff | – | – | 0.0027 |
First 10 embedding values
Index Python Elixir Abs Diff
0 0.0004043579 0.0005552031 0.0001508
1 -0.0277099609 -0.0279187821 0.0002088
2 -0.0111694336 -0.0111040613 0.0000654
3 -0.0184326172 -0.0174492393 0.0009834
4 -0.0209960938 -0.0209390856 0.0000570
5 0.0031738281 0.0026967004 0.0004771
6 -0.0356445312 -0.0361675136 0.0005230
7 0.0869140625 0.0869289339 0.0000149
8 -0.0446777344 -0.0447335020 0.0000558
9 -0.0195312500 -0.0200666245 0.0005354
Test 2: Query with Instruction Format
Text: "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:hello!"
| Metric | Python (transformers) | Elixir (Bumblebee) | Difference |
|---|---|---|---|
| Norm | 1.0000 | 1.0042 | 0.0042 |
| Cosine Similarity | – | – | 0.9992 |
| Mean Abs Diff | – | – | 0.00096 |
| Max Abs Diff | – | – | 0.0051 |
First 10 embedding values
Index Python Elixir Abs Diff
0 0.0043945312 0.0041164518 0.0002781
1 -0.0073242188 -0.0055703474 0.0017539
2 -0.0057067871 -0.0059557175 0.0002489
3 -0.0380859375 -0.0409192815 0.0028333
4 -0.0319824219 -0.0322309434 0.0002485
5 -0.0294189453 -0.0298486538 0.0004297
6 -0.0546875000 -0.0546524674 0.0000350
7 0.0268554688 0.0284473095 0.0015918
8 -0.0693359375 -0.0681053847 0.0012306
9 -0.0324707031 -0.0281670410 0.0043037
Do you think the small difference is due to bf16 precision and/or differences in floating-point handling between the two implementations?
# Load model
{:ok, model_info} = Bumblebee.load_model(
{:hf, "Qwen/Qwen3-Embedding-0.6B"},
type: :bf16,
architecture: :for_embedding
)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})
# Create serving
serving = Bumblebee.Text.TextEmbedding.text_embedding(
model_info,
tokenizer,
output_attribute: :embedding,
embedding_processor: :l2_norm
)
# For documents (no instruction)
document = "The capital of China is Beijing."
%{embedding: doc_emb} = Nx.Serving.run(serving, document)
# For queries (NO space after Query:)
task = "Given a web search query, retrieve relevant passages that answer the query"
query = "Instruct: #{task}\nQuery:What is the capital of China?"
%{embedding: query_emb} = Nx.Serving.run(serving, query)
# Compute similarity
similarity = Nx.dot(query_emb, doc_emb)
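
For reference, the metrics in the tables above can be computed with something like the following; python_values is a placeholder for the reference embedding exported from transformers (e.g. dumped to JSON), and query_emb comes from the serving above:

```elixir
# Compare the Elixir embedding against the exported Python reference
python_emb = Nx.tensor(python_values, type: :f32)
elixir_emb = Nx.as_type(query_emb, :f32)

norm = elixir_emb |> Nx.LinAlg.norm() |> Nx.to_number()

cosine =
  Nx.dot(python_emb, elixir_emb)
  |> Nx.divide(Nx.multiply(Nx.LinAlg.norm(python_emb), Nx.LinAlg.norm(elixir_emb)))
  |> Nx.to_number()

abs_diff = Nx.abs(Nx.subtract(python_emb, elixir_emb))
mean_abs_diff = abs_diff |> Nx.mean() |> Nx.to_number()
max_abs_diff = abs_diff |> Nx.reduce_max() |> Nx.to_number()
```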
@jonatanklosko I finally found some time again! I feel I addressed all the comments. Sorry for this large PR.