Add Qwen3 model support
Add Qwen3 Model Family Support
Summary
This PR adds comprehensive support for the Qwen3 model family from Alibaba Cloud, including text generation, embedding, and reranking models. Qwen3 is a state-of-the-art multilingual language model family with features such as QK normalization and context lengths of up to 262K tokens.
What's New
- Qwen3 Text Generation Models
Architectures:
- :base - Base Qwen3 model
- :for_causal_language_modeling - Text generation
- :for_sequence_classification - Classification tasks
- :for_embedding - Text embeddings (new)
Key Features:
- QK Normalization: RMS normalization on query and key projections for improved training stability (Qwen3-specific innovation)
- Grouped Query Attention (GQA) for efficient inference (e.g. 32 query heads with 8 key-value heads in Qwen3-4B)
- Extended Context: Supports up to 262,144 tokens
- High RoPE Theta: 5,000,000 base frequency (vs typical 10,000) for better long-context performance
- Large Vocabulary: 151,936 tokens for multilingual support
- Gated FFN: SwiGLU activation
- Qwen3-Embedding Support
- Last Token Pooling: Added :last_token_pooling option to Bumblebee.Text.text_embedding/3 (see the sketch after this list)
- Instruction-Aware: Supports custom task instructions (improves performance by 1-5% per Qwen team)
- Multilingual: Over 100 languages supported
- Flexible Dimensions: 1024-dim (0.6B), 2560-dim (4B), 4096-dim (8B)
- Qwen3-Reranker Support
- Document Reranking: Score query-document pairs for relevance (0-1 range)
- Custom Instructions: Task-specific prompts for better performance
- High Accuracy: Relevant docs score 0.99+, irrelevant docs score near 0.0
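A minimal usage sketch of the new pooling option, assuming :last_token_pooling is passed as a value of the serving's existing :output_pool option on top of the :base architecture's :hidden_state output; the exact option names are worth double-checking against the updated text_embedding/3 docs:

```elixir
# Hedged sketch: last-token pooling on top of the :base architecture output.
# The option combination below is an assumption, not the PR's documented API.
{:ok, model_info} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Embedding-0.6B"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})

serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    output_attribute: :hidden_state,
    output_pool: :last_token_pooling,
    embedding_processor: :l2_norm
  )

Nx.Serving.run(serving, "hello!")
```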
Files Changed
Core Implementation:
- lib/bumblebee/text/qwen3.ex (730 lines) - Full Qwen3 model implementation
- lib/bumblebee.ex - Model and tokenizer registrations
- lib/bumblebee/text/text_embedding.ex - Added last token pooling
Examples:
- examples/README.md - Example documentation
- examples/qwen3.exs - Text generation example
- examples/qwen3_embedding.exs - Embedding generation
- examples/qwen3_embedding_prompts.exs - Instruction-aware embeddings
- examples/qwen3_reranker.exs - Document reranking
Documentation:
- QWEN3_IEX_GUIDE.md - Interactive IEx usage guide
- .gitignore - Added .lexical/
Testing
Text Generation (Qwen3-4B-Instruct)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-4B-Instruct-2507"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-4B-Instruct-2507"})
{:ok, config} = Bumblebee.load_generation_config({:hf, "Qwen/Qwen3-4B-Instruct-2507"})

serving = Bumblebee.Text.generation(model, tokenizer, config)
Nx.Serving.run(serving, "The future of AI")
Results: Generates coherent English text, answers questions correctly, creates stories and code.
Text Embeddings (Qwen3-Embedding-0.6B)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Embedding-0.6B"}, architecture: :for_embedding)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})

serving = Bumblebee.Text.text_embedding(model, tokenizer, output_attribute: :embedding, embedding_processor: :l2_norm)

e1 = Nx.Serving.run(serving, "The cat sat on the mat")
e2 = Nx.Serving.run(serving, "A feline rested on the rug")
Nx.dot(e1.embedding, e2.embedding) |> Nx.to_number() # 0.73 (similar)
Results:
- Generates 1024-dim normalized vectors
- Semantic similarity: Similar texts = 0.72, different texts = 0.34
- Instruction prompts improve relevance by ~5%
Reranking (Qwen3-Reranker-0.6B)
{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Reranker-0.6B"})
Scores query-document pairs for relevance: relevant documents score 0.99+, irrelevant ones near 0.0.
Results: Correctly ranks documents by relevance to queries.
Compatible Models
Text Generation:
- Qwen/Qwen3-0.6B through Qwen/Qwen3-32B
- Qwen/Qwen3-4B-Instruct-2507 (recommended)
Embeddings:
- Qwen/Qwen3-Embedding-0.6B (1024-dim)
- Qwen/Qwen3-Embedding-4B (2560-dim)
- Qwen/Qwen3-Embedding-8B (4096-dim)
Reranking:
- Qwen/Qwen3-Reranker-0.6B
- Qwen/Qwen3-Reranker-4B
- Qwen/Qwen3-Reranker-8B
Technical Implementation
QK Normalization
Unlike standard transformers, Qwen3 applies RMS normalization to query and key states: hidden -> dense -> split_heads -> rms_norm -> rotary -> attention
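For illustration only (this is not the PR's Axon layer code), the QK normalization step amounts to an RMS norm over the head dimension of the query and key states; a minimal Nx sketch, with the epsilon value and tensor shapes as assumptions:

```elixir
defmodule QKNormSketch do
  import Nx.Defn

  # RMS-normalize over the head dimension (last axis), as applied to the
  # query/key states after splitting heads and before rotary embedding.
  # `weight` is the learned per-head-dim scale; eps is an assumed default.
  defn rms_norm(states, weight, opts \\ []) do
    opts = keyword!(opts, eps: 1.0e-6)
    variance = states |> Nx.pow(2) |> Nx.mean(axes: [-1], keep_axes: true)
    states * Nx.rsqrt(variance + opts[:eps]) * weight
  end
end

# states: {batch, num_heads, seq_len, head_dim}, weight: {head_dim}
# query = QKNormSketch.rms_norm(query, query_norm_weight)
# key = QKNormSketch.rms_norm(key, key_norm_weight)
```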
Architecture Support
Custom decoder blocks implement QK normalization while maintaining compatibility with Bumblebee's transformer patterns.
Embedding Architecture
New :for_embedding architecture automatically pools the last non-padding token for text embedding tasks.
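Conceptually, the pooling selects, for each sequence, the hidden state at the index of its last attended token. A minimal Nx sketch, assuming hidden_state with shape {batch, seq_len, hidden_size} and an integer attention_mask with shape {batch, seq_len} are already available (variable names are illustrative, not the actual implementation):

```elixir
batch = Nx.axis_size(hidden_state, 0)
hidden_size = Nx.axis_size(hidden_state, 2)

last_indices =
  attention_mask
  |> Nx.sum(axes: [1])              # number of non-padding tokens per sequence
  |> Nx.subtract(1)                 # index of the last non-padding token
  |> Nx.reshape({batch, 1, 1})
  |> Nx.broadcast({batch, 1, hidden_size})

embedding =
  hidden_state
  |> Nx.take_along_axis(last_indices, axis: 1)
  |> Nx.squeeze(axes: [1])          # {batch, hidden_size}
```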
Reranking
Uses the causal LM architecture with yes/no token logit extraction and softmax scoring.
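A hedged end-to-end sketch of that scoring path is below; the prompt template follows the Qwen3-Reranker model card, and the "yes"/"no" token lookup is an assumption worth verifying against examples/qwen3_reranker.exs:

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Reranker-0.6B"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Reranker-0.6B"})

instruction = "Given a web search query, retrieve relevant passages that answer the query"
query = "What is the capital of China?"
document = "The capital of China is Beijing."

# Prompt format as shown on the Qwen3-Reranker model card (verify against the
# example script); the model is asked to answer "yes" or "no".
prompt = """
<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: #{instruction}
<Query>: #{query}
<Document>: #{document}<|im_end|>
<|im_start|>assistant
<think>

</think>

"""

inputs = Bumblebee.apply_tokenizer(tokenizer, prompt)
outputs = Axon.predict(model_info.model, model_info.params, inputs)

# Logits at the last position, restricted to the "yes"/"no" tokens
seq_len = Nx.axis_size(outputs.logits, 1)
last_logits =
  outputs.logits
  |> Nx.slice_along_axis(seq_len - 1, 1, axis: 1)
  |> Nx.squeeze(axes: [1])

yes_id = Bumblebee.Tokenizer.token_to_id(tokenizer, "yes")
no_id = Bumblebee.Tokenizer.token_to_id(tokenizer, "no")

# Softmax over {no, yes}; the probability of "yes" is the relevance score
pair = Nx.stack([last_logits[[0, no_id]], last_logits[[0, yes_id]]])
score = pair |> Nx.exp() |> then(&Nx.divide(&1, Nx.sum(&1))) |> Nx.to_flat_list() |> List.last()
```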
Breaking Changes
None. This is purely additive.
References
- https://qwenlm.github.io/blog/qwen3/
- https://qwenlm.github.io/blog/qwen3-embedding/
- https://huggingface.co/collections/Qwen/qwen3-66850ac008e23f2e87b68084
- https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/qwen3/modeling_qwen3.py
I will test it tomorrow with my H200 to be sure that everything is working. On my MacBook the answers seem OK, but generation is slow. My end goal is to add support for the embeddings and rerankers from Qwen. Comments are really welcome; I generated most of it with Sonnet 4.5.
I was interested in getting a Qwen3 vision model working, like https://huggingface.co/huihui-ai/Huihui-MiniCPM-V-4_5-abliterated
Generation looking good!
iex(16)> prompt = """
...(16)> <|im_start|>system
...(16)> You are a helpful assistant.<|im_end|>
...(16)> <|im_start|>user
...(16)> What is the capital of France?<|im_end|>
...(16)> <|im_start|>assistant
...(16)> """
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"
iex(17)>
nil
iex(18)> result = Nx.Serving.run(serving, prompt)
%{
results: [
%{
text: "The capital of France is Paris.",
token_summary: %{input: 26, output: 8, padding: 0}
}
]
}
Still more tests to do and write!
@jonatanklosko I used a light model with the Qwen3 architecture to write some basic tests, similar to the other PR. Let me know if this is enough.
Sorry, work caught up with me; I will continue the PR this weekend.
@jonatanklosko I managed to get some time. I compared the Elixir implementation with Python's transformers:
Test Environment
- Python: transformers 4.57.1, torch 2.9.0 (bf16)
- Elixir: Bumblebee (local), Nx 0.10.0, EXLA 0.10.0 (bf16)
- Model: Qwen/Qwen3-Embedding-0.6B
- Platform: macOS ARM64
Test 1: Basic Text ("hello!")
| Metric | Python (transformers) | Elixir (Bumblebee) | Difference |
|---|---|---|---|
| Norm | 0.9961 | 0.9998 | 0.0037 |
| Cosine Similarity | – | – | 0.9998 |
| Mean Abs Diff | – | – | 0.00053 |
| Max Abs Diff | – | – | 0.0027 |
First 10 embedding values
Index Python Elixir Abs Diff
0 0.0004043579 0.0005552031 0.0001508
1 -0.0277099609 -0.0279187821 0.0002088
2 -0.0111694336 -0.0111040613 0.0000654
3 -0.0184326172 -0.0174492393 0.0009834
4 -0.0209960938 -0.0209390856 0.0000570
5 0.0031738281 0.0026967004 0.0004771
6 -0.0356445312 -0.0361675136 0.0005230
7 0.0869140625 0.0869289339 0.0000149
8 -0.0446777344 -0.0447335020 0.0000558
9 -0.0195312500 -0.0200666245 0.0005354
Test 2: Query with Instruction Format
Text: "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:hello!"
| Metric | Python (transformers) | Elixir (Bumblebee) | Difference |
|---|---|---|---|
| Norm | 1.0000 | 1.0042 | 0.0042 |
| Cosine Similarity | – | – | 0.9992 |
| Mean Abs Diff | – | – | 0.00096 |
| Max Abs Diff | – | – | 0.0051 |
First 10 embedding values
Index Python Elixir Abs Diff
0 0.0043945312 0.0041164518 0.0002781
1 -0.0073242188 -0.0055703474 0.0017539
2 -0.0057067871 -0.0059557175 0.0002489
3 -0.0380859375 -0.0409192815 0.0028333
4 -0.0319824219 -0.0322309434 0.0002485
5 -0.0294189453 -0.0298486538 0.0004297
6 -0.0546875000 -0.0546524674 0.0000350
7 0.0268554688 0.0284473095 0.0015918
8 -0.0693359375 -0.0681053847 0.0012306
9 -0.0324707031 -0.0281670410 0.0043037
Do you think the small difference is due to bf16 precision and/or differences in floating-point handling between the two implementations?
# Load model
{:ok, model_info} = Bumblebee.load_model(
{:hf, "Qwen/Qwen3-Embedding-0.6B"},
type: :bf16,
architecture: :for_embedding
)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})
# Create serving
serving = Bumblebee.Text.TextEmbedding.text_embedding(
model_info,
tokenizer,
output_attribute: :embedding,
embedding_processor: :l2_norm
)
# For documents (no instruction)
document = "The capital of China is Beijing."
%{embedding: doc_emb} = Nx.Serving.run(serving, document)
# For queries (NO space after Query:)
task = "Given a web search query, retrieve relevant passages that answer the query"
query = "Instruct: #{task}\nQuery:What is the capital of China?"
%{embedding: query_emb} = Nx.Serving.run(serving, query)
# Compute similarity
similarity = Nx.dot(query_emb, doc_emb)
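
For reference, the metrics in the tables above can be computed with something like the following; python_values is a placeholder for the reference embedding exported from transformers (e.g. dumped to JSON), and query_emb comes from the serving above:

```elixir
# Compare the Elixir embedding against the exported Python reference
python_emb = Nx.tensor(python_values, type: :f32)
elixir_emb = Nx.as_type(query_emb, :f32)

norm = elixir_emb |> Nx.LinAlg.norm() |> Nx.to_number()

cosine =
  Nx.dot(python_emb, elixir_emb)
  |> Nx.divide(Nx.multiply(Nx.LinAlg.norm(python_emb), Nx.LinAlg.norm(elixir_emb)))
  |> Nx.to_number()

abs_diff = Nx.abs(Nx.subtract(python_emb, elixir_emb))
mean_abs_diff = abs_diff |> Nx.mean() |> Nx.to_number()
max_abs_diff = abs_diff |> Nx.reduce_max() |> Nx.to_number()
```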
@jonatanklosko I finally found some time again! I feel I addressed all the comments. Sorry for this large PR.