
Add Qwen3 model support

Open · nyo16 opened this issue 1 month ago · 6 comments

Add Qwen3 Model Family Support

Summary

This PR adds comprehensive support for the Qwen3 model family from Alibaba Cloud, including text generation, embeddings, and reranking models. Qwen3 is a state-of-the-art multilingual language model with advanced features like QK normalization and support for up to 262K context length.

What's New

  1. Qwen3 Text Generation Models

Architectures:

  • :base - Base Qwen3 model
  • :for_causal_language_modeling - Text generation
  • :for_sequence_classification - Classification tasks
  • :for_embedding - Text embeddings (new)

Key Features:

  • QK Normalization: RMS normalization on query and key projections for improved training stability (Qwen3-specific innovation)
  • Grouped Query Attention (GQA): 32 query heads with 8 key-value heads for efficient inference
  • Extended Context: Supports up to 262,144 tokens
  • High RoPE Theta: 5,000,000 base frequency (vs typical 10,000) for better long-context performance
  • Large Vocabulary: 151,936 tokens for multilingual support
  • Gated FFN: SwiGLU activation
  2. Qwen3-Embedding Support
  • Last Token Pooling: Added :last_token_pooling option to Bumblebee.Text.text_embedding/3 (see the sketch after this list)
  • Instruction-Aware: Supports custom task instructions (improves performance by 1-5% per Qwen team)
  • Multilingual: Over 100 languages supported
  • Flexible Dimensions: 1024-dim (0.6B), 2560-dim (4B), 4096-dim (8B)
  3. Qwen3-Reranker Support
  • Document Reranking: Score query-document pairs for relevance (0-1 range)
  • Custom Instructions: Task-specific prompts for better performance
  • High Accuracy: Relevant docs score 0.99+, irrelevant docs score near 0.0
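
Beyond the :for_embedding architecture, the new pooling can also be requested directly from Bumblebee.Text.text_embedding/3. A minimal sketch, assuming the option is spelled output_pool: :last_token_pooling (mirroring the existing :mean_pooling value; the exact spelling is an assumption, check the PR diff):

serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    # assumption: the PR exposes last-token pooling via :output_pool
    output_pool: :last_token_pooling,
    embedding_processor: :l2_norm
  )

%{embedding: embedding} = Nx.Serving.run(serving, "The cat sat on the mat")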

Files Changed

Core Implementation:

  • lib/bumblebee/text/qwen3.ex (730 lines) - Full Qwen3 model implementation
  • lib/bumblebee.ex - Model and tokenizer registrations
  • lib/bumblebee/text/text_embedding.ex - Added last token pooling

Examples:

  • examples/README.md - Example documentation
  • examples/qwen3.exs - Text generation example
  • examples/qwen3_embedding.exs - Embedding generation
  • examples/qwen3_embedding_prompts.exs - Instruction-aware embeddings
  • examples/qwen3_reranker.exs - Document reranking

Documentation:

  • QWEN3_IEX_GUIDE.md - Interactive IEx usage guide
  • .gitignore - Added .lexical/

Testing

Text Generation (Qwen3-4B-Instruct)

{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-4B-Instruct-2507"}) {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-4B-Instruct-2507"}) {:ok, config} = Bumblebee.load_generation_config({:hf, "Qwen/Qwen3-4B-Instruct-2507"})

serving = Bumblebee.Text.generation(model, tokenizer, config) Nx.Serving.run(serving, "The future of AI")

Results: Generates coherent English text, answers questions correctly, creates stories and code.

Text Embeddings (Qwen3-Embedding-0.6B)

{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Embedding-0.6B"}, architecture: :for_embedding ) {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})

serving = Bumblebee.Text.text_embedding(model, tokenizer, output_attribute: :embedding, embedding_processor: :l2_norm )

e1 = Nx.Serving.run(serving, "The cat sat on the mat") e2 = Nx.Serving.run(serving, "A feline rested on the rug") Nx.dot(e1.embedding, e2.embedding) |> Nx.to_number() # 0.73 (similar)

Results:

  • Generates 1024-dim normalized vectors
  • Semantic similarity: Similar texts = 0.72, different texts = 0.34
  • Instruction prompts improve relevance by ~5%

Reranking (Qwen3-Reranker-0.6B)

{:ok, model} = Bumblebee.load_model({:hf, "Qwen/Qwen3-Reranker-0.6B"})

Score query-document relevance

Relevant: 0.99+, Irrelevant: ~0.0

Results: Correctly ranks documents by relevance to queries.

Compatible Models

Text Generation:

  • Qwen/Qwen3-0.6B → Qwen/Qwen3-32B
  • Qwen/Qwen3-4B-Instruct-2507 (recommended)

Embeddings:

  • Qwen/Qwen3-Embedding-0.6B (1024-dim)
  • Qwen/Qwen3-Embedding-4B (2560-dim)
  • Qwen/Qwen3-Embedding-8B (4096-dim)

Reranking:

  • Qwen/Qwen3-Reranker-0.6B
  • Qwen/Qwen3-Reranker-4B
  • Qwen/Qwen3-Reranker-8B

Technical Implementation

QK Normalization

Unlike standard transformers, Qwen3 applies RMS normalization to query and key states: hidden -> dense -> split_heads -> rms_norm -> rotary -> attention
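
A minimal Nx sketch of the normalization step, with an illustrative epsilon and assumed shapes (the actual implementation lives in lib/bumblebee/text/qwen3.ex):

defmodule QKNormSketch do
  import Nx.Defn

  # RMSNorm over the last axis (head_size); one learned scale per dimension.
  defn rms_norm(x, weight, opts \\ []) do
    opts = keyword!(opts, epsilon: 1.0e-6)
    variance = Nx.mean(Nx.pow(x, 2), axes: [-1], keep_axes: true)
    x * Nx.rsqrt(variance + opts[:epsilon]) * weight
  end

  # query: {batch, seq_len, num_heads, head_size}, normalized per head
  # before the rotary embedding is applied
  defn normalize_query(query, weight), do: rms_norm(query, weight)
end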

Architecture Support

Custom decoder blocks implement QK normalization while maintaining compatibility with Bumblebee's transformer patterns.

Embedding Architecture

New :for_embedding architecture automatically pools the last non-padding token for text embedding tasks.
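
Conceptually, the pooling selects the hidden state at the last position where the attention mask is 1. A sketch with assumed shapes (not the exact code from qwen3.ex):

defmodule LastTokenPoolSketch do
  import Nx.Defn

  # hidden_state: {batch, seq_len, hidden_size}
  # attention_mask: {batch, seq_len}, 1 for real tokens, 0 for padding
  defn pool(hidden_state, attention_mask) do
    {batch, _seq_len, hidden_size} = Nx.shape(hidden_state)

    # Index of the last non-padding token in each sequence.
    last_index =
      attention_mask
      |> Nx.sum(axes: [1])
      |> Nx.subtract(1)
      |> Nx.as_type(:s64)
      |> Nx.reshape({batch, 1, 1})
      |> Nx.broadcast({batch, 1, hidden_size})

    hidden_state
    |> Nx.take_along_axis(last_index, axis: 1)
    |> Nx.squeeze(axes: [1])
  end
end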

Reranking

Uses the causal LM architecture with yes/no token logit extraction and softmax scoring.
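
A hedged sketch of the scoring math (the token ids below are hypothetical placeholders; the real "yes"/"no" ids must be looked up via the tokenizer):

# last_logits: {batch, vocab_size}, the causal LM logits at the final position
yes_id = 9693  # hypothetical id, resolve via the tokenizer
no_id = 2152   # hypothetical id, resolve via the tokenizer

pair = Nx.stack([last_logits[[.., no_id]], last_logits[[.., yes_id]]], axis: 1)

# Numerically stable softmax over the {no, yes} pair; the "yes" probability
# is the relevance score in the 0-1 range.
exp = Nx.exp(pair - Nx.reduce_max(pair, axes: [1], keep_axes: true))
scores = (exp / Nx.sum(exp, axes: [1], keep_axes: true))[[.., 1]]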

Breaking Changes

None. This is purely additive.

References

  • https://qwenlm.github.io/blog/qwen3/
  • https://qwenlm.github.io/blog/qwen3-embedding/
  • https://huggingface.co/collections/Qwen/qwen3-66850ac008e23f2e87b68084
  • https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/qwen3/modeling_qwen3.py

nyo16 · Oct 05 '25 13:10

I will test it tomorrow with my H200 to be sure that everything is working. With my MacBook the answers seem OK, but the generation is slow. My end goal is to add support for the embeddings and rerankers from Qwen. Comments are really welcome; I generated most of it with Sonnet 4.5.

nyo16 · Oct 05 '25 13:10

I was interested in getting a Qwen3 vision model working, like https://huggingface.co/huihui-ai/Huihui-MiniCPM-V-4_5-abliterated

fire · Oct 05 '25 22:10

Generation looking good!

iex(16)>   prompt = """
...(16)>   <|im_start|>system
...(16)>   You are a helpful assistant.<|im_end|>
...(16)>   <|im_start|>user
...(16)>   What is the capital of France?<|im_end|>
...(16)>   <|im_start|>assistant
...(16)>   """
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"
iex(17)>
nil
iex(18)>   result = Nx.Serving.run(serving, prompt)
%{
  results: [
    %{
      text: "The capital of France is Paris.",
      token_summary: %{input: 26, output: 8, padding: 0}
    }
  ]
}

Still more tests to do and write!

nyo16 · Oct 06 '25 23:10

@jonatanklosko I used a lightweight model with the Qwen3 architecture to write some basic tests, similar to the other PR. Let me know if this is enough.

nyo16 · Oct 09 '25 10:10

Sorry, work caught up with me. I will continue the PR this weekend.

nyo16 · Oct 22 '25 23:10

@jonatanklosko I managed to get some time. I compared the Elixir implementation with the Python transformers one.

🧪 Test Environment

  • Python: transformers 4.57.1, torch 2.9.0 (bf16)
  • Elixir: Bumblebee (local), Nx 0.10.0, EXLA 0.10.0 (bf16)
  • Model: Qwen/Qwen3-Embedding-0.6B
  • Platform: macOS ARM64

Test 1: Basic Text ("hello!")

| Metric | Python (transformers) | Elixir (Bumblebee) | Difference |
| --- | --- | --- | --- |
| Norm | 0.9961 | 0.9998 | 0.0037 |
| Cosine Similarity | — | — | 0.9998 ✅ |
| Mean Abs Diff | — | — | 0.00053 |
| Max Abs Diff | — | — | 0.0027 |
First 10 embedding values
Index  Python              Elixir              Abs Diff
0      0.0004043579        0.0005552031        0.0001508
1     -0.0277099609       -0.0279187821        0.0002088
2     -0.0111694336       -0.0111040613        0.0000654
3     -0.0184326172       -0.0174492393        0.0009834
4     -0.0209960938       -0.0209390856        0.0000570
5      0.0031738281        0.0026967004        0.0004771
6     -0.0356445312       -0.0361675136        0.0005230
7      0.0869140625        0.0869289339        0.0000149
8     -0.0446777344       -0.0447335020        0.0000558
9     -0.0195312500       -0.0200666245        0.0005354

Test 2: Query with Instruction Format

Text: "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:hello!"

| Metric | Python (transformers) | Elixir (Bumblebee) | Difference |
| --- | --- | --- | --- |
| Norm | 1.0000 | 1.0042 | 0.0042 |
| Cosine Similarity | — | — | 0.9992 ✅ |
| Mean Abs Diff | — | — | 0.00096 |
| Max Abs Diff | — | — | 0.0051 |
First 10 embedding values
Index  Python              Elixir              Abs Diff
0      0.0043945312        0.0041164518        0.0002781
1     -0.0073242188       -0.0055703474        0.0017539
2     -0.0057067871       -0.0059557175        0.0002489
3     -0.0380859375       -0.0409192815        0.0028333
4     -0.0319824219       -0.0322309434        0.0002485
5     -0.0294189453       -0.0298486538        0.0004297
6     -0.0546875000       -0.0546524674        0.0000350
7      0.0268554688        0.0284473095        0.0015918
8     -0.0693359375       -0.0681053847        0.0012306
9     -0.0324707031       -0.0281670410        0.0043037

Do you think the small difference is due to bf16 rounding / each language's floating-point implementation?


# Load model
{:ok, model_info} = Bumblebee.load_model(
  {:hf, "Qwen/Qwen3-Embedding-0.6B"},
  type: :bf16,
  architecture: :for_embedding
)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-Embedding-0.6B"})

# Create serving
serving = Bumblebee.Text.TextEmbedding.text_embedding(
  model_info,
  tokenizer,
  output_attribute: :embedding,
  embedding_processor: :l2_norm
)

# For documents (no instruction)
document = "The capital of China is Beijing."
%{embedding: doc_emb} = Nx.Serving.run(serving, document)

# For queries (NO space after Query:)
task = "Given a web search query, retrieve relevant passages that answer the query"
query = "Instruct: #{task}\nQuery:What is the capital of China?"
%{embedding: query_emb} = Nx.Serving.run(serving, query)

# Compute similarity
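# (Nx.dot/2 of two L2-normalized vectors is their cosine similarity;
#  pipe through Nx.to_number/1 to get a plain float)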
similarity = Nx.dot(query_emb, doc_emb)

nyo16 · Nov 04 '25 23:11

@jonatanklosko I finally found some time again! I feel I have addressed all the comments. Sorry for this large PR.

nyo16 · Nov 16 '25 22:11