Official transformer architecture support for Dlib
This pull request consolidates and stabilizes the transformer-related layers and components developed throughout 2024 and 2025. This substantial commit introduces official support for modern language modeling in Dlib, positioning the library as a reference implementation for building neural networks for natural language processing.
The extensions have been iteratively refined over the past year, with each component tested and validated across multiple architectures and use cases. This work establishes the foundation for upcoming multimodal capabilities, with active development underway for vision transformers and combined text-image processing.
Future releases will introduce examples demonstrating transformer architectures for image processing, followed by multimodal fusion combining textual and visual information. This PR represents an important milestone that could justify a new version of Dlib to mark the official introduction of these features.
Overview
This pull request introduces complete transformer architecture support to Dlib, enabling modern language modeling capabilities while maintaining Dlib's philosophy of simple APIs and production-ready implementations. All components are written in standard C++14 for cross-platform compatibility.
Major additions
Core architectural components
Attention mechanisms:
- multi-head self-attention with scaled dot-product computation (see the sketch after this list)
- canonical and fused transformer variants for research and production use
- causal masking for autoregressive generation
- rotary positional embeddings (RoPE) alongside absolute positional encodings
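To make the attention mechanics concrete, here is a minimal, illustrative sketch of single-head scaled dot-product attention with a causal mask, written against plain `dlib::matrix` objects rather than the new layer classes; the function name and the hand-rolled row-wise softmax are for illustration only and are not the API introduced by this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <dlib/matrix.h>

// Illustrative single-head scaled dot-product attention with an optional
// causal mask. Q, K, V each have one row per sequence position.
dlib::matrix<float> scaled_dot_product_attention(
    const dlib::matrix<float>& Q,
    const dlib::matrix<float>& K,
    const dlib::matrix<float>& V,
    bool causal = true
)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(K.nc()));
    dlib::matrix<float> scores = Q * dlib::trans(K) * scale;

    // Causal mask: position i may only attend to positions j <= i.
    if (causal)
        for (long i = 0; i < scores.nr(); ++i)
            for (long j = i + 1; j < scores.nc(); ++j)
                scores(i, j) = -std::numeric_limits<float>::infinity();

    // Row-wise softmax, stabilized by subtracting each row's maximum.
    for (long i = 0; i < scores.nr(); ++i)
    {
        float row_max = -std::numeric_limits<float>::infinity();
        for (long j = 0; j < scores.nc(); ++j)
            row_max = std::max(row_max, scores(i, j));

        float denom = 0.0f;
        for (long j = 0; j < scores.nc(); ++j)
        {
            scores(i, j) = std::exp(scores(i, j) - row_max);
            denom += scores(i, j);
        }
        for (long j = 0; j < scores.nc(); ++j)
            scores(i, j) /= denom;
    }

    return scores * V;  // weighted sum of value vectors
}
```

In the layer-based implementation, the same computation is assembled from the `transpose`, `multm_prev`, and `tril` building blocks listed below and runs on tensors rather than standalone matrices.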
Specialized layers:
- `linear` layer with plane-wise matrix multiplication for sequence processing
- `rms_norm` layer implementing efficient RMS normalization
- `reshape_to` layer for dimension manipulation without data copying
- `token_embeddings` layer combining embedding lookup with positional encoding
- `tril` layer for triangular mask generation
- `transpose` and `multm_prev` layers for attention computation
- `dropout_rate` layer with configurable per-layer dropout schedules
Advanced architectures:
- mixture-of-experts (MoE) with dynamic expert routing and load balancing
- hierarchical reasoning model (HRM) with dual recurrent modules
- adaptive computation time (ACT) for dynamic computation allocation
- SwiGLU gated activation for improved feed-forward networks
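For reference, the SwiGLU computation reduces to Swish(xW_gate) ⊙ (xW_up) followed by a down projection. The sketch below spells that out on plain `dlib::matrix` objects; the weight names and the free function are illustrative and do not mirror the PR's layer interface.

```cpp
#include <cmath>
#include <dlib/matrix.h>

// Illustrative sketch of a SwiGLU feed-forward block:
//   SwiGLU(x) = Swish(x * W_gate) .* (x * W_up), followed by a down projection.
// W_gate, W_up: (d_model x d_ff); W_down: (d_ff x d_model). x has one row per token.
dlib::matrix<float> swiglu_ffn(
    const dlib::matrix<float>& x,
    const dlib::matrix<float>& W_gate,
    const dlib::matrix<float>& W_up,
    const dlib::matrix<float>& W_down
)
{
    dlib::matrix<float> gate = x * W_gate;
    dlib::matrix<float> up   = x * W_up;

    // Swish (SiLU) applied element-wise to the gate branch: z * sigmoid(z).
    for (long r = 0; r < gate.nr(); ++r)
        for (long c = 0; c < gate.nc(); ++c)
        {
            const float z = gate(r, c);
            gate(r, c) = z / (1.0f + std::exp(-z));
        }

    // Element-wise gating, then projection back to the model dimension.
    return dlib::pointwise_multiply(gate, up) * W_down;
}
```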
Language modeling utilities
Dataset preparation (language_model_data.h):
- `build_single_token_prediction_dataset()` for autoregressive training
- `build_multi_token_prediction_dataset()` for sequence-to-sequence tasks
- `shuffle_training_dataset()` for data randomization
- `augment_training_dataset()` for noise injection and robustness improvement
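For readers new to the data layout, the sketch below illustrates the general idea behind single-token prediction datasets: sliding windows of token ids paired with the token that follows each window. The function name and return types are illustrative and do not reproduce the exact signatures in `language_model_data.h`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: build (window, next-token) pairs for autoregressive
// training from a flat stream of token ids.
std::pair<std::vector<std::vector<int>>, std::vector<int>>
make_single_token_prediction_pairs(const std::vector<int>& tokens, std::size_t window)
{
    std::vector<std::vector<int>> inputs;
    std::vector<int> labels;
    if (tokens.size() <= window)
        return { inputs, labels };

    for (std::size_t i = 0; i + window < tokens.size(); ++i)
    {
        // Input: tokens[i .. i+window-1]; label: the token that follows.
        inputs.emplace_back(tokens.begin() + i, tokens.begin() + i + window);
        labels.push_back(tokens[i + window]);
    }
    return { inputs, labels };
}
```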
Inference management:
- `inference_context` class for autoregressive generation with sliding window
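The sliding-window idea behind autoregressive generation can be pictured as follows: keep the most recent `window` tokens as context, predict one token, append it, and slide. The sketch below is generic; `predict_next` stands in for a forward pass through the trained network and is not an actual Dlib or PR function, and the real `inference_context` interface may differ.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Illustrative sliding-window autoregressive loop. `predict_next` stands in
// for a forward pass of the trained model over the current context window.
std::vector<int> generate_tokens(
    const std::vector<int>& prompt,
    std::size_t window,
    std::size_t max_new_tokens,
    const std::function<int(const std::vector<int>&)>& predict_next
)
{
    std::deque<int> context(prompt.begin(), prompt.end());
    std::vector<int> generated;

    for (std::size_t step = 0; step < max_new_tokens; ++step)
    {
        // Keep only the most recent `window` tokens as context.
        while (context.size() > window)
            context.pop_front();

        const std::vector<int> view(context.begin(), context.end());
        const int next = predict_next(view);

        generated.push_back(next);
        context.push_back(next);
    }
    return generated;
}
```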
Evaluation metrics:
- edit distance (Levenshtein) with normalization (see the sketch after this list)
- token overlap metrics (precision, recall, F1-score)
- n-gram overlap (BLEU-like) for structural similarity
- `compute_text_similarity()` combining all metrics
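As an illustration of the edit-distance metric, here is a textbook Levenshtein implementation with length normalization; it shows the idea only and is not the exact code or signature shipped in this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance, normalized to [0, 1]
// by the length of the longer string (0 = identical, 1 = maximally different).
double normalized_edit_distance(const std::string& a, const std::string& b)
{
    const std::size_t n = a.size(), m = b.size();
    if (n == 0 && m == 0)
        return 0.0;

    std::vector<std::size_t> prev(m + 1), curr(m + 1);
    for (std::size_t j = 0; j <= m; ++j)
        prev[j] = j;

    for (std::size_t i = 1; i <= n; ++i)
    {
        curr[0] = i;
        for (std::size_t j = 1; j <= m; ++j)
        {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1,        // deletion
                                 curr[j - 1] + 1,    // insertion
                                 prev[j - 1] + cost  // substitution
                               });
        }
        std::swap(prev, curr);
    }
    return static_cast<double>(prev[m]) / static_cast<double>(std::max(n, m));
}
```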
Preprocessing:
- `detect_file_type()` supporting 30+ formats via magic numbers and entropy analysis
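The entropy half of that detection can be sketched as follows: compute the Shannon entropy of the byte histogram and treat high-entropy buffers as likely compressed or binary data. This is a generic illustration, not the detection logic or thresholds used by `detect_file_type()`.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy of a byte buffer, in bits per byte (0 to 8).
// Text typically scores well below 8; compressed/encrypted data is close to 8.
double byte_entropy(const std::vector<std::uint8_t>& data)
{
    if (data.empty())
        return 0.0;

    std::array<std::size_t, 256> counts{};
    for (const std::uint8_t b : data)
        ++counts[b];

    double entropy = 0.0;
    for (const std::size_t c : counts)
    {
        if (c == 0)
            continue;
        const double p = static_cast<double>(c) / static_cast<double>(data.size());
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```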
Complete transformer implementations
Canonical transformer (canonical_transformer namespace):
- explicit Q, K, V projections for modularity and research
- `transformer_block` combining attention and feed-forward networks
- `transformer_stack` for building deep architectures
Fused transformer (fused_transformer namespace):
- combined QKV projection for memory and compute efficiency
- optimized for production deployment scenarios
- compatible API with canonical variant
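To show what "fused" means in practice, the sketch below performs one combined QKV projection and then slices the result into the three attention inputs; the struct, helper name, and weight layout are illustrative rather than the `fused_transformer` API.

```cpp
#include <dlib/matrix.h>

// Illustrative fused QKV projection: one matrix multiply produces the
// concatenated [Q | K | V] activations, which are then split by column range.
// x: (seq_len x d_model), W_qkv: (d_model x 3*d_model).
struct qkv { dlib::matrix<float> Q, K, V; };

qkv fused_qkv_projection(const dlib::matrix<float>& x, const dlib::matrix<float>& W_qkv)
{
    const long d_model = x.nc();
    const dlib::matrix<float> packed = x * W_qkv;   // (seq_len x 3*d_model)

    qkv out;
    out.Q = dlib::subm(packed, 0, 0,           packed.nr(), d_model);
    out.K = dlib::subm(packed, 0, d_model,     packed.nr(), d_model);
    out.V = dlib::subm(packed, 0, 2 * d_model, packed.nr(), d_model);
    return out;
}
```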
Loss functions
Cross-entropy per logit (loss_cross_entropy_per_logit):
- specialized loss for sequence models working directly with linear layer output
- computes loss only at last sequence position
- avoids dimension flattening while preserving sequence structure
- numerically stable via log-sum-exp trick
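The numerical idea is sketched below: evaluate the cross-entropy only on the logits of the last sequence position, using the log-sum-exp trick so large logits cannot overflow. This is a plain illustration of the math, not the `loss_cross_entropy_per_logit` implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Cross-entropy of the target class given the logits at the last sequence
// position, computed as log(sum_j exp(z_j)) - z_target with the log-sum-exp
// trick (subtract the max logit) for numerical stability.
double last_position_cross_entropy(
    const std::vector<std::vector<double>>& logits_per_position,  // [seq_len][vocab]
    std::size_t target_token
)
{
    const std::vector<double>& z = logits_per_position.back();  // last position only

    const double z_max = *std::max_element(z.begin(), z.end());
    double sum_exp = 0.0;
    for (const double v : z)
        sum_exp += std::exp(v - z_max);

    const double log_sum_exp = z_max + std::log(sum_exp);
    return log_sum_exp - z[target_token];
}
```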
Example programs
Three progressive examples demonstrate the capabilities:
slm_basic_train_ex.cpp: character-level transformer training on Shakespeare text (3 layers, 4 heads, 64-dim embeddings, ~5.2M parameters). Demonstrates fundamental attention mechanics and memorization capability.
slm_advanced_train_ex.cpp: BPE tokenization with compact architecture (4 layers, 6 heads, 228-dim embeddings, ~4M parameters). Introduces specialized loss function and byte-for-byte verification.
slm_mixture_of_experts_ex.cpp: sparse conditional computation with production-grade utilities (4 layers, 6 heads, 4 experts per layer, ~6M training / 5.4M inference parameters). Demonstrates shuffle and augmentation utilities for robust training.
Technical design
Matrix plane processing
Traditional Dlib layers operate channel-wise on 4D tensors. The extensions introduce plane-wise processing where (rows, cols) dimensions form semantic units for sequence data. This enables:
- natural representation of sequence data: (batch, 1, sequence_length, embedding_dim) (see the sketch below)
- efficient attention computation over spatial planes
- seamless integration with existing Dlib computational graph
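A small sketch of that layout using Dlib's `resizable_tensor`: each sample stores one (sequence_length x embedding_dim) plane with k = 1, so the (rows, cols) plane is the semantic unit. The indexing arithmetic assumes the standard N x K x NR x NC row-major host layout of dlib tensors and is shown only to make the convention explicit.

```cpp
#include <dlib/dnn.h>

int main()
{
    const long batch = 2, seq_len = 16, embed_dim = 64;

    // One (seq_len x embed_dim) plane per sample, with k = 1:
    // the (rows, cols) plane is the semantic unit for sequence data.
    dlib::resizable_tensor t;
    t.set_size(batch, 1, seq_len, embed_dim);
    t = 0;  // zero-initialize all elements

    // Accessing element (position p, feature f) of sample s in host memory,
    // assuming the standard N x K x NR x NC row-major layout of dlib tensors.
    float* data = t.host();
    const long s = 1, p = 3, f = 10;
    data[((s * t.k() + 0) * t.nr() + p) * t.nc() + f] = 1.0f;

    return 0;
}
```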
Implementation approach
All components follow Dlib's design patterns:
- header-only implementations where appropriate
- template-based abstractions for compile-time optimization
- compatibility with existing training infrastructure (dnn_trainer, optimizers, serialization)
- comprehensive inline documentation following Dlib's conventions
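As a reminder of the training pattern the new components plug into, here is a generic Dlib training and serialization loop using a tiny standard network (not the new transformer layers); it only illustrates that `dnn_trainer`, the existing optimizers, and `serialize()` are reused unchanged.

```cpp
#include <vector>
#include <dlib/dnn.h>

// A toy network using standard Dlib layers, stand-in for a transformer net.
using toy_net = dlib::loss_multiclass_log<
                    dlib::fc<2,
                    dlib::relu<
                    dlib::fc<8,
                    dlib::input<dlib::matrix<float>>
                    >>>>;

int main()
{
    std::vector<dlib::matrix<float>> samples;
    std::vector<unsigned long> labels;
    for (int i = 0; i < 64; ++i)
    {
        // Two trivially separable classes: all-ones vs. all-minus-ones vectors.
        const float v = (i % 2 == 0) ? 1.0f : -1.0f;
        samples.push_back(dlib::ones_matrix<float>(4, 1) * v);
        labels.push_back(static_cast<unsigned long>(i % 2));
    }

    toy_net net;
    dlib::dnn_trainer<toy_net> trainer(net, dlib::sgd());
    trainer.set_learning_rate(0.01);
    trainer.set_mini_batch_size(16);
    trainer.set_max_num_epochs(50);
    trainer.train(samples, labels);

    // Networks built from the new layers serialize the same way.
    dlib::serialize("toy_net.dat") << net;
    return 0;
}
```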
Testing and validation
The example programs demonstrate:
- perfect memorization on training data (99.99% accuracy for the basic example)
- byte-for-byte reproduction capability (advanced example)
- balanced expert utilization in MoE (coefficient of variation < 0.3)
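The expert-balance criterion is simply the coefficient of variation (standard deviation divided by mean) of per-expert token counts; a generic computation is shown below.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficient of variation (std-dev / mean) of per-expert routing counts.
// Values near 0 indicate evenly used experts; the MoE example targets < 0.3.
double coefficient_of_variation(const std::vector<double>& expert_counts)
{
    if (expert_counts.empty())
        return 0.0;

    double mean = 0.0;
    for (const double c : expert_counts)
        mean += c;
    mean /= static_cast<double>(expert_counts.size());
    if (mean == 0.0)
        return 0.0;

    double var = 0.0;
    for (const double c : expert_counts)
        var += (c - mean) * (c - mean);
    var /= static_cast<double>(expert_counts.size());

    return std::sqrt(var) / mean;
}
```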
Main files modified/added
New headers:
- `dlib/dnn/transformer.h` - complete transformer implementations
- `dlib/dnn/layers_transformer.h` - specialized layers for sequence processing
- `dlib/dnn/language_model_data.h` - utilities for dataset preparation and evaluation
- `dlib/tokenizer/bpe_tokenizer.h` - byte-pair encoding tokenization
New examples:
- `examples/slm_basic_train_ex.cpp`
- `examples/slm_advanced_train_ex.cpp`
- `examples/slm_mixture_of_experts_ex.cpp`
- `examples/slm_data.h` - internal datasets for examples
Abstract documentation:
- `docs/layers_abstract.h` - layer specifications and usage patterns
- `docs/transformer_abstract.h` - transformer architecture documentation
- `docs/language_model_data_abstract.h` - language modeling utility documentation
Extended documentation
For more details, see the dedicated repository: https://github.com/Cydral/Dlib-Transformer-extensions
This contribution establishes official transformer support in Dlib, extending the library into modern natural language processing while maintaining its core values of simplicity, performance, and production readiness. The groundwork laid here enables upcoming vision transformer implementations and multimodal architectures, positioning Dlib as a comprehensive framework for contemporary deep learning applications.