Official transformer architecture support for Dlib
This pull request consolidates and stabilizes the transformer-related layers and components developed throughout 2024 and 2025. This substantial commit introduces official support for modern language modeling in Dlib, positioning the library as a reference implementation for building neural networks for natural language processing.
The extensions have been iteratively refined over the past year, with each component tested and validated across multiple architectures and use cases. This work establishes the foundation for upcoming multimodal capabilities, with active development underway for vision transformers and combined text-image processing.
Future releases will introduce examples demonstrating transformer architectures for image processing, followed by multimodal fusion combining textual and visual information. This PR represents an important milestone that could justify a new version of Dlib to mark the official introduction of these features.
Overview
This pull request introduces complete transformer architecture support to Dlib, enabling modern language modeling capabilities while maintaining Dlib's philosophy of simple APIs and production-ready implementations. All components are written in standard C++14 for cross-platform compatibility.
Major additions
Core architectural components
Attention mechanisms:
- multi-head self-attention with scaled dot-product computation (see the sketch after this list)
- canonical and fused transformer variants for research and production use
- causal masking for autoregressive generation
- rotary positional embeddings (RoPE) alongside absolute positional encodings
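To make the attention mechanics concrete, here is a minimal, illustrative sketch of single-head scaled dot-product attention with a causal mask, written against plain `dlib::matrix` objects rather than the new layer classes; the function name and the hand-rolled row-wise softmax are for illustration only and are not the API introduced by this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <dlib/matrix.h>

// Illustrative single-head scaled dot-product attention with an optional
// causal mask. Q, K, V each have one row per sequence position.
dlib::matrix<float> scaled_dot_product_attention(
    const dlib::matrix<float>& Q,
    const dlib::matrix<float>& K,
    const dlib::matrix<float>& V,
    bool causal = true
)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(K.nc()));
    dlib::matrix<float> scores = Q * dlib::trans(K) * scale;

    // Causal mask: position i may only attend to positions j <= i.
    if (causal)
        for (long i = 0; i < scores.nr(); ++i)
            for (long j = i + 1; j < scores.nc(); ++j)
                scores(i, j) = -std::numeric_limits<float>::infinity();

    // Row-wise softmax, stabilized by subtracting each row's maximum.
    for (long i = 0; i < scores.nr(); ++i)
    {
        float row_max = -std::numeric_limits<float>::infinity();
        for (long j = 0; j < scores.nc(); ++j)
            row_max = std::max(row_max, scores(i, j));

        float denom = 0.0f;
        for (long j = 0; j < scores.nc(); ++j)
        {
            scores(i, j) = std::exp(scores(i, j) - row_max);
            denom += scores(i, j);
        }
        for (long j = 0; j < scores.nc(); ++j)
            scores(i, j) /= denom;
    }

    return scores * V;  // weighted sum of value vectors
}
```

In the layer-based implementation, the same computation is assembled from the `transpose`, `multm_prev`, and `tril` building blocks listed below and runs on tensors rather than standalone matrices.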
Specialized layers:
- `linear` layer with plane-wise matrix multiplication for sequence processing
- `rms_norm` layer implementing efficient RMS normalization
- `reshape_to` layer for dimension manipulation without data copying
- `token_embeddings` layer combining embedding lookup with positional encoding
- `tril` layer for triangular mask generation
- `transpose` and `multm_prev` layers for attention computation
- `dropout_rate` layer with configurable per-layer dropout schedules
Advanced architectures:
- mixture-of-experts (MoE) with dynamic expert routing and load balancing
- hierarchical reasoning model (HRM) with dual recurrent modules
- adaptive computation time (ACT) for dynamic computation allocation
- SwiGLU gated activation for improved feed-forward networks
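For reference, the SwiGLU computation reduces to Swish(xW_gate) ⊙ (xW_up) followed by a down projection. The sketch below spells that out on plain `dlib::matrix` objects; the weight names and the free function are illustrative and do not mirror the PR's layer interface.

```cpp
#include <cmath>
#include <dlib/matrix.h>

// Illustrative sketch of a SwiGLU feed-forward block:
//   SwiGLU(x) = Swish(x * W_gate) .* (x * W_up), followed by a down projection.
// W_gate, W_up: (d_model x d_ff); W_down: (d_ff x d_model). x has one row per token.
dlib::matrix<float> swiglu_ffn(
    const dlib::matrix<float>& x,
    const dlib::matrix<float>& W_gate,
    const dlib::matrix<float>& W_up,
    const dlib::matrix<float>& W_down
)
{
    dlib::matrix<float> gate = x * W_gate;
    dlib::matrix<float> up   = x * W_up;

    // Swish (SiLU) applied element-wise to the gate branch: z * sigmoid(z).
    for (long r = 0; r < gate.nr(); ++r)
        for (long c = 0; c < gate.nc(); ++c)
        {
            const float z = gate(r, c);
            gate(r, c) = z / (1.0f + std::exp(-z));
        }

    // Element-wise gating, then projection back to the model dimension.
    return dlib::pointwise_multiply(gate, up) * W_down;
}
```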
Language modeling utilities
Dataset preparation (language_model_data.h):
- `build_single_token_prediction_dataset()` for autoregressive training
- `build_multi_token_prediction_dataset()` for sequence-to-sequence tasks
- `shuffle_training_dataset()` for data randomization
- `augment_training_dataset()` for noise injection and robustness improvement
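For readers new to the data layout, the sketch below illustrates the general idea behind single-token prediction datasets: sliding windows of token ids paired with the token that follows each window. The function name and return types are illustrative and do not reproduce the exact signatures in `language_model_data.h`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: build (window, next-token) pairs for autoregressive
// training from a flat stream of token ids.
std::pair<std::vector<std::vector<int>>, std::vector<int>>
make_single_token_prediction_pairs(const std::vector<int>& tokens, std::size_t window)
{
    std::vector<std::vector<int>> inputs;
    std::vector<int> labels;
    if (tokens.size() <= window)
        return { inputs, labels };

    for (std::size_t i = 0; i + window < tokens.size(); ++i)
    {
        // Input: tokens[i .. i+window-1]; label: the token that follows.
        inputs.emplace_back(tokens.begin() + i, tokens.begin() + i + window);
        labels.push_back(tokens[i + window]);
    }
    return { inputs, labels };
}
```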
Inference management:
- `inference_context` class for autoregressive generation with sliding window
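The sliding-window idea behind autoregressive generation can be pictured as follows: keep the most recent `window` tokens as context, predict one token, append it, and slide. The sketch below is generic; `predict_next` stands in for a forward pass through the trained network and is not an actual Dlib or PR function, and the real `inference_context` interface may differ.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Illustrative sliding-window autoregressive loop. `predict_next` stands in
// for a forward pass of the trained model over the current context window.
std::vector<int> generate_tokens(
    const std::vector<int>& prompt,
    std::size_t window,
    std::size_t max_new_tokens,
    const std::function<int(const std::vector<int>&)>& predict_next
)
{
    std::deque<int> context(prompt.begin(), prompt.end());
    std::vector<int> generated;

    for (std::size_t step = 0; step < max_new_tokens; ++step)
    {
        // Keep only the most recent `window` tokens as context.
        while (context.size() > window)
            context.pop_front();

        const std::vector<int> view(context.begin(), context.end());
        const int next = predict_next(view);

        generated.push_back(next);
        context.push_back(next);
    }
    return generated;
}
```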
Evaluation metrics:
- edit distance (Levenshtein) with normalization (see the sketch after this list)
- token overlap metrics (precision, recall, F1-score)
- n-gram overlap (BLEU-like) for structural similarity
- `compute_text_similarity()` combining all metrics
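As an illustration of the edit-distance metric, here is a textbook Levenshtein implementation with length normalization; it shows the idea only and is not the exact code or signature shipped in this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance, normalized to [0, 1]
// by the length of the longer string (0 = identical, 1 = maximally different).
double normalized_edit_distance(const std::string& a, const std::string& b)
{
    const std::size_t n = a.size(), m = b.size();
    if (n == 0 && m == 0)
        return 0.0;

    std::vector<std::size_t> prev(m + 1), curr(m + 1);
    for (std::size_t j = 0; j <= m; ++j)
        prev[j] = j;

    for (std::size_t i = 1; i <= n; ++i)
    {
        curr[0] = i;
        for (std::size_t j = 1; j <= m; ++j)
        {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1,        // deletion
                                 curr[j - 1] + 1,    // insertion
                                 prev[j - 1] + cost  // substitution
                               });
        }
        std::swap(prev, curr);
    }
    return static_cast<double>(prev[m]) / static_cast<double>(std::max(n, m));
}
```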
Preprocessing:
- `detect_file_type()` supporting 30+ formats via magic numbers and entropy analysis
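The entropy half of that detection can be sketched as follows: compute the Shannon entropy of the byte histogram and treat high-entropy buffers as likely compressed or binary data. This is a generic illustration, not the detection logic or thresholds used by `detect_file_type()`.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy of a byte buffer, in bits per byte (0 to 8).
// Text typically scores well below 8; compressed/encrypted data is close to 8.
double byte_entropy(const std::vector<std::uint8_t>& data)
{
    if (data.empty())
        return 0.0;

    std::array<std::size_t, 256> counts{};
    for (const std::uint8_t b : data)
        ++counts[b];

    double entropy = 0.0;
    for (const std::size_t c : counts)
    {
        if (c == 0)
            continue;
        const double p = static_cast<double>(c) / static_cast<double>(data.size());
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```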
Complete transformer implementations
Canonical transformer (canonical_transformer namespace):
- explicit Q, K, V projections for modularity and research
- `transformer_block` combining attention and feed-forward networks
- `transformer_stack` for building deep architectures
Fused transformer (fused_transformer namespace):
- combined QKV projection for memory and compute efficiency
- optimized for production deployment scenarios
- compatible API with canonical variant
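To show what "fused" means in practice, the sketch below performs one combined QKV projection and then slices the result into the three attention inputs; the struct, helper name, and weight layout are illustrative rather than the `fused_transformer` API.

```cpp
#include <dlib/matrix.h>

// Illustrative fused QKV projection: one matrix multiply produces the
// concatenated [Q | K | V] activations, which are then split by column range.
// x: (seq_len x d_model), W_qkv: (d_model x 3*d_model).
struct qkv { dlib::matrix<float> Q, K, V; };

qkv fused_qkv_projection(const dlib::matrix<float>& x, const dlib::matrix<float>& W_qkv)
{
    const long d_model = x.nc();
    const dlib::matrix<float> packed = x * W_qkv;   // (seq_len x 3*d_model)

    qkv out;
    out.Q = dlib::subm(packed, 0, 0,           packed.nr(), d_model);
    out.K = dlib::subm(packed, 0, d_model,     packed.nr(), d_model);
    out.V = dlib::subm(packed, 0, 2 * d_model, packed.nr(), d_model);
    return out;
}
```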
Loss functions
Cross-entropy per logit (loss_cross_entropy_per_logit):
- specialized loss for sequence models working directly with linear layer output
- computes loss only at last sequence position
- avoids dimension flattening while preserving sequence structure
- numerically stable via log-sum-exp trick
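The numerical idea is sketched below: evaluate the cross-entropy only on the logits of the last sequence position, using the log-sum-exp trick so large logits cannot overflow. This is a plain illustration of the math, not the `loss_cross_entropy_per_logit` implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Cross-entropy of the target class given the logits at the last sequence
// position, computed as log(sum_j exp(z_j)) - z_target with the log-sum-exp
// trick (subtract the max logit) for numerical stability.
double last_position_cross_entropy(
    const std::vector<std::vector<double>>& logits_per_position,  // [seq_len][vocab]
    std::size_t target_token
)
{
    const std::vector<double>& z = logits_per_position.back();  // last position only

    const double z_max = *std::max_element(z.begin(), z.end());
    double sum_exp = 0.0;
    for (const double v : z)
        sum_exp += std::exp(v - z_max);

    const double log_sum_exp = z_max + std::log(sum_exp);
    return log_sum_exp - z[target_token];
}
```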
Example programs
Three progressive examples demonstrate the capabilities:
slm_basic_train_ex.cpp: character-level transformer training on Shakespeare text (3 layers, 4 heads, 64-dim embeddings, ~5.2M parameters). Demonstrates fundamental attention mechanics and memorization capability.
slm_advanced_train_ex.cpp: BPE tokenization with compact architecture (4 layers, 6 heads, 228-dim embeddings, ~4M parameters). Introduces specialized loss function and byte-for-byte verification.
slm_mixture_of_experts_ex.cpp: sparse conditional computation with production-grade utilities (4 layers, 6 heads, 4 experts per layer, ~6M training / 5.4M inference parameters). Demonstrates shuffle and augmentation utilities for robust training.
Technical design
Matrix plane processing
Traditional Dlib layers operate channel-wise on 4D tensors. The extensions introduce plane-wise processing where (rows, cols) dimensions form semantic units for sequence data. This enables:
- natural representation of sequence data: (batch, 1, sequence_length, embedding_dim) (see the sketch below)
- efficient attention computation over spatial planes
- seamless integration with existing Dlib computational graph
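A small sketch of that layout using Dlib's `resizable_tensor`: each sample stores one (sequence_length x embedding_dim) plane with k = 1, so the (rows, cols) plane is the semantic unit. The indexing arithmetic assumes the standard N x K x NR x NC row-major host layout of dlib tensors and is shown only to make the convention explicit.

```cpp
#include <dlib/dnn.h>

int main()
{
    const long batch = 2, seq_len = 16, embed_dim = 64;

    // One (seq_len x embed_dim) plane per sample, with k = 1:
    // the (rows, cols) plane is the semantic unit for sequence data.
    dlib::resizable_tensor t;
    t.set_size(batch, 1, seq_len, embed_dim);
    t = 0;  // zero-initialize all elements

    // Accessing element (position p, feature f) of sample s in host memory,
    // assuming the standard N x K x NR x NC row-major layout of dlib tensors.
    float* data = t.host();
    const long s = 1, p = 3, f = 10;
    data[((s * t.k() + 0) * t.nr() + p) * t.nc() + f] = 1.0f;

    return 0;
}
```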
Implementation approach
All components follow Dlib's design patterns:
- header-only implementations where appropriate
- template-based abstractions for compile-time optimization
- compatibility with existing training infrastructure (dnn_trainer, optimizers, serialization)
- comprehensive inline documentation following Dlib's conventions
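As a reminder of the training pattern the new components plug into, here is a generic Dlib training and serialization loop using a tiny standard network (not the new transformer layers); it only illustrates that `dnn_trainer`, the existing optimizers, and `serialize()` are reused unchanged.

```cpp
#include <vector>
#include <dlib/dnn.h>

// A toy network using standard Dlib layers, stand-in for a transformer net.
using toy_net = dlib::loss_multiclass_log<
                    dlib::fc<2,
                    dlib::relu<
                    dlib::fc<8,
                    dlib::input<dlib::matrix<float>>
                    >>>>;

int main()
{
    std::vector<dlib::matrix<float>> samples;
    std::vector<unsigned long> labels;
    for (int i = 0; i < 64; ++i)
    {
        // Two trivially separable classes: all-ones vs. all-minus-ones vectors.
        const float v = (i % 2 == 0) ? 1.0f : -1.0f;
        samples.push_back(dlib::ones_matrix<float>(4, 1) * v);
        labels.push_back(static_cast<unsigned long>(i % 2));
    }

    toy_net net;
    dlib::dnn_trainer<toy_net> trainer(net, dlib::sgd());
    trainer.set_learning_rate(0.01);
    trainer.set_mini_batch_size(16);
    trainer.set_max_num_epochs(50);
    trainer.train(samples, labels);

    // Networks built from the new layers serialize the same way.
    dlib::serialize("toy_net.dat") << net;
    return 0;
}
```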
Testing and validation
The example programs demonstrate:
- perfect memorization on training data (99.99% accuracy for the basic example)
- byte-for-byte reproduction capability (advanced example)
- balanced expert utilization in MoE (coefficient of variation < 0.3)
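The expert-balance criterion is simply the coefficient of variation (standard deviation divided by mean) of per-expert token counts; a generic computation is shown below.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficient of variation (std-dev / mean) of per-expert routing counts.
// Values near 0 indicate evenly used experts; the MoE example targets < 0.3.
double coefficient_of_variation(const std::vector<double>& expert_counts)
{
    if (expert_counts.empty())
        return 0.0;

    double mean = 0.0;
    for (const double c : expert_counts)
        mean += c;
    mean /= static_cast<double>(expert_counts.size());
    if (mean == 0.0)
        return 0.0;

    double var = 0.0;
    for (const double c : expert_counts)
        var += (c - mean) * (c - mean);
    var /= static_cast<double>(expert_counts.size());

    return std::sqrt(var) / mean;
}
```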
Main files modified/added
New headers:
- `dlib/dnn/transformer.h` - complete transformer implementations
- `dlib/dnn/layers_transformer.h` - specialized layers for sequence processing
- `dlib/dnn/language_model_data.h` - utilities for dataset preparation and evaluation
- `dlib/tokenizer/bpe_tokenizer.h` - byte-pair encoding tokenization
New examples:
- `examples/slm_basic_train_ex.cpp`
- `examples/slm_advanced_train_ex.cpp`
- `examples/slm_mixture_of_experts_ex.cpp`
- `examples/slm_data.h` - internal datasets for examples
Abstract documentation:
- `docs/layers_abstract.h` - layer specifications and usage patterns
- `docs/transformer_abstract.h` - transformer architecture documentation
- `docs/language_model_data_abstract.h` - language modeling utility documentation
Extended documentation
For more details, see the dedicated repository: https://github.com/Cydral/Dlib-Transformer-extensions
This contribution establishes official transformer support in Dlib, extending the library into modern natural language processing while maintaining its core values of simplicity, performance, and production readiness. The groundwork laid here enables upcoming vision transformer implementations and multimodal architectures, positioning Dlib as a comprehensive framework for contemporary deep learning applications.