
Feature: Binary Serialization for Index and Index Player Tool

Open agourdel opened this issue 1 month ago • 0 comments

Add Binary Serialization for Index and Index Player Tool

Overview

This PR introduces binary serialization capabilities for the Index structure, along with a web-based debugging tool for exploring FSM indices.

New Features

1. Index Serialization (save() and load())

Added two new methods to the Index structure:

save(path: &Path) -> Result<()>

Serializes the Index to a compressed binary file. This method:

  • Converts all Index data (vocabulary size, EOS token ID, initial state, final states, and transitions) into a compact binary format
  • Compresses the data using gzip compression (via flate2)
  • Writes the compressed data to the specified file path

Usage:

use std::path::Path;

let index = Index::new(regex, &vocabulary)?;
index.save(Path::new("index.outlines"))?;

load(path: &Path) -> Result<Index>

Deserializes an Index from a compressed binary file. This associated function (called as Index::load, without an existing instance):

  • Reads and decompresses the gzip file
  • Parses the binary format according to the specification
  • Reconstructs the complete Index structure with all states and transitions

Usage:

let index = Index::load(Path::new("index.outlines"))?;

Benefits:

  • Performance: Loading a pre-built Index is significantly faster than rebuilding it from regex and vocabulary
  • Storage: Gzip compression reduces file size by 50-90% depending on the data
  • Portability: Binary files can be shared and loaded across different environments
  • Caching: Enables efficient caching of complex FSM indices
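
The caching benefit can be sketched as a small load-or-build helper. This is only an illustration under stated assumptions: `load_or_build` is a hypothetical name, and the three closures stand in for `Index::load`, `Index::new`, and `Index::save` so that the sketch compiles with the standard library alone.

```rust
use std::path::Path;

/// Returns the cached value at `path` if it can be loaded; otherwise builds a
/// fresh value and tries to cache it. `load`, `build`, and `save` are
/// stand-ins for `Index::load`, `Index::new`, and `Index::save`.
fn load_or_build<T, E>(
    path: &Path,
    load: impl FnOnce(&Path) -> Result<T, E>,
    build: impl FnOnce() -> Result<T, E>,
    save: impl FnOnce(&T, &Path) -> Result<(), E>,
) -> Result<T, E> {
    if path.exists() {
        // A stale or corrupted cache file falls through to a rebuild.
        if let Ok(cached) = load(path) {
            return Ok(cached);
        }
    }
    let fresh = build()?;
    // A failed cache write is ignored here: the freshly built value is still usable.
    let _ = save(&fresh, path);
    Ok(fresh)
}
```

Ignoring a failed cache write is a design choice of this sketch; a caller that needs to know about it could return the error instead.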

2. Binary Format Specification

The serialization uses a custom binary format optimized for FSM representation:

Format Structure (uncompressed)

Component                  Size         Description
vocab_size                 32 bits      Size of the vocabulary
eos_token_id               32 bits      End-of-sequence token ID
initial_state_id           32 bits      ID of the initial state
num_final_states           32 bits      Number of final states
final_states               32 bits × N  Array of final state IDs
index_type                 8 bits       Format version identifier (currently type 1)
num_states                 32 bits      Number of states with transitions
For each state:
└─ state_id                32 bits      Current state ID
└─ num_transitions         32 bits      Number of transitions from this state
└─ For each transition:
   └─ token_id             32 bits      Token that triggers the transition
   └─ next_state_id        32 bits      Destination state ID

Key Features:

  • All integers stored in little-endian format
  • The entire structure is compressed with gzip before writing to disk
  • The index_type field allows for future format extensions
  • Fixed-size fields enable efficient parsing
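
As a rough illustration of the uncompressed layout, the field order in the table can be written out as follows. This is a sketch, not the code in src/index.rs: `RawIndex`, `put_u32`, and `encode` are hypothetical names, the struct only mirrors the table rather than the real Index type, and the gzip step (done with flate2 in the PR) is left out.

```rust
/// Hypothetical in-memory view of an Index; the fields mirror the format
/// table, not the actual `Index` struct.
struct RawIndex {
    vocab_size: u32,
    eos_token_id: u32,
    initial_state_id: u32,
    final_states: Vec<u32>,
    /// (state_id, [(token_id, next_state_id), ...])
    states: Vec<(u32, Vec<(u32, u32)>)>,
}

/// Format version written into the `index_type` byte.
const INDEX_TYPE: u8 = 1;

/// Appends one 32-bit integer in little-endian byte order.
fn put_u32(out: &mut Vec<u8>, value: u32) {
    out.extend_from_slice(&value.to_le_bytes());
}

/// Serializes the index into the uncompressed layout, field by field.
fn encode(index: &RawIndex) -> Vec<u8> {
    let mut out = Vec::new();
    put_u32(&mut out, index.vocab_size);
    put_u32(&mut out, index.eos_token_id);
    put_u32(&mut out, index.initial_state_id);
    put_u32(&mut out, index.final_states.len() as u32);
    for &state in &index.final_states {
        put_u32(&mut out, state);
    }
    out.push(INDEX_TYPE);
    put_u32(&mut out, index.states.len() as u32);
    for (state_id, transitions) in &index.states {
        put_u32(&mut out, *state_id);
        put_u32(&mut out, transitions.len() as u32);
        for &(token_id, next_state_id) in transitions {
            put_u32(&mut out, token_id);
            put_u32(&mut out, next_state_id);
        }
    }
    out
}
```

Under this layout, the uncompressed size in bytes is 21 + 4 × (number of final states) + 8 × (number of states) + 8 × (total number of transitions). One final state, one state, and one transition therefore take 21 + 4 + 8 + 8 = 41 bytes.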

Full specification available in INDEX_BINARY_FORMAT.md.
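
Reading the format back can be sketched in the same spirit. The parser below assumes the bytes have already been gunzipped and returns an error instead of panicking on truncated input. `RawIndex`, `Reader`, and `decode` are hypothetical names for illustration, and the `String` error type is a placeholder for whatever error type the real implementation uses.

```rust
/// Hypothetical in-memory view of an Index, mirroring the format table.
#[derive(Debug, PartialEq)]
struct RawIndex {
    vocab_size: u32,
    eos_token_id: u32,
    initial_state_id: u32,
    final_states: Vec<u32>,
    /// (state_id, [(token_id, next_state_id), ...])
    states: Vec<(u32, Vec<(u32, u32)>)>,
}

/// Cursor over a byte slice that reports truncated input as an error.
struct Reader<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Result<&'a [u8], String> {
        let bytes: &'a [u8] = self.bytes;
        let end = self.pos + n;
        if end > bytes.len() {
            return Err(format!("unexpected end of data at byte {}", self.pos));
        }
        let slice = &bytes[self.pos..end];
        self.pos = end;
        Ok(slice)
    }

    fn u8(&mut self) -> Result<u8, String> {
        Ok(self.take(1)?[0])
    }

    fn u32(&mut self) -> Result<u32, String> {
        let b = self.take(4)?;
        Ok(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}

/// Parses the uncompressed layout in the order given by the format table.
fn decode(bytes: &[u8]) -> Result<RawIndex, String> {
    let mut r = Reader { bytes, pos: 0 };
    let vocab_size = r.u32()?;
    let eos_token_id = r.u32()?;
    let initial_state_id = r.u32()?;
    let num_final_states = r.u32()?;
    // Vectors grow as data is read, so a bogus count cannot force a huge allocation.
    let mut final_states = Vec::new();
    for _ in 0..num_final_states {
        final_states.push(r.u32()?);
    }
    let index_type = r.u8()?;
    if index_type != 1 {
        return Err(format!("unsupported index_type {}", index_type));
    }
    let num_states = r.u32()?;
    let mut states = Vec::new();
    for _ in 0..num_states {
        let state_id = r.u32()?;
        let num_transitions = r.u32()?;
        let mut transitions = Vec::new();
        for _ in 0..num_transitions {
            let token_id = r.u32()?;
            let next_state_id = r.u32()?;
            transitions.push((token_id, next_state_id));
        }
        states.push((state_id, transitions));
    }
    Ok(RawIndex { vocab_size, eos_token_id, initial_state_id, final_states, states })
}
```

Rejecting an unknown `index_type` early is one way to use the version byte; how the real loader treats unknown versions is not stated in this description.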

3. Index Player Tool (tools/index_player.html)

A standalone HTML/CSS/JavaScript tool for debugging and exploring FSM indices.

Purpose

The Index Player serves as a debug and explanation tool that allows developers to:

  • Visualize FSM state transitions
  • Understand why a model might generate specific tokens
  • Explore valid token sequences for any given state
  • Debug regex-vocabulary compatibility issues
  • Track paths through the automaton

How It Works

The tool is a fully static, single-file application that runs entirely in the browser:

  1. Load Index File: Upload a binary .outlines file created with Index::save()

    • Automatically decompresses gzip using the browser's native DecompressionStream API
    • Parses the binary format and reconstructs the FSM in memory
  2. Load Vocabulary (Optional): Upload a vocab.json file from HuggingFace

    • Maps token IDs to their string representations
    • Enables human-readable token display
  3. Interactive Exploration:

    • Current State Display: Shows the active state (highlighted if final)
    • Path History: Visual timeline of selected tokens
    • Generated Text: Real-time concatenation of token values (when vocab is loaded)
    • Available Transitions: Grid of all valid next tokens from current state
    • Navigation Controls:
      • Click any transition card to advance
      • Or type a token ID manually
      • "Go Back" to undo last transition
      • "Reset" to return to initial state
  4. Visual Feedback:

    • Color-coded final states (green badges)
    • Token values highlighted in purple/gradient colors
    • Error messages for invalid transitions
    • Compact info panel showing FSM metadata
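
The navigation behaviour described above (advance on a token, reject invalid tokens, go back, reset) boils down to a current state plus a history stack. The sketch below shows that logic in Rust for consistency with the rest of the PR. It is not the tool's JavaScript, and `Player` and its methods are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical navigation state modelled on the controls listed above.
struct Player {
    /// state -> (token -> next state)
    transitions: HashMap<u32, HashMap<u32, u32>>,
    final_states: HashSet<u32>,
    initial_state: u32,
    current_state: u32,
    /// (state before the step, token taken), newest last
    history: Vec<(u32, u32)>,
}

impl Player {
    fn new(
        initial_state: u32,
        final_states: HashSet<u32>,
        transitions: HashMap<u32, HashMap<u32, u32>>,
    ) -> Self {
        Player { transitions, final_states, initial_state, current_state: initial_state, history: Vec::new() }
    }

    /// Valid (token, next state) pairs from the current state, sorted by token.
    fn available(&self) -> Vec<(u32, u32)> {
        let mut out: Vec<(u32, u32)> = self
            .transitions
            .get(&self.current_state)
            .map(|t| t.iter().map(|(&tok, &next)| (tok, next)).collect())
            .unwrap_or_default();
        out.sort_unstable();
        out
    }

    /// Advances on `token_id`, or returns an error if it is not valid here.
    fn step(&mut self, token_id: u32) -> Result<u32, String> {
        let next = self
            .transitions
            .get(&self.current_state)
            .and_then(|t| t.get(&token_id))
            .copied()
            .ok_or_else(|| {
                format!("token {} is not valid in state {}", token_id, self.current_state)
            })?;
        self.history.push((self.current_state, token_id));
        self.current_state = next;
        Ok(next)
    }

    /// "Go Back": undoes the last transition; returns false at the start.
    fn go_back(&mut self) -> bool {
        match self.history.pop() {
            Some((previous_state, _token)) => {
                self.current_state = previous_state;
                true
            }
            None => false,
        }
    }

    /// "Reset": returns to the initial state and clears the path.
    fn reset(&mut self) {
        self.history.clear();
        self.current_state = self.initial_state;
    }

    fn is_final(&self) -> bool {
        self.final_states.contains(&self.current_state)
    }

    /// Tokens selected so far, in order (the "Path History").
    fn path(&self) -> Vec<u32> {
        self.history.iter().map(|&(_, token)| token).collect()
    }
}
```

Storing the previous state alongside each token keeps "Go Back" a constant-time pop, with no need to replay the path from the initial state.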

Screenshot

[Image: Index Player screenshot]

Use Cases

  • Model Debugging: Understand why a model generated unexpected output by tracing the valid path through the Index
  • Regex Validation: Verify that a regex pattern correctly matches expected token sequences
  • Education: Learn how FSM-based constrained generation works
  • Token Analysis: Discover which tokens are valid at any point in the generation process

Testing

Added comprehensive Rust tests for serialization:

  • test_save_and_load: Verifies round-trip serialization preserves Index integrity
  • test_save_and_load_multibyte: Tests with multi-byte Unicode characters (emojis)
  • test_load_nonexistent_file: Error handling for missing files
  • test_load_corrupted_file: Error handling for invalid data
  • test_save_preserves_file_size: Validates compression is working

All tests pass successfully.

Dependencies

  • Added flate2 crate for gzip compression/decompression

Files Changed

  • src/index.rs: Added save() and load() methods
  • src/error.rs: Added IOError variant for I/O operations
  • Cargo.toml: Added flate2 dependency
  • INDEX_BINARY_FORMAT.md: Complete binary format specification
  • tools/index_player.html: New interactive debugging tool
  • tests/create_index_binary.py: Example script for creating binary indices

Breaking Changes

None. This is a purely additive change.

agourdel, Nov 24 '25 21:11