coral icon indicating copy to clipboard operation
coral copied to clipboard

Introduce Symbolic Constraint Solver for SQL-Driven Data Generation

Open wmoustafa opened this issue 1 month ago • 0 comments

Introduce Symbolic Constraint Solver for SQL-Driven Data Generation

Overview

This PR introduces coral-data-generation, a symbolic constraint solver that inverts SQL expressions to derive input domain constraints. Instead of forward evaluation (generate → test → reject), it solves backward from predicates to derive what inputs must satisfy, enabling efficient test data generation with guaranteed constraint satisfaction.

Motivation

Problem: Traditional test data generation uses rejection sampling—generate random values, evaluate SQL predicates, discard mismatches. This is inefficient for complex nested expressions and cannot detect unsatisfiable queries.

Solution: Symbolic inversion treats SQL expressions as mathematical transformations with inverse functions. Starting from output constraints (e.g., = '50'), the system walks expression trees inward, applying inverse operations to derive input domains.

Examples

1. Nested String Operations

WHERE LOWER(SUBSTRING(name, 1, 3)) = 'abc'
→ name ∈ RegexDomain("^[aA][bB][cC].*$")
Generates: "Abc", "ABC123", "abcdef"

2. Cross-Domain Arithmetic

WHERE CAST(age * 2 AS STRING) = '50'
→ age ∈ IntegerDomain([25])

3. Date Extraction with Type Casting

WHERE SUBSTRING(CAST(birthdate AS STRING), 1, 4) = '2000'
→ birthdate ∈ DateDomain intersect RegexDomain("^2000-.*$")
Generates: 2000-01-15, 2000-12-31, 2000-06-20

4. Complex Nested Substring

WHERE SUBSTRING(SUBSTRING(product_code, 5, 10), 1, 3) = 'XYZ'
→ product_code must have 'XYZ' at positions 5-7
→ product_code ∈ RegexDomain("^.{4}XYZ.*$")

5. Contradiction Detection

WHERE SUBSTRING(name, 1, 4) = '2000' AND SUBSTRING(name, 1, 4) = '1999'
→ Empty domain (unsatisfiable - no data generated)

6. Date String Pattern Matching

WHERE CAST(order_date AS STRING) LIKE '2024-12-%'
→ order_date ∈ RegexDomain("^2024-12-.*$") ∩ DateFormatConstraint
Generates: 2024-12-01, 2024-12-15, 2024-12-31

Key Components

1. Domain System

  • Domain<T, D>: Abstract constraint representation supporting intersection, union, emptiness checking
  • RegexDomain: Automaton-backed string constraints (powered by dk.brics.automaton)
  • IntegerDomain: Interval-based numeric constraints with arithmetic closure
  • Cross-domain conversions: CastRegexTransformer bridges string ↔ numeric types

2. Transformer Architecture

Pluggable symbolic inversion functions implementing DomainTransformer:

  • SubstringRegexTransformer: Inverts SUBSTRING(x, start, len) with positional constraints
  • LowerRegexTransformer: Inverts LOWER(x) via case-insensitive regex generation
  • CastRegexTransformer: Cross-domain CAST inversion (string ↔ integer ↔ date)
  • PlusRegexTransformer: Arithmetic inversion: x + c = valuex = value - c
  • TimesRegexTransformer: Multiplication inversion: x * c = valuex = value / c

3. Relational Preprocessing

Normalizes Calcite RelNode trees for symbolic analysis:

  • ProjectPullUpController: Fixed-point projection normalization
  • CanonicalPredicateExtractor: Extracts predicates with global field indexing
  • DnfRewriter: Converts to Disjunctive Normal Form for independent disjunct solving

4. Solver

DomainInferenceProgram: Top-down expression tree traversal with domain refinement at each step, detecting contradictions via empty domain intersection.

Technical Approach

Symbolic Inversion: For nested expression f(g(h(x))) = constant:

  1. Create output domain from constant
  2. Apply f⁻¹ → intermediate domain
  3. Apply g⁻¹ → refined domain
  4. Apply h⁻¹ → input constraint on x

Contradiction Detection: Multiple predicates on same variable → domain intersection. Empty result = unsatisfiable query.

Extensibility: Architecture supports multi-table inference (join propagation), fixed-point iteration (recursive constraints), and arbitrary domain types (date, decimal, enum).

Testing

Integration Tests (RegexDomainInferenceProgramTest): 14+ test scenarios covering simple/nested transformations, cross-domain CAST operations, arithmetic inversion, and contradiction detection. All tests validate generated samples satisfy original SQL predicates.

Documentation

This module comes with aomprehensive README with conceptual model, examples, and API reference.

Future Extensibility

The architecture naturally extends to additional domains (DecimalDomain, DateDomain), more transformers (CONCAT, REGEXP_EXTRACT), multi-table inference (join constraint propagation), and aggregate support (cardinality constraints).

wmoustafa avatar Dec 07 '25 07:12 wmoustafa