jq icon indicating copy to clipboard operation
jq copied to clipboard

Migrate away from oniguruma?

Open dcermak opened this issue 8 months ago • 8 comments

https://github.com/kkos/oniguruma has been archived recently and will not receive any further updates. I would suggest to move away from it and switch to a different regex engine

dcermak avatar May 05 '25 11:05 dcermak

Apologies for posting "AI slop", but I'm auditing Claude Sonnet 4, figured I'd share what it came up with, looks like a decent start. Note that this is not a one-shot, the initial response had some old references (as in not as current as I would like). Edit: OK, it's really not one-shot, I ended up doing some searches and feeding it newer sources, and now I think I'm about to run https://github.com/BurntSushi/rebar locally to see what kind of benchmarks I get, whoops, stumbled into a rabbithole.

Top Replacement Options

PCRE2 is the most logical replacement for several reasons:

  • It provides compatibility with Oniguruma syntax^1, making migration easier
  • It's actively maintained and widely adopted^2
  • Recent comprehensive benchmarks show PCRE2-JIT performing excellently across diverse workloads^3
  • Has comprehensive Unicode support when configured properly
  • It supports some alternative regular expression syntax that does not conflict with the Perl syntax in order to provide some compatibility with regular expressions in Python, .NET, and Oniguruma^1

RE2 is another strong candidate:

  • Guarantees linear time complexity, preventing regex denial-of-service attacks
  • Recent benchmarks show solid performance across many scenarios^3
  • However, RE2 can be slower when capturing groups is required, which could be problematic for jq's use cases

Rust's regex crate (if interfacing from C is acceptable):

  • Shows excellent performance in recent comprehensive benchmarks, ranking first overall in search performance^3
  • Modern, well-maintained implementation with extensive testing
  • The rust regex crate now uses the same SIMD algorithm (Teddy) invented as part of Hyperscan^4

Recent Performance Data

According to the most comprehensive recent regex engine benchmark (rebar, 2023)^3, the overall rankings by geometric mean of speed ratios across diverse workloads are:

Search Performance Leaders:

  1. rust/regex - 1.54 ratio (best overall)
  2. dotnet/compiled - 1.80 ratio
  3. pcre2/jit - 1.86 ratio
  4. dotnet/nobacktrack - 2.33 ratio
  5. re2 - 2.48 ratio

Compile Performance Leaders:

  1. rust/regex/lite - 1.21 ratio (best overall)
  2. regress - 1.31 ratio
  3. icu - 1.33 ratio
  4. pcre2/jit - 1.36 ratio

Recommendation

PCRE2 appears to be the best replacement because:

  1. It offers explicit Oniguruma compatibility features
  2. Has excellent performance with JIT compilation, ranking 3rd overall in recent comprehensive benchmarks^3
  3. Is mature, stable, and actively maintained^2
  4. Supports the full range of regex features jq users expect
  5. Maintains backward compatibility for existing jq scripts
  6. Shows consistently strong performance across diverse regex patterns and workloads^3

The migration path would likely involve updating jq's build system to link against PCRE2 instead of Oniguruma, with minimal changes to the actual regex functionality that users depend on.

mrienstra avatar May 23 '25 20:05 mrienstra

Update: I ran rebar locally, below are the results:

Executive Summary

After comprehensive testing across 14 different benchmark suites with various patterns and workloads, PCRE2 with JIT emerges as the best choice for replacing Oniguruma in jq, with one important caveat: it remains vulnerable to pathological patterns that can cause exponential runtime behavior. Rust regex provides guaranteed linear-time execution that prevents regex denial-of-service attacks.

Test Environment

  • Hardware: MacBook Pro with Apple M1 Max (10 cores: 8 performance, 2 efficiency)
  • Memory: 64 GB
  • OS: macOS 14.7.5
  • Power: AC Power
  • System Load: 3.45 (prior to beginning test)

Benchmark Methodology

  • Target runtime: ~4.5 seconds per benchmark
  • Iterations: Automatically adjusted based on performance
    • Fast operations: Up to 1,000,000 iterations
    • Slow operations: As few as 2-4 iterations
    • Average: ~96,500 iterations per benchmark
  • Low iteration counts (2-10) occurred for pathologically slow cases:
    • PCRE2 on non-ASCII text without JIT (750ms+ per iteration)
    • Pathological backtracking patterns
    • Complex compilation benchmarks
  • Total benchmarks run: 180 successful tests across 14 benchmark suites
  • Statistical approach: Median times reported to minimize outlier impact

Key Findings

1. PCRE2/JIT Performance is Excellent

With JIT enabled, PCRE2 shows:

  • Competitive performance with Rust regex across most workloads
  • Superior performance in several important categories:
    • Case-insensitive matching: Often faster than Rust regex
    • Unicode character data parsing: 2x faster than Rust regex (549.7 MB/s vs 269.6 MB/s)
    • Word matching (English): Faster than all other engines
    • Complex capture patterns: Best overall performance

2. Compilation Performance

For pattern compilation (critical for jq's dynamic regex usage):

  • Rust regex is fastest (baseline)
  • PCRE2/JIT is 3-4x slower than Rust regex
  • PCRE2 (non-JIT) is only slightly faster than JIT version
  • RE2 is 10-20x slower than Rust regex

This is acceptable for jq since:

  • Compilation happens once per pattern
  • The 3-4x difference is measured in microseconds
  • Runtime performance matters more for jq's use cases

3. Unicode Handling

All engines handle Unicode well, but with different characteristics:

  • PCRE2 without JIT: Catastrophically slow (1000-10000x slower)
  • PCRE2/JIT: Excellent, matches Rust regex performance
  • RE2: Good but inconsistent (fails on some Unicode tests)
  • Rust regex: Excellent across all Unicode tests

4. Pathological Cases

For patterns that can cause exponential backtracking:

  • Rust regex: Best protection (no backtracking)
  • RE2: Good protection (designed for linear time)
  • PCRE2/JIT: Reasonable performance but can still exhibit quadratic behavior
  • PCRE2: Poor performance on pathological cases

5. Feature Completeness

Looking at which benchmarks each engine could run:

  • Rust regex: Ran all benchmarks successfully
  • PCRE2/JIT: Ran all benchmarks except one (AWS keys full pattern, more details below)
  • RE2: Failed several benchmarks (Unicode, complex patterns)
  • PCRE2: Ran all benchmarks but with poor performance

Detailed Performance Analysis

Literal Matching

Engine      English    Russian    Chinese    Case-Insensitive
---------   --------   --------   --------   ----------------
Rust regex  23.0 GB/s  23.5 GB/s  27.1 GB/s  Best overall
PCRE2/JIT   16.6 GB/s  22.1 GB/s  28.4 GB/s  Often fastest
RE2         7.6 GB/s   326 MB/s   920 MB/s   Moderate
PCRE2       4.8 GB/s   2.0 MB/s   49 MB/s    Poor

Complex Patterns

  • Date parsing: PCRE2/JIT is 6-10x faster than RE2, within 2x of Rust
  • Lexer patterns: PCRE2/JIT matches or beats Rust regex
  • AWS key detection: PCRE2/JIT handles most patterns well

Capture Groups

  • PCRE2/JIT: Best for complex capture patterns (549.7 MB/s vs 269.6 MB/s for Unicode parsing)
  • Rust regex: Good but not always optimal for captures
  • RE2: Moderate performance, some failures

Recommendation for jq

PCRE2 with JIT is the recommended choice for replacing Oniguruma because:

  1. Performance: Within 2x of Rust regex for most workloads, often faster for jq's use cases
  2. Compatibility: Supports Oniguruma features needed by existing jq scripts
  3. Unicode: Excellent support when JIT is enabled
  4. Maturity: Battle-tested in production environments
  5. Features: Supports backreferences, lookbehind, and other advanced features jq users expect

Implementation Notes

  1. JIT is mandatory: PCRE2 without JIT has unacceptable performance
  2. Compile with JIT support: Ensure PCRE2 is built with SUPPORT_JIT defined
  3. Enable JIT at runtime: Call pcre2_jit_compile() after pattern compilation
  4. Match limits: Set appropriate match limits to prevent DoS on pathological patterns

Performance Trade-offs

Compared to keeping Oniguruma:

  • ✅ Much better performance (10-100x faster with JIT)
  • ✅ Active maintenance and security updates
  • ✅ Better Unicode support
  • ✅ Compatible syntax (with minor adjustments)
  • ❌ Slightly larger binary size
  • ❌ JIT compilation adds minor overhead
  • ❌ Still vulnerable to ReDoS attacks

Should the ReDoS Vulnerability Tip the Scales to Rust?

Probably not for jq, for several reasons:

  1. Oniguruma has the same vulnerability - jq already has this issue, so PCRE2 doesn't make it worse
  2. jq is primarily a command-line tool - Unlike web services, jq typically processes trusted input
  3. Feature compatibility matters - Many jq scripts rely on backreferences and lookbehind assertions that Rust regex doesn't support
  4. PCRE2 has match limits - Can set MATCH_LIMIT and MATCH_LIMIT_DEPTH to bound execution time
  5. Performance on typical patterns - PCRE2/JIT often outperforms Rust on the complex Unicode patterns common in jq

However, if jq were being used in a server context processing untrusted input, Rust regex's guaranteed linear-time execution would become much more important.

Alternative Consideration

If ReDoS protection is critical, consider:

  1. Using Rust regex as primary engine with PCRE2 as fallback for unsupported features
  2. Pre-screening patterns with a ReDoS detector
  3. Implementing strict timeouts and resource limits
  4. Documenting the security implications clearly

Addendum: Some specific benchmarks of interest

  1. Pathological Quadratic Case - https://github.com/BurntSushi/rebar/blob/19aa8e8/benchmarks/definitions/curated/14-quadratic.toml
  • Pattern: .*[^A-Z]|[A-Z] on 1000 As
  • Performance (10x benchmark):
    • Rust regex: 1.2 MB/s (baseline)
    • RE2: 864 KB/s (1.4x slower)
    • PCRE2/JIT: 775 KB/s (1.6x slower)
    • PCRE2: 196 KB/s (6.3x slower)
  • Note: All engines tested show quadratic behavior, but PCRE2 without JIT is significantly worse
  1. CloudFlare ReDoS Pattern - https://github.com/BurntSushi/rebar/blob/19aa8e8/benchmarks/definitions/curated/06-cloud-flare-redos.toml#L124
  • Pattern: .*.*=.* on ~10KB haystack
  • Performance:
    • Rust regex: 44.8 GB/s (baseline)
    • RE2: 219.5 MB/s (209x slower)
    • PCRE2/JIT: 418.1 KB/s (112,000x slower)
    • PCRE2: 20.4 KB/s (2.3 million times slower!)
  • Note: This shows catastrophic backtracking - the pattern that took down Cloudflare
  1. Complex AWS Keys Pattern - https://github.com/BurntSushi/rebar/blob/19aa8e8/benchmarks/definitions/curated/09-aws-keys.toml#L45
  • Performance: PCRE2/JIT failed to run this pattern (error during execution), other engines handled it without issue
  • Complex alternation with multi-line lookaround appears to hit a PCRE2 edge case
  1. Bounded Repeat with Context - https://github.com/BurntSushi/rebar/blob/19aa8e8/benchmarks/definitions/curated/10-bounded-repeat.toml#L99
  • Pattern: [A-Za-z]{10}\s+[\s\S]{0,100}Result[\s\S]{0,100}\s+[A-Za-z]{10}
  • Performance:
    • PCRE2/JIT: 286.2 MB/s (fastest!)
    • Rust regex: 77.6 MB/s (3.7x slower than PCRE2/JIT)
    • PCRE2: 64.7 MB/s (4.4x slower than PCRE2/JIT)
    • RE2: 50.5 MB/s (5.7x slower than PCRE2/JIT)
  • Note: An example of PCRE2/JIT beating Rust - PCRE2/JIT often excels at complex patterns with captures
  1. Unicode Letter Matching - https://github.com/BurntSushi/rebar/blob/19aa8e8/benchmarks/definitions/curated/10-bounded-repeat.toml#L71
  • Pattern: \p{L}{8,13} on Russian text
  • Performance:
    • Rust regex: 456.3 MB/s (baseline)
    • PCRE2/JIT: 230.5 MB/s (2x slower)
    • RE2: 4.7 MB/s (96x slower)
    • PCRE2: 406 KB/s (1,151x slower)
  • Note: PCRE2 is 580x slower than PCRE2/JIT

mrienstra avatar May 25 '25 19:05 mrienstra

Anyone know how similar https://github.com/k-takata/Onigmo is to oniguruma? it seems to be the oniguruma fork ruby uses since ruby 2.0. Also would be interesting to see what paths other projects are taking. But think i'm mostly concerned about breaking backward compatibility.

wader avatar May 25 '25 20:05 wader

Although Onigmo looks easier to migrate from oniguruma, it's also maintained by one contributor and there're no updates recently. Luckily, oniguruma's development stopped after fixing issues with newer compiler, but Onigmo does not fix the issue (specifically about incompatible-pointer-types). PCRE2's development looks very active and promising. But let's migrate after the next release.

itchyny avatar May 29 '25 12:05 itchyny

Based on benchmarks run on 2025-05-29 comparing Ruby/Onigmo against PCRE2/JIT, PCRE2, Rust/regex, and RE2 (using rebar).

vs. PCRE2/JIT

  • Literal patterns: 3-10x slower (01-literal)
    • English text: 4x slower (2.3 GB/s vs 16.4 GB/s)
    • Case-insensitive: 4x slower (1985 MB/s vs 7.9 GB/s)
    • Russian text: 11x slower (2.1 GB/s vs 22.1 GB/s)
  • Complex patterns: 6-125x slower
    • Date extraction: 125x slower (952 KB/s vs 12.2 MB/s) (03-date)
    • Ruff noqa: 12x slower (99 MB/s vs 458 MB/s) (04-ruff-noqa)
  • Unicode processing: 11-105x slower
    • Russian case-insensitive: 105x slower (89.8 MB/s vs 9.2 GB/s) (01-literal)

vs. PCRE2 (no JIT)

  • Generally competitive or faster
  • Russian text: 9x faster than PCRE2 without JIT
    • PCRE2 catastrophically slow on Unicode without JIT (131 KB/s vs Ruby's 2.1 GB/s)
  • Compilation: Similar speeds (microseconds range)

vs. Rust/regex

  • Consistently slower: 5-20x on most patterns
  • CloudFlare ReDoS: 109x slower (3.8 MB/s vs 408 MB/s) (06-cloud-flare-redos)
  • Bounded repeats: 15x slower (34 MB/s vs 498 MB/s) (10-bounded-repeat)

vs. RE2

  • Mixed results: Sometimes faster, sometimes slower
  • Simple patterns: Comparable performance
  • Complex patterns: Generally 2-10x slower

Unicode Limitations

Ruby/Onigmo has ASCII-only character classes that cause benchmark failures (Onigmo documentation):

"Проверка".scan(/\w+/)     # => [] (empty - should match Cyrillic)
"123٤٥٦".scan(/\d+/)       # => ["123"] (misses Arabic digits)

Failed benchmarks due to Unicode issues:

  • 08-words/all-russian: Expected 107,391 matches, got 402 (08-words)
  • 08-words/long-russian: Expected 5,481 matches, got 12
  • 03-date/unicode: Expected 111,841 matches, got 111,817 (03-date)
  • 03-date/compile-unicode: Expected 5 matches, got 2

Multi-Pattern Limitations

Ruby lacks APIs for simultaneous multi-pattern matching, causing 8 benchmark failures:

Summary

Key advantages:

  • No catastrophic backtracking (handles pathological patterns safely)
  • Much faster than PCRE2 without JIT on Unicode text
  • Reasonable compilation times

Key disadvantages:

  • Significantly slower than modern optimized engines
  • ASCII-only \w, \b, \d character classes
  • No multi-pattern support (in Ruby's API, Onigmo by itself has support)

PS: PR to add Ruby/Onigmo to rebar

mrienstra avatar May 29 '25 22:05 mrienstra

What's next?

dlangille avatar Oct 24 '25 14:10 dlangille

I just found this thread after following the link to oniguruma in the jq documentation (see https://jqlang.org/manual/#regular-expressions) and noticing that oniguruma is no longer maintained. I'm only commenting in case it's useful for y'all to be reminded that the oniguruma is a fairly user-visible dependency.

RespiteSage avatar Oct 27 '25 18:10 RespiteSage

If anybody was thinking about using the Rust version, an issue is probably going to be platform support where Rust isn't available. This is another retroactive argument for PCRE2/JIT.

dad4x avatar Nov 19 '25 00:11 dad4x