noseyparker Rules with multibyte UTF-8 characters do not work right

Describe the bug Rules that contain multibyte UTF-8 characters do not behave as expected: the case-insensitive `(?i)` flag does not apply to them, and they cannot be used inside character classes.

To Reproduce Here is a sample file, utf8rules.yml:

rules:

- name: UTF-8 Test Rule
  id: utf8.1

  # regular single-byte characters work without surprise
  pattern: |
    (?x)
    (
      good\ day
    )

  examples:
  - 'good day'

  negative_examples:
  - 'Good Day!'


# literal utf-8 multibyte characters also seem to work without surprise
- name: UTF-8 Test Rule
  id: utf8.2

  pattern: |
    (?x)
    (
      Güten\ Tag
    )

  examples:
  - 'Güten Tag'

  negative_examples:
  - 'güten Tag'


# When you use the case-insensitive (?i) flag, utf-8 multibyte characters DON'T
# work as you expect; presumably the single bytes of the `ü` are individually
# handled case-insensitively, which is the wrong thing
- name: UTF-8 Test Rule
  id: utf8.3

  pattern: |
    (?x)(?i)
    (
      Güten\ Tag
    )

  examples:
  - 'Güten Tag'
  - 'güten tag'

  negative_examples:
  # one would like this to actually match, but the (?i) flag doesn't interact
  # properly with multibyte characters in Nosey Parker
  - 'GÜTEN TAG'


# You can explicitly specify different multibyte UTF-8 characters using regex
# alternation, so as to verbosely approximate case-insensitivity.
# But you have to use regex alternation, not character classes, to avoid a
# Vectorscan error about `Unicode not allowed here`.
- name: UTF-8 Test Rule
  id: utf8.4

  pattern: |
    (?x)(?i)
    (
      G (?: ü | Ü ) ten\ Tag
    )

  examples:
  - 'Güten Tag'
  - 'güten tag'
  - 'GÜTEN TAG'


rulesets:

- name: UTF-8 Tests
  description: 'Tests for UTF-8 rule and input handling'
  id: utf8

  include_rule_ids:
  - utf8.1
  - utf8.2
  - utf8.3
  - utf8.4

Validate with noseyparker rules check --rules-path utf8rules.yml

Expected behavior Multibyte UTF-8 sequences should work in all pattern contexts (character classes, case-insensitive matching, etc.), without surprise.

Actual behavior Workarounds are required, such as spelling out each case variant with regex alternation (see rule utf8.4).
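The guess in the utf8.3 comment (that `(?i)` is applied to the individual bytes of `ü`) is consistent with the observed results. The following is a minimal sketch of that guess only, not of Nosey Parker's or Vectorscan's actual code: the hypothetical helper `bytewise_caseless_eq` compares two byte strings while folding case for ASCII bytes alone.

```rust
/// Hypothetical stand-in for a matcher that treats pattern and input as raw
/// bytes and folds case only for ASCII letters (an assumption made for
/// illustration, not a description of Vectorscan internals).
fn bytewise_caseless_eq(pattern: &[u8], input: &[u8]) -> bool {
    pattern.eq_ignore_ascii_case(input)
}

/// Returns the UTF-8 encoding of a single character as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

Here `utf8_bytes('ü')` is `[0xC3, 0xBC]` and `utf8_bytes('Ü')` is `[0xC3, 0x9C]`. Neither trailing byte is an ASCII letter, so ASCII-only folding never equates the two: `bytewise_caseless_eq` accepts 'güten tag' against 'Güten Tag' but rejects 'GÜTEN TAG'.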

Output of noseyparker --version This applies to all versions of Nosey Parker

Jan 16 '25 20:01 bradlarsen

Internally, Nosey Parker uses two regex engines to do its matching.

First, Vectorscan matches the patterns of all enabled rules against the input simultaneously. This runs VERY fast (something like 4GB/s per core), but only provides the ID of the pattern that matched and the end byte offset of the match. It also has "all matches" semantics, unlike most other regex engines, so a further pass is required to discard all but the longest of each match. This all happens in the Matcher::scan_blob function.

The Vectorscan C++ library is exposed via the vectorscan-rs crate.

Second, Rust's regex crate is run with the appropriate pattern on each of the Vectorscan matches to determine the start of the match and the content of the capture groups.
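The two-stage flow described above can be sketched roughly as follows. This is a toy illustration, not the code in Matcher::scan_blob: `toy_first_stage` stands in for Vectorscan (reporting only a pattern ID and an end offset for every match of the pattern `a+`), `toy_find_start` stands in for the regex crate, and `keep_longest` is one assumed way of discarding the shorter matches. All names are hypothetical.

```rust
/// A raw first-stage hit: which pattern matched and where the match ends
/// (exclusive byte offset). No start offset is known at this point.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RawHit {
    pattern_id: usize,
    end: usize,
}

/// A resolved match with both offsets.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    pattern_id: usize,
    start: usize,
    end: usize,
}

/// Toy stand-in for the first stage: reports every end offset at which the
/// pattern `a+` (pattern 0) matches, mimicking "all matches" semantics.
fn toy_first_stage(input: &[u8]) -> Vec<RawHit> {
    let mut hits = Vec::new();
    for (i, &byte) in input.iter().enumerate() {
        if byte == b'a' {
            hits.push(RawHit { pattern_id: 0, end: i + 1 });
        }
    }
    hits
}

/// Toy stand-in for the second stage: walks backwards from `end` to find
/// the leftmost start of the `a+` match ending there.
fn toy_find_start(input: &[u8], end: usize) -> usize {
    let mut start = end;
    while start > 0 && input[start - 1] == b'a' {
        start -= 1;
    }
    start
}

/// Keeps only the longest match of each run of raw hits.
fn keep_longest(input: &[u8], mut hits: Vec<RawHit>) -> Vec<Span> {
    // Visit the largest end offset first, so the longest match is resolved
    // before any shorter match that ends inside it.
    hits.sort_by(|a, b| b.end.cmp(&a.end));
    let mut kept: Vec<Span> = Vec::new();
    for hit in hits {
        let covered = kept.iter().any(|s| {
            s.pattern_id == hit.pattern_id && s.start < hit.end && hit.end <= s.end
        });
        if covered {
            continue;
        }
        let start = toy_find_start(input, hit.end);
        kept.push(Span { pattern_id: hit.pattern_id, start, end: hit.end });
    }
    kept.reverse(); // return spans in ascending order of offset
    kept
}
```

On the input `xaaay aa`, the toy first stage reports five hits (end offsets 2, 3, 4, 7 and 8), and `keep_longest` reduces them to the two spans 1..4 and 6..8.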

Anyway, both of these regex engines support UTF-8 patterns and inputs. It should be possible to enhance Nosey Parker so that multibyte UTF-8 characters appearing in rules behave without surprise. However, this will take some thought and implementation work.

Note that Vectorscan's UTF-8 support is limited to matching on well-formed UTF-8 inputs. This is NOT the case for the regex crate, whose bytes::Regex doesn't have such restrictions.

Jan 16 '25 22:01 bradlarsen

The implementation that seems likely to give the best results is this:

- Add a proper regex parser / frontend to Nosey Parker
- Have the frontend compile away (?i) flags from the patterns, explicitly transforming them into character classes, or into regex alternation over byte sequences for multibyte UTF-8 characters

With this implementation:

- The pattern strings given to either vectorscan-rs or regex for matching would then not contain any multibyte characters or (?i) flags
- Vectorscan would be able to do UTF-8 matching even on invalid UTF-8 inputs (which has undefined behavior using its built-in UTF-8 support)
- There would be no surprises with multibyte character handling
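As a rough illustration of what "compiling away" (?i) could look like for a plain literal, here is a minimal sketch. It is not an existing Nosey Parker component: the function names are hypothetical, it handles literals only (no regex syntax), and it uses Rust's one-to-one `to_lowercase` / `to_uppercase` mappings as a simplified stand-in for full Unicode case folding.

```rust
/// Renders the UTF-8 encoding of `c` as a sequence of `\xNN` byte escapes.
fn escape_bytes(c: char) -> String {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf)
        .bytes()
        .map(|b| format!("\\x{:02x}", b))
        .collect()
}

/// Collects `c` plus its lowercase and uppercase forms, keeping only
/// one-to-one mappings (e.g. 'ß' uppercases to "SS" and is left alone).
fn case_variants(c: char) -> Vec<char> {
    let mut variants = vec![c];
    let lower: Vec<char> = c.to_lowercase().collect();
    let upper: Vec<char> = c.to_uppercase().collect();
    for mapped in [lower, upper] {
        if mapped.len() == 1 {
            variants.push(mapped[0]);
        }
    }
    variants.sort();
    variants.dedup();
    variants
}

/// Rewrites a literal so that case-insensitivity is spelled out as
/// alternation over byte sequences, with no (?i) flag and no non-ASCII
/// characters left in the resulting pattern.
fn caseless_literal(literal: &str) -> String {
    literal
        .chars()
        .map(|c| {
            let alts: Vec<String> =
                case_variants(c).into_iter().map(escape_bytes).collect();
            if alts.len() == 1 {
                alts[0].clone()
            } else {
                format!("(?:{})", alts.join("|"))
            }
        })
        .collect()
}
```

For example, `caseless_literal("Gü")` yields `(?:\x47|\x67)(?:\xc3\x9c|\xc3\xbc)`. With the regex crate, such a pattern would presumably have to be compiled through bytes::Regex with Unicode mode disabled, so that each `\xNN` escape denotes a byte rather than a code point.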

It would be a bit of work though.

Jan 16 '25 22:01 bradlarsen
