noseyparker
Rules with multibyte UTF-8 characters do not work right
Describe the bug
Rules that contain multibyte UTF-8 characters do not behave as you would expect.
To Reproduce
Here is a sample file, utf8rules.yml:
rules:
- name: UTF-8 Test Rule
  id: utf8.1
  # regular single-byte characters work without surprise
  pattern: |
    (?x)
    (
      good\ day
    )
  examples:
  - 'good day'
  negative_examples:
  - 'Good Day!'

# literal utf-8 multibyte characters also seem to work without surprise
- name: UTF-8 Test Rule
  id: utf8.2
  pattern: |
    (?x)
    (
      Güten\ Tag
    )
  examples:
  - 'Güten Tag'
  negative_examples:
  - 'güten Tag'

# When you use the case-insensitive (?i) flag, utf-8 multibyte characters DON'T
# work as you expect; presumably the single bytes of the `ü` are individually
# handled case-insensitively, which is the wrong thing
- name: UTF-8 Test Rule
  id: utf8.3
  pattern: |
    (?x)(?i)
    (
      Güten\ Tag
    )
  examples:
  - 'Güten Tag'
  - 'güten tag'
  negative_examples:
  # one would like this to actually match, but the (?i) flag doesn't interact
  # properly with multibyte characters in Nosey Parker
  - 'GÜTEN TAG'

# You can explicitly specify different multibyte UTF-8 characters using regex
# alternation, so as to verbosely approximate case-insensitivity.
# But you have to use regex alternation, not character classes, to avoid a
# Vectorscan error about `Unicode not allowed here`.
- name: UTF-8 Test Rule
  id: utf8.4
  pattern: |
    (?x)(?i)
    (
      G (?: ü | Ü ) ten\ Tag
    )
  examples:
  - 'Güten Tag'
  - 'güten tag'
  - 'GÜTEN TAG'

rulesets:
- name: UTF-8 Tests
  description: 'Tests for UTF-8 rule and input handling'
  id: utf8
  include_rule_ids:
  - utf8.1
  - utf8.2
  - utf8.3
  - utf8.4
Validate with noseyparker rules check --rules-path utf8rules.yml
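To see why rule utf8.3 accepts 'güten tag' but rejects 'GÜTEN TAG', it may help to look at the bytes involved. The sketch below is only an illustration and does not call Nosey Parker or Vectorscan; it assumes that caseless matching on a byte-oriented pattern folds ASCII letters only, which is a guess about the mechanism rather than something confirmed here. The helper names (hex_bytes, ascii_caseless_eq, unicode_caseless_eq) are made up for this example.

```rust
/// Render the UTF-8 encoding of a string as space-separated hex bytes.
fn hex_bytes(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{b:02x}"))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Caseless comparison that folds ASCII letters only.
/// ASSUMPTION: this stands in for how a byte-oriented (?i) might behave.
fn ascii_caseless_eq(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Caseless comparison using Unicode lowercase mappings from std.
fn unicode_caseless_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

Here hex_bytes("ü") gives "c3 bc" and hex_bytes("Ü") gives "c3 9c". Neither byte is an ASCII letter, so ASCII-only folding leaves both untouched: ascii_caseless_eq("Güten Tag", "GÜTEN TAG") is false, while unicode_caseless_eq on the same pair is true.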
Expected behavior
Multibyte UTF-8 sequences should work in all pattern contexts (character classes, etc.), without surprise.
Actual behavior
A bunch of workarounds are required.
Output of noseyparker --version
This applies to all versions of Nosey Parker
Internally, Nosey Parker uses two regex engines to do its matching.
First, Vectorscan simultaneously matches all the patterns of the enabled rules against the input. This runs VERY fast (something like 4GB/s per core), but only provides the ID of the pattern that matched and the end byte offset of the match. It also has "all matches" semantics, unlike most other regex engines, so an extra pass is needed to discard all but the longest of each match. This all happens in the Matcher::scan_blob function.
The Vectorscan C++ library is exposed via the vectorscan-rs crate.
Second, Rust's regex crate is run with the appropriate pattern on each of the Vectorscan matches to determine the start of the match and the content of the capture groups.
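The "discard all but the longest" pass can be pictured with a small sketch. This is not the code in Matcher::scan_blob. It assumes, purely for illustration, that the start offset of each candidate has already been resolved by the second stage, and it keeps the longest candidate per rule and start position. Span and keep_longest are hypothetical names.

```rust
use std::collections::HashMap;

/// A candidate match: which rule fired and the byte range it covers.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    rule_id: usize,
    start: usize,
    end: usize,
}

/// With "all matches" semantics, one rule can report several end offsets
/// for the same start. Keep only the longest span per (rule_id, start).
fn keep_longest(spans: &[Span]) -> Vec<Span> {
    let mut best: HashMap<(usize, usize), Span> = HashMap::new();
    for &s in spans {
        best.entry((s.rule_id, s.start))
            .and_modify(|cur| {
                if s.end > cur.end {
                    *cur = s;
                }
            })
            .or_insert(s);
    }
    // Sort for a deterministic result, since HashMap order is arbitrary.
    let mut out: Vec<Span> = best.into_values().collect();
    out.sort_by_key(|s| (s.start, s.rule_id));
    out
}
```

For example, three candidates for rule 0 starting at offset 2 and ending at 3, 4, and 5 collapse to the single span ending at 5.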
Anyway, both of these regex engines support UTF-8 patterns and inputs. It should be possible to enhance Nosey Parker so that multibyte UTF-8 characters appearing in rules behave without surprise. However, this will take some thought and implementation work.
Note that Vectorscan's UTF-8 support is limited to matching on well-formed UTF-8 inputs. This is NOT the case for the regex crate, whose bytes::Regex doesn't have such restrictions.
The possible implementation that seems like it would have the best quality is this:
- Add a proper regex parser / frontend to Nosey Parker
- Have the frontend compile away (?i) flags from the patterns, explicitly transforming them into character classes or regex alternation + byte sequences for multibyte UTF-8 characters
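As a rough sketch of what "compiling away" (?i) could look like for a plain literal, the function below rewrites each character as an alternation over the UTF-8 byte sequences of its case variants. It is not a regex parser and not Nosey Parker code: it handles literal text only, uses the standard library's lowercase/uppercase mappings rather than full Unicode case folding, and the names expand_case_insensitive and escape_bytes are invented here. The \xNN escapes are meant as raw bytes; how each engine interprets them (for example, whether the regex crate needs Unicode mode disabled) is something to check against that engine's documentation.

```rust
/// Write every byte of `s` as a \xNN escape.
fn escape_bytes(s: &str) -> String {
    s.bytes().map(|b| format!("\\x{b:02x}")).collect()
}

/// Expand a literal into a byte-level pattern fragment that matches it
/// regardless of case, without using the (?i) flag.
/// ASSUMPTION: std's to_lowercase/to_uppercase mappings are good enough;
/// a real frontend would likely want proper Unicode case folding.
fn expand_case_insensitive(literal: &str) -> String {
    let mut out = String::new();
    for c in literal.chars() {
        // The character itself plus its lower/upper mappings, deduplicated
        // while preserving order so the output is deterministic.
        let mut variants: Vec<String> = Vec::new();
        for v in [
            c.to_string(),
            c.to_lowercase().to_string(),
            c.to_uppercase().to_string(),
        ] {
            if !variants.contains(&v) {
                variants.push(v);
            }
        }
        let escaped: Vec<String> = variants.iter().map(|v| escape_bytes(v)).collect();
        if escaped.len() == 1 {
            // No case variants (space, digits, punctuation): emit bytes as-is.
            out.push_str(&escaped[0]);
        } else {
            out.push_str("(?:");
            out.push_str(&escaped.join("|"));
            out.push(')');
        }
    }
    out
}
```

Calling expand_case_insensitive("Gü") yields (?:\x47|\x67)(?:\xc3\xbc|\xc3\x9c), which is the same idea as the hand-written G (?: ü | Ü ) workaround in rule utf8.4, just generated mechanically and spelled in bytes.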
With this implementation:
- The pattern strings given to either vectorscan-rs or regex for matching would then not contain any multibyte characters or (?i) flags
- Vectorscan would be able to do UTF-8 matching even on invalid UTF-8 inputs (which has undefined behavior using its built-in UTF-8 support)
- There would be no surprises with multibyte character handling
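The point about invalid UTF-8 inputs can be illustrated without any regex engine. Once a pattern is expressed purely as bytes, matching never needs to decode the input, so stray invalid bytes elsewhere in a blob do not get in the way. The sketch below uses a naive byte search as a stand-in for the real matchers; find_bytes is a made-up helper.

```rust
/// Return the offset of the first occurrence of `needle` in `haystack`,
/// comparing raw bytes only. A naive stand-in for a byte-level matcher.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // windows(0) would panic, so treat the empty needle separately.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

Given the input b"\xff\xfeG\xc3\xbcten Tag", std::str::from_utf8 rejects the buffer because of the leading 0xff 0xfe bytes, yet find_bytes still locates the bytes of "Güten" at offset 2.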
It would be a bit of work though.