Simplify IBAN regex pattern and fix trailing character handling
Change Description
The IBAN recognizer used a complex regex with 8 capture groups and variable-length matching (3-5 characters). Replaced with a simpler pattern using consistent 4-character groups and 3 capture groups.
Pattern Changes
Before:
r"\b([A-Z]{2}[ \-]?[0-9]{2})(?=(?:[ \-]?[A-Z0-9]){9,30})((?:[ \-]?[A-Z0-9]{3,5}){2})"
r"([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?"
r"([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{1,3})?\b"
After:
r"(?<![A-Z0-9])([A-Z]{2}[0-9]{2}(?:[ -]?[A-Z0-9]{4}){2,6})"
r"((?:[ -]?[A-Z0-9]{4})?)((?:[ -]?[A-Z0-9]{1,3})?)(?![A-Z0-9])"
Key Improvements
- Boundary detection: Word boundaries (\b) replaced with negative lookahead/lookbehind to prevent mid-IBAN matching
- Consistent grouping: Fixed 4-character groups instead of variable 3-5 character groups
- Validation fallback: 3 capture groups enable trying progressively shorter matches when validation fails (e.g., rejecting trailing " X" after valid IBAN)
- Documentation: Added inline comments explaining pattern structure and fallback mechanism
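The validation fallback can be sketched roughly as follows. This is an illustration, not the recognizer's actual code: it assumes a plain mod-97 checksum as a stand-in for validate_result (which applies extra checks), it compiles the pattern without any regex flags, and the helper names mod97_ok and find_iban are hypothetical.

```python
import re

# "After" pattern from this PR, compiled with no flags (an assumption).
IBAN_PATTERN = re.compile(
    r"(?<![A-Z0-9])([A-Z]{2}[0-9]{2}(?:[ -]?[A-Z0-9]{4}){2,6})"
    r"((?:[ -]?[A-Z0-9]{4})?)((?:[ -]?[A-Z0-9]{1,3})?)(?![A-Z0-9])"
)


def mod97_ok(candidate: str) -> bool:
    """Hypothetical stand-in for validate_result: ISO 13616 mod-97 check only."""
    compact = re.sub(r"[ -]", "", candidate)
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the checksum requires.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_iban(text: str):
    """Return the longest candidate that passes the checksum, or None."""
    for match in IBAN_PATTERN.finditer(text):
        g1, g2, g3 = match.groups()  # optional groups are "" when unused
        # Try the full match first, then progressively shorter candidates.
        for candidate in (g1 + g2 + g3, g1 + g2, g1):
            if mod97_ok(candidate):
                return candidate
    return None


print(find_iban("VG96 VPVG 0000 0123 4567 8901 X"))
print(find_iban("BH67 BMAG 0000 1299 1234 56"))
```

In the first call the regex alone captures " X" in the third group; the full candidate fails the checksum, so the sketch falls back to the first two groups. In the second call the full match, including " 56", passes on the first attempt.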
Behavior
# Correctly excludes trailing non-IBAN characters
"VG96 VPVG 0000 0123 4567 8901 X" # Matches IBAN only, not " X"
"DE89370400440532013000 2" # Matches IBAN only, not " 2"
# Still matches valid short segments
"BH67 BMAG 0000 1299 1234 56" # Matches including " 56"
Issue reference
Issue tracking handled separately
Checklist
- [x] I have reviewed the contribution guidelines
- [x] I have signed the CLA (if required)
- [x] My code includes unit tests
- [x] All unit tests and lint checks pass locally
- [x] My PR contains documentation updates / additions if required
Original prompt
validate_result can return false on some edge cases with extra checks
The user has attached the following files from their workspace:
- presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py
TITLE: IBAN Regex Pattern Fix for Trailing Character Matching
USER INTENT: Fix a bug where the IBAN regex pattern incorrectly matches trailing characters (like 'X') after a space, causing checksum validation to fail.
TASK DESCRIPTION: The user identified that in the string 'VG96VPVG0000012345678901 X', the X character was being incorrectly matched as part of the IBAN, which then fails checksum validation. The regex pattern needed to be modified to properly handle IBAN boundaries when followed by spaces and single characters.
EXISTING:
- /Users/shhart/dev/presidio/presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py - IBAN recognizer implementation with regex pattern
- /Users/shhart/dev/presidio/presidio-analyzer/tests/test_iban_recognizer.py - Test file with many commented-out test cases (lines 330-372)
PENDING:
- The validate_result method can return false on some edge cases with extra checks (user's final note)
- Many test cases in the test file remain commented out and need to be enabled/verified
CODE STATE: The IBAN regex pattern was updated from various iterations to:
PATTERNS = [
    Pattern(
        "IBAN Generic",
        r"(?<![A-Z0-9])([A-Z]{2}\d{2}(?:[ -]?[A-Z0-9]{4}){2,7}[A-Z0-9]{0,3})(?![A-Z0-9])",
        0.5,
    ),
]
RELEVANT CODE/DOCUMENTATION SNIPPETS: Key pattern components:
- (?<![A-Z0-9]) - negative lookbehind ensures we don't start mid-IBAN
- [A-Z]{2}\d{2} - country code (2 letters) + check digits (2 numbers)
- (?:[ -]?[A-Z0-9]{4}){2,7} - 2-7 groups of 4 alphanumerics with optional space/dash prefix
- [A-Z0-9]{0,3} - trailing 0-3 alphanumerics with NO leading separator (the key fix)
- (?![A-Z0-9]) - negative lookahead ensures we don't end mid-IBAN
Test case that prompted the fix (lines 337-341):
(
    "this is an iban VG96 VPVG 0000 0123 4567 8901 X in a sentence",
    1,
    ((16, 45),),
),
OTHER NOTES:
- The fix prevents matching X because the trailing [A-Z0-9]{0,3} does NOT allow a leading separator
- The pattern should handle: IBANs without separators, with spaces, with dashes, in sentences, multiple IBANs, and IBANs followed by unrelated single characters
- There are warnings about unused imports (RecognizerResult, NlpArtifacts) and commented code in the file
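
The single-group pattern recorded in this prompt summary can be checked against the quoted test case with a short script. This is a sketch under assumptions: the pattern is compiled with plain re and no flags, and first_match is a hypothetical helper, not part of the recognizer.

```python
import re

# Single-group pattern quoted under CODE STATE above (no regex flags assumed).
SINGLE_GROUP = re.compile(
    r"(?<![A-Z0-9])([A-Z]{2}\d{2}(?:[ -]?[A-Z0-9]{4}){2,7}[A-Z0-9]{0,3})(?![A-Z0-9])"
)


def first_match(text: str):
    """Return (span, matched_text) for the first hit, or None if nothing matches."""
    match = SINGLE_GROUP.search(text)
    if match is None:
        return None
    return match.span(1), match.group(1)


sentence = "this is an iban VG96 VPVG 0000 0123 4567 8901 X in a sentence"
print(first_match(sentence))  # ((16, 45), 'VG96 VPVG 0000 0123 4567 8901')
```

Because the trailing [A-Z0-9]{0,3} takes no separator, this variant also stops before a separated short tail: on "BH67 BMAG 0000 1299 1234 56" it matches only up to "1234". That may be why the final pattern in this PR uses separate capture groups with a validation fallback instead.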
Coverage report (presidio-anonymizer)
This PR does not seem to contain any modification to coverable code.
Coverage report (presidio-structured)
This PR does not seem to contain any modification to coverable code.
Coverage report (presidio-cli)
This PR does not seem to contain any modification to coverable code.
Coverage report (presidio-image-redactor)
This PR does not seem to contain any modification to coverable code.
Coverage report (presidio-analyzer)
This report was generated by python-coverage-comment-action