
Simplify IBAN regex pattern and fix trailing character handling

Open Copilot opened this issue 3 months ago • 5 comments

Change Description

The IBAN recognizer used a complex regex with 8 capture groups and variable-length groups of 3-5 characters. This PR replaces it with a simpler pattern that uses consistent 4-character groups and 3 capture groups.

Pattern Changes

Before:

r"\b([A-Z]{2}[ \-]?[0-9]{2})(?=(?:[ \-]?[A-Z0-9]){9,30})((?:[ \-]?[A-Z0-9]{3,5}){2})"
r"([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?"
r"([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{1,3})?\b"

After:

r"(?<![A-Z0-9])([A-Z]{2}[0-9]{2}(?:[ -]?[A-Z0-9]{4}){2,6})"
r"((?:[ -]?[A-Z0-9]{4})?)((?:[ -]?[A-Z0-9]{1,3})?)(?![A-Z0-9])"

Key Improvements

  • Boundary detection: Word boundaries (\b) replaced with negative lookahead/lookbehind to prevent mid-IBAN matching
  • Consistent grouping: Fixed 4-character groups instead of variable 3-5 character groups
  • Validation fallback: 3 capture groups enable trying progressively shorter matches when validation fails (e.g., rejecting trailing " X" after valid IBAN)
  • Documentation: Added inline comments explaining pattern structure and fallback mechanism
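
A minimal sketch of how the validation fallback could work with the three capture groups. The regex is copied from the "After" pattern above; `passes_mod97` and `longest_valid_iban` are hypothetical helper names, and the check covers only the mod-97 checksum, not whatever else the recognizer's `validate_result` verifies. The sample strings are the ones this PR lists under Behavior.

```python
import re

# Pattern copied from the "After" section of this PR.
IBAN_PATTERN = re.compile(
    r"(?<![A-Z0-9])([A-Z]{2}[0-9]{2}(?:[ -]?[A-Z0-9]{4}){2,6})"
    r"((?:[ -]?[A-Z0-9]{4})?)((?:[ -]?[A-Z0-9]{1,3})?)(?![A-Z0-9])"
)


def passes_mod97(candidate: str) -> bool:
    """Hypothetical stand-in for validation: mod-97 checksum only."""
    compact = re.sub(r"[ -]", "", candidate)
    # Move country code and check digits to the end, then map A-Z to 10-35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def longest_valid_iban(text: str):
    """Return the longest checksum-valid candidate from the first match, or None."""
    match = IBAN_PATTERN.search(text)
    if match is None:
        return None
    head, optional_group, tail = match.groups()
    # Try the full match first, then drop the short tail, then the optional group.
    for candidate in (head + optional_group + tail, head + optional_group, head):
        if passes_mod97(candidate):
            return candidate
    return None


print(longest_valid_iban("VG96 VPVG 0000 0123 4567 8901 X"))
print(longest_valid_iban("DE89370400440532013000 2"))
print(longest_valid_iban("BH67 BMAG 0000 1299 1234 56"))
```

In the first string the regex itself matches through " X"; it is the fallback that trims the match back to the valid IBAN. In the other two, the full regex match already passes the checksum.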

Behavior

# Correctly excludes trailing non-IBAN characters
"VG96 VPVG 0000 0123 4567 8901 X"  # Matches IBAN only, not " X"
"DE89370400440532013000 2"         # Matches IBAN only, not " 2"

# Still matches valid short segments
"BH67 BMAG 0000 1299 1234 56"      # Matches including " 56"

Issue reference

Issue tracking handled separately

Checklist

  • [x] I have reviewed the contribution guidelines
  • [x] I have signed the CLA (if required)
  • [x] My code includes unit tests
  • [x] All unit tests and lint checks pass locally
  • [x] My PR contains documentation updates / additions if required
Original prompt

validate_result can return false on some edge cases with extra checks

The user has attached the following files from their workspace:

  • presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py

TITLE: IBAN Regex Pattern Fix for Trailing Character Matching

USER INTENT: Fix a bug where the IBAN regex pattern incorrectly matches trailing characters (like 'X') after a space, causing checksum validation to fail.

TASK DESCRIPTION: The user identified that in the string 'VG96VPVG0000012345678901 X', the X character was being incorrectly matched as part of the IBAN, which then fails checksum validation. The regex pattern needed to be modified to properly handle IBAN boundaries when followed by spaces and single characters.
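
To illustrate why the extra X breaks validation, here is a sketch of the standard IBAN mod-97 check. `mod97_remainder` is a hypothetical helper written for this example, not the recognizer's own code, and it ignores country-specific length rules.

```python
import re


def mod97_remainder(candidate: str) -> int:
    """Return the mod-97 remainder of an IBAN-like string; 1 means the checksum holds."""
    compact = re.sub(r"[ -]", "", candidate).upper()
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Digits stay as they are; letters map to 10..35 (A=10, ..., Z=35).
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97


print(mod97_remainder("VG96VPVG0000012345678901") == 1)    # True
print(mod97_remainder("VG96VPVG0000012345678901 X") == 1)  # False
```

Once the X is absorbed into the match, it becomes part of the account number and changes the remainder, so an otherwise valid IBAN is rejected.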

EXISTING:

  • /Users/shhart/dev/presidio/presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py - IBAN recognizer implementation with regex pattern
  • /Users/shhart/dev/presidio/presidio-analyzer/tests/test_iban_recognizer.py - Test file with many commented-out test cases (lines 330-372)

PENDING:

  • The validate_result method can return false on some edge cases with extra checks (user's final note)
  • Many test cases in the test file remain commented out and need to be enabled/verified

CODE STATE: After several iterations, the IBAN regex pattern was updated to:

PATTERNS = [
    Pattern(
        "IBAN Generic",
        r"(?<![A-Z0-9])([A-Z]{2}\d{2}(?:[ -]?[A-Z0-9]{4}){2,7}[A-Z0-9]{0,3})(?![A-Z0-9])",
        0.5,
    ),
]

RELEVANT CODE/DOCUMENTATION SNIPPETS: Key pattern components:

  • (?<![A-Z0-9]) - negative lookbehind ensures we don't start mid-IBAN
  • [A-Z]{2}\d{2} - country code (2 letters) + check digits (2 numbers)
  • (?:[ -]?[A-Z0-9]{4}){2,7} - 2-7 groups of 4 alphanumerics with optional space/dash prefix
  • [A-Z0-9]{0,3} - trailing 0-3 alphanumerics with NO leading separator (the key fix)
  • (?![A-Z0-9]) - negative lookahead ensures we don't end mid-IBAN

Test case that prompted the fix (lines 337-341):

(
    "this is an iban VG96 VPVG 0000 0123 4567 8901 X in a sentence",
    1,
    ((16, 45),),
),
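
The expected span in this test case can be checked directly against the pattern from CODE STATE. This is a standalone sketch using only Python's `re` module, outside the recognizer and without checksum validation; the variable names are illustrative.

```python
import re

# Pattern copied from the CODE STATE snippet above.
IBAN_GENERIC = re.compile(
    r"(?<![A-Z0-9])([A-Z]{2}\d{2}(?:[ -]?[A-Z0-9]{4}){2,7}[A-Z0-9]{0,3})(?![A-Z0-9])"
)

text = "this is an iban VG96 VPVG 0000 0123 4567 8901 X in a sentence"
match = IBAN_GENERIC.search(text)
print(match.span(1))   # (16, 45)
print(match.group(1))  # VG96 VPVG 0000 0123 4567 8901

# A spaced IBAN whose last group is shorter than four characters.
short_tail = IBAN_GENERIC.search("BH67 BMAG 0000 1299 1234 56")
print(short_tail.group(1))  # BH67 BMAG 0000 1299 1234
```

The second result suggests a trade-off of this single-group pattern: because the trailing `[A-Z0-9]{0,3}` takes no separator, a final group such as " 56" is dropped along with a stray " X". That may be one reason the pattern in the PR description uses separate capture groups with a validation fallback instead.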

OTHER NOTES:

  • The fix prevents matching X because the trailing [A-Z0-9]{0,3} does NOT allow a leading separator
  • The pattern should handle: IBANs without separators, with spaces, with dashes, in sentences, multiple IBANs, and IBANs followed by unrelated single characters
  • There are warnings about unused imports (RecognizerResult, NlpArtifacts) and commented code in the file


Copilot • Dec 15 '25 10:12

Coverage report (presidio-anonymizer)

This PR does not seem to contain any modification to coverable code.

github-actions[bot] • Dec 15 '25 11:12

Coverage report (presidio-structured)

This PR does not seem to contain any modification to coverable code.

github-actions[bot] • Dec 15 '25 11:12

Coverage report (presidio-cli)

This PR does not seem to contain any modification to coverable code.

github-actions[bot] • Dec 15 '25 11:12

Coverage report (presidio-image-redactor)

This PR does not seem to contain any modification to coverable code.

github-actions[bot] • Dec 15 '25 11:12

Coverage report (presidio-analyzer)

Columns: File, Statements, Missing, Coverage, Coverage (new stmts), Lines missing
  presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/iban_recognizer.py
  Project Total

This report was generated by python-coverage-comment-action

github-actions[bot] • Dec 15 '25 11:12