
regex defined in ad_hoc_recognizers is always case-insensitive?

Open StefH opened this issue 7 months ago • 11 comments

Describe the bug The regex defined in ad_hoc_recognizers is always case-insensitive?

To Reproduce Start the latest analyzer docker image.

Send this request to endpoint:

{
    "text": "1100 AA / 1000 aa",
    "language": "en",
    "ad_hoc_recognizers": [
        {
            "name": "Dutch postcode recognizer",
            "supported_language": "en",
            "patterns": [
                {
                    "name": "Dutch PostCode",
                    "regex": "([1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2})",
                    "score": 1.0
                }
            ],
            "context": [
                "postcode"
            ],
            "supported_entity": "NL_POSTCODE"
        }
    ]    
}

The result is:

[
    {
        "analysis_explanation": null,
        "end": 7,
        "entity_type": "NL_POSTCODE",
        "recognition_metadata": {
            "recognizer_identifier": "Dutch postcode recognizer_140518657068288",
            "recognizer_name": "Dutch postcode recognizer"
        },
        "score": 1.0,
        "start": 0
    },
    {
        "analysis_explanation": null,
        "end": 17,
        "entity_type": "NL_POSTCODE",
        "recognition_metadata": {
            "recognizer_identifier": "Dutch postcode recognizer_140518657068288",
            "recognizer_name": "Dutch postcode recognizer"
        },
        "score": 1.0,
        "start": 10
    }
]

Expected behavior I would only expect the first postcode to be matched:

[
    {
        "analysis_explanation": null,
        "end": 7,
        "entity_type": "NL_POSTCODE",
        "recognition_metadata": {
            "recognizer_identifier": "Dutch postcode recognizer_140518657068288",
            "recognizer_name": "Dutch postcode recognizer"
        },
        "score": 1.0,
        "start": 0
    }
]

StefH avatar May 28 '25 15:05 StefH

By looking at the source-code, you can apparently define a "global_regex_flags" with a different value.

"ad_hoc_recognizers": [
        {
            "name": "Dutch postcode recognizer",
            "supported_language": "nl",
            "global_regex_flags": 24,
            "patterns": [
                {
                    "name": "Dutch PostCode",
                    "regex": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
                    "score": 1.0
                    
                }
            ],
            "deny_list": null,
            "context": [
                "postcode"
            ],
            "supported_entity": "NL_POSTCODE"
        }
    ],

Can you please add this to the API doc?
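For anyone scripting this instead of using Postman, here is a minimal Python sketch that builds the same request body. The helper names `build_analyze_request` and `post_json` are hypothetical, and the URL in the commented-out call is an assumption about a local setup. The `global_regex_flags` field is the one shown in the request above.

```python
import json
import re
import urllib.request

# Pattern taken from the request above (a raw string, so backslashes stay literal).
POSTCODE_REGEX = r"\b[1-9][0-9]{3}\s?(?!SA|SD|SS)[A-Z]{2}\b"


def build_analyze_request(text, flags):
    """Build a request body with a per-recognizer regex flag override (hypothetical helper)."""
    return {
        "text": text,
        "language": "nl",
        "ad_hoc_recognizers": [
            {
                "name": "Dutch postcode recognizer",
                "supported_language": "nl",
                # Plain integer, e.g. 24 = DOTALL | MULTILINE (no IGNORECASE).
                "global_regex_flags": int(flags),
                "patterns": [
                    {"name": "Dutch PostCode", "regex": POSTCODE_REGEX, "score": 1.0}
                ],
                "context": ["postcode"],
                "supported_entity": "NL_POSTCODE",
            }
        ],
    }


def post_json(url, body):
    """POST a JSON body and return the decoded JSON response (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


body = build_analyze_request("1100 AA / 1000 aa", re.DOTALL | re.MULTILINE)
print(json.dumps(body, indent=2))

# Assumed URL; adjust host, port and path to your own container before enabling.
# results = post_json("http://localhost:5111/analyze", body)
```

The network call is left commented out so the sketch runs without a container. Only the body construction is exercised.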

StefH avatar May 29 '25 22:05 StefH

For reference, these are the defaults

https://github.com/microsoft/presidio/blob/1971b827b2a8887c5ce4ed75bedb0e0fe218423f/presidio-analyzer/presidio_analyzer/pattern_recognizer.py#L42

bvenn avatar Jun 04 '25 05:06 bvenn

The global_regex_flags parameter isn't part of the request. It can be modified when running the app: https://github.com/microsoft/presidio/blob/1971b827b2a8887c5ce4ed75bedb0e0fe218423f/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml#L3

In docker, you can pass a custom recognizers registry configuration: https://github.com/microsoft/presidio/blob/1971b827b2a8887c5ce4ed75bedb0e0fe218423f/presidio-analyzer/Dockerfile#L6

omri374 avatar Jun 04 '25 07:06 omri374

Hello @bvenn / @omri374, thanks for the reply. However, I already know about the flags and the default_recognizers.yaml.

However when testing the latest official Docker image, the behavior observed is different. The default Docker image has value 26.

And when I send a request like this in Postman, there are 2 results because the regex is using 26.

Image

But when I change the recognizer to use value 24, the behavior is as expected: the new request property is used, and there is only 1 result because 24 is used.

Image
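The difference between 26 and 24 can be reproduced outside Presidio with Python's standard `re` module, where `IGNORECASE` is 2, `MULTILINE` is 8 and `DOTALL` is 16. This is only a sketch of the flag arithmetic, not of Presidio's internals. The helper name `find_spans` is hypothetical.

```python
import re

PATTERN = r"\b[1-9][0-9]{3}\s?(?!SA|SD|SS)[A-Z]{2}\b"
TEXT = "1100 AA / 1000 aa"

# 26 = 16 + 8 + 2, which includes IGNORECASE.
default_flags = re.DOTALL | re.MULTILINE | re.IGNORECASE
# 24 = 16 + 8, the same flags without IGNORECASE.
case_sensitive_flags = re.DOTALL | re.MULTILINE


def find_spans(flags):
    """Return (start, end) for every match of PATTERN in TEXT (hypothetical helper)."""
    return [(m.start(), m.end()) for m in re.finditer(PATTERN, TEXT, flags)]


print(int(default_flags), find_spans(default_flags))
print(int(case_sensitive_flags), find_spans(case_sensitive_flags))
```

With 26, `[A-Z]` also matches `aa`, so both postcodes are found. With 24, only the uppercase one is.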

StefH avatar Jun 04 '25 10:06 StefH

Apologies for the delay. Can you clarify the issue? In your first example (with flags=26), you're getting two answers, and in the second (with flags=24) you're getting one answer. Is this not the expected behavior?

omri374 avatar Jun 19 '25 11:06 omri374

@omri374

I'll try to explain.

:one:

Can you clarify the issue? In your first example (with flags=26), you're getting two answers, and in the second (with flags=24) you're getting one answer. Is this not the expected behavior?

This is indeed correct behavior. However the question remains why the decision is that the regex is case-insensitive by default.

:two:

Another observation still stands: the flag seems to be accepted as part of the request, but it is not described in the API documentation. Image

:three:

Another issue still remains. I tried your suggestion: Image However, this does not seem to work. There is no global way to override the flags with a different value.

StefH avatar Jun 19 '25 11:06 StefH

Hi,

On 1: mainly to reduce the number of false negatives.

On 2: yes, the API docs are not up to date in this case.

On 3: could you please provide more details on what you tried?
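As an aside on point 1: when the case-insensitive default cannot be changed, one possible workaround is to switch case-insensitivity off inside the pattern itself with a scoped inline flag, `(?-i:...)`. The sketch below uses Python's standard `re` module, which supports this syntax. Whether the regex engine inside a given Presidio build accepts the same syntax is an assumption and should be verified. The helper name `find_spans` is hypothetical.

```python
import re

# The letter part is wrapped in (?-i:...), which disables IGNORECASE for that group only.
SCOPED_PATTERN = r"\b[1-9][0-9]{3}\s?(?-i:(?!SA|SD|SS)[A-Z]{2})\b"
TEXT = "1200 AA ; 1200 aa"

# Same value as the default discussed in this thread (26), IGNORECASE included.
DEFAULT_FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE


def find_spans(pattern, text, flags):
    """Return (start, end) for every match of pattern in text (hypothetical helper)."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, text, flags)]


print(find_spans(SCOPED_PATTERN, TEXT, DEFAULT_FLAGS))
```

The global flags stay at their default here. Only the two-letter suffix is matched case-sensitively.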

omri374 avatar Jun 22 '25 05:06 omri374

I set this in the default_recognizers.yml:

supported_languages: 
  - en
  - nl
global_regex_flags: 24

This can be seen if you follow these steps: See this repo: https://github.com/StefH/presidio-docker-test

1] clone https://github.com/StefH/presidio-docker-test

2] build-docker.ps1

3] docker run -d -p 5111:3000 sheyenrath/presidio-analyzer-test:latest

4] Post in Postman

{
    "text": "1200 AA ; 1200 aa",
    "language": "nl",
    "return_decision_process": true,
    "ad_hoc_recognizers": [
        {
            "name": "Dutch postcode recognizer",
            "supported_language": "nl",
            "patterns": [
                {
                    "name": "Dutch PostCode",
                    "regex": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
                    "score": 1.0
                }
            ],
            "deny_list": null,
            "context": [
                "postcode"
            ],
            "supported_entity": "NL_POSTCODE"
        }
    ]
}

5] The result includes 2 matches, but only 1 should be there: only 1200 AA should match, because with flags 24 the regex is case-sensitive.

[
    {
        "analysis_explanation": {
            "original_score": 1.0,
            "pattern": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
            "pattern_name": "Dutch PostCode",
            "recognizer": "Dutch postcode recognizer",
            "regex_flags": 26,
            "score": 1.0,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Detected by `Dutch postcode recognizer` using pattern `Dutch PostCode`",
            "validation_result": null
        },
        "end": 7,
        "entity_type": "NL_POSTCODE",
        "score": 1.0,
        "start": 0
    },
    {
        "analysis_explanation": {
            "original_score": 1.0,
            "pattern": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
            "pattern_name": "Dutch PostCode",
            "recognizer": "Dutch postcode recognizer",
            "regex_flags": 26,
            "score": 1.0,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Detected by `Dutch postcode recognizer` using pattern `Dutch PostCode`",
            "validation_result": null
        },
        "end": 17,
        "entity_type": "NL_POSTCODE",
        "score": 1.0,
        "start": 10
    }
]

StefH avatar Jun 22 '25 05:06 StefH

Are you passing the path of default_recognizers as an environment variable?

https://github.com/microsoft/presidio/blob/904add5bb4b5714cf85a3f8a665b796dcd554cba/presidio-analyzer/Dockerfile#L6

https://github.com/microsoft/presidio/blob/904add5bb4b5714cf85a3f8a665b796dcd554cba/presidio-analyzer/app.py#L41

I've looked at the build-docker.ps1 but I'm not sure it replaces this with the default one.

omri374 avatar Jun 22 '25 10:06 omri374

I'm using the original repository and the default Dockerfile.

StefH avatar Jun 22 '25 11:06 StefH

It's still unclear if Presidio is using the default YAML or yours. A simple way to test this is to pass an invalid yaml. If Presidio loads as usual, it's not picking up your YAML, and you should pass the path to the yaml during Docker build.

omri374 avatar Jun 23 '25 11:06 omri374