regex defined in ad_hoc_recognizers is always case-insensitive?
Describe the bug The regex defined in ad_hoc_recognizers is always case-insensitive?
To Reproduce Start the latest analyzer docker image.
Send this request to endpoint:
{
"text": "1100 AA / 1000 aa",
"language": "en",
"ad_hoc_recognizers": [
{
"name": "Dutch postcode recognizer",
"supported_language": "en",
"patterns": [
{
"name": "Dutch PostCode",
"regex": "([1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2})",
"score": 1.0
}
],
"context": [
"postcode"
],
"supported_entity": "NL_POSTCODE"
}
]
}
The result is:
[
{
"analysis_explanation": null,
"end": 7,
"entity_type": "NL_POSTCODE",
"recognition_metadata": {
"recognizer_identifier": "Dutch postcode recognizer_140518657068288",
"recognizer_name": "Dutch postcode recognizer"
},
"score": 1.0,
"start": 0
},
{
"analysis_explanation": null,
"end": 17,
"entity_type": "NL_POSTCODE",
"recognition_metadata": {
"recognizer_identifier": "Dutch postcode recognizer_140518657068288",
"recognizer_name": "Dutch postcode recognizer"
},
"score": 1.0,
"start": 10
}
]
Expected behavior I would only expect the first postcode to be matched:
[
{
"analysis_explanation": null,
"end": 7,
"entity_type": "NL_POSTCODE",
"recognition_metadata": {
"recognizer_identifier": "Dutch postcode recognizer_140518657068288",
"recognizer_name": "Dutch postcode recognizer"
},
"score": 1.0,
"start": 0
}
]
Looking at the source code, it appears you can set "global_regex_flags" on the recognizer to a different value:
"ad_hoc_recognizers": [
{
"name": "Dutch postcode recognizer",
"supported_language": "nl",
"global_regex_flags": 24,
"patterns": [
{
"name": "Dutch PostCode",
"regex": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
"score": 1.0
}
],
"deny_list": null,
"context": [
"postcode"
],
"supported_entity": "NL_POSTCODE"
}
],
Can you please add this to the API doc?
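For anyone who wants to try this outside Postman, here is a minimal Python sketch that builds the same request body with an optional "global_regex_flags" value. The helper names, the base URL, and the /analyze path are assumptions about a local deployment, not something confirmed in this thread, so adjust them to your setup.

```python
import json
import urllib.request

# Regex copied from the request above (raw string, so no JSON escaping here).
POSTCODE_REGEX = r"\b[1-9][0-9]{3}\s?(?!SA|SD|SS)[A-Z]{2}\b"


def build_analyze_request(text, language, regex_flags=None):
    """Build the request body; add global_regex_flags only when given."""
    recognizer = {
        "name": "Dutch postcode recognizer",
        "supported_language": language,
        "patterns": [
            {"name": "Dutch PostCode", "regex": POSTCODE_REGEX, "score": 1.0}
        ],
        "context": ["postcode"],
        "supported_entity": "NL_POSTCODE",
    }
    if regex_flags is not None:
        recognizer["global_regex_flags"] = regex_flags
    return {"text": text, "language": language, "ad_hoc_recognizers": [recognizer]}


def post_analyze(payload, base_url="http://localhost:3000"):
    """POST the payload. base_url and the /analyze path are assumptions."""
    request = urllib.request.Request(
        base_url + "/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_analyze_request("1100 AA / 1000 aa", "nl", regex_flags=24)
print(json.dumps(payload, indent=2))
```

Calling post_analyze(payload) against a running analyzer should then show whether the per-recognizer flag is honored.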
For reference, these are the defaults
https://github.com/microsoft/presidio/blob/1971b827b2a8887c5ce4ed75bedb0e0fe218423f/presidio-analyzer/presidio_analyzer/pattern_recognizer.py#L42
The global_regex_flags parameter isn't part of the request. It can be modified when running the app: https://github.com/microsoft/presidio/blob/1971b827b2a8887c5ce4ed75bedb0e0fe218423f/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml#L3
In docker, you can pass a custom recognizers registry configuration: https://github.com/microsoft/presidio/blob/1971b827b2a8887c5ce4ed75bedb0e0fe218423f/presidio-analyzer/Dockerfile#L6
Hello @bvenn / @omri374, thanks for the reply. However, I already know about the flags and default_recognizers.yaml.
When testing the latest official Docker image, though, the observed behavior is different. The default Docker image uses the value 26.
When I send a request like this in Postman, there are 2 results because the regex is using 26.
But when I change the recognizer to use the value 24, the behavior is as expected: the new request property is honored, and there is only 1 result.
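To make the two numbers concrete: 26 is DOTALL (16) + MULTILINE (8) + IGNORECASE (2), and 24 is the same without IGNORECASE. The sketch below reproduces the difference locally. It uses Python's standard `re` module purely for illustration, on the assumption that the flag values behave the same way inside the analyzer.

```python
import re

# 16 | 8 | 2 == 26: the default value reported in this thread.
DEFAULT_FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE
# 16 | 8 == 24: the same flags without IGNORECASE.
CASE_SENSITIVE_FLAGS = re.DOTALL | re.MULTILINE

PATTERN = r"\b[1-9][0-9]{3}\s?(?!SA|SD|SS)[A-Z]{2}\b"
TEXT = "1100 AA / 1000 aa"


def find_spans(flags):
    """Return (start, end) for every match of PATTERN in TEXT."""
    return [match.span() for match in re.finditer(PATTERN, TEXT, flags)]


print(int(DEFAULT_FLAGS), find_spans(DEFAULT_FLAGS))
# 26 [(0, 7), (10, 17)]
print(int(CASE_SENSITIVE_FLAGS), find_spans(CASE_SENSITIVE_FLAGS))
# 24 [(0, 7)]
```

The spans line up with the "start"/"end" values in the results posted above.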
Apologies for the delay. Can you clarify the issue? In your first example (with flags=26), you're getting two answers, and in the second (with flags=24) you're getting one answer. Is this not the expected behavior?
@omri374
I'll try to explain.
:one:
Can you clarify the issue? In your first example (with flags=26), you're getting two answers, and in the second (with flags=24) you're getting one answer. Is this not the expected behavior?
This is indeed the correct behavior. However, the question remains why the regex is case-insensitive by default.
:two:
Another observation remains: the flag seems to be part of the request, but it is not described in the API documentation?
:three:
Another issue remains. I tried your suggestion:
However, this does not seem to work; there is no global way to override the flags with a different value.
Hi. On 1: mainly to reduce the number of false negatives. On 2: yes, the API docs are not up to date in this case. On 3: could you please provide more details on what you tried?
I set this in the default_recognizers.yml:
supported_languages:
- en
- nl
global_regex_flags: 24
This can be reproduced with the following steps (see this repo: https://github.com/StefH/presidio-docker-test):
1] clone https://github.com/StefH/presidio-docker-test
2] build-docker.ps1
3] docker run -d -p 5111:3000 sheyenrath/presidio-analyzer-test:latest
4] Post in Postman
{
"text": "1200 AA ; 1200 aa",
"language": "nl",
"return_decision_process": true,
"ad_hoc_recognizers": [
{
"name": "Dutch postcode recognizer",
"supported_language": "nl",
"patterns": [
{
"name": "Dutch PostCode",
"regex": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
"score": 1.0
}
],
"deny_list": null,
"context": [
"postcode"
],
"supported_entity": "NL_POSTCODE"
}
]
}
5] The result includes 2 matches, but only 1 is expected: only 1200 AA should match, because the regex allows uppercase letters only.
[
{
"analysis_explanation": {
"original_score": 1.0,
"pattern": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
"pattern_name": "Dutch PostCode",
"recognizer": "Dutch postcode recognizer",
"regex_flags": 26,
"score": 1.0,
"score_context_improvement": 0,
"supportive_context_word": "",
"textual_explanation": "Detected by `Dutch postcode recognizer` using pattern `Dutch PostCode`",
"validation_result": null
},
"end": 7,
"entity_type": "NL_POSTCODE",
"score": 1.0,
"start": 0
},
{
"analysis_explanation": {
"original_score": 1.0,
"pattern": "\\b[1-9][0-9]{3}\\s?(?!SA|SD|SS)[A-Z]{2}\\b",
"pattern_name": "Dutch PostCode",
"recognizer": "Dutch postcode recognizer",
"regex_flags": 26,
"score": 1.0,
"score_context_improvement": 0,
"supportive_context_word": "",
"textual_explanation": "Detected by `Dutch postcode recognizer` using pattern `Dutch PostCode`",
"validation_result": null
},
"end": 17,
"entity_type": "NL_POSTCODE",
"score": 1.0,
"start": 10
}
]
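Since "return_decision_process": true exposes "regex_flags" in each "analysis_explanation", the response itself shows which flags were applied. Below is a small sketch for checking that programmatically. The helper names are hypothetical, and the sample is trimmed from the response above.

```python
import re


def flags_used(results):
    """Collect the distinct regex_flags values reported by the analyzer."""
    used = set()
    for result in results:
        # analysis_explanation is null unless return_decision_process is true.
        explanation = result.get("analysis_explanation") or {}
        if "regex_flags" in explanation:
            used.add(explanation["regex_flags"])
    return used


def is_case_insensitive(flags):
    """True when the IGNORECASE bit (value 2) is set in the flags integer."""
    return bool(flags & re.IGNORECASE)


# Trimmed from the response above; only the fields used here are kept.
sample = [
    {"analysis_explanation": {"regex_flags": 26}, "start": 0, "end": 7},
    {"analysis_explanation": {"regex_flags": 26}, "start": 10, "end": 17},
]

for flags in sorted(flags_used(sample)):
    print(flags, "case-insensitive" if is_case_insensitive(flags) else "case-sensitive")
```

If the custom YAML had been picked up, the reported value would be expected to be 24 rather than 26.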
Are you passing the path of default_recognizers as an environment variable?
https://github.com/microsoft/presidio/blob/904add5bb4b5714cf85a3f8a665b796dcd554cba/presidio-analyzer/Dockerfile#L6
https://github.com/microsoft/presidio/blob/904add5bb4b5714cf85a3f8a665b796dcd554cba/presidio-analyzer/app.py#L41
I've looked at the build-docker.ps1 but I'm not sure it replaces this with the default one.
I'm using the original repository and the default Dockerfile.
It's still unclear whether Presidio is using the default YAML or yours. A simple way to test this is to pass an invalid YAML file. If Presidio loads as usual, it's not picking up your YAML, and you should pass the path to the YAML file during Docker build.