data-prepper
[BUG] Incorrect Behavior of Obfuscate Processor with Predefined Pattern "%{CREDIT_CARD_NUMBER}"
Describe the bug
The issue arises when using the predefined pattern "%{CREDIT_CARD_NUMBER}" with the obfuscate processor in an OSI pipeline. The processor is expected to mask only credit card information in logs and leave non-personally identifiable information (non-PII) fields untouched. In our environment, however, the obfuscate processor also masks non-PII fields such as trackingId and sdsStayGuid. This makes troubleshooting harder for application teams because critical data points are obscured.
Attaching some screenshots where the data has been masked:
Expected behavior
When using the patterns configuration option, users expect to rely on a predefined set of obfuscation patterns for common fields. Specifically, the obfuscate processor should apply the predefined pattern "%{CREDIT_CARD_NUMBER}" without errors, masking only credit card values in logs and leaving alone other field values that merely resemble credit card patterns.
The trackingIds should not be masked as they are in this screenshot:
Resolution: To rectify this issue, the implementation of the obfuscate processor requires refinement. The processor should be updated to accurately discern and mask solely credit card numbers within logs, adhering strictly to the predefined "%{CREDIT_CARD_NUMBER}" pattern. This necessitates a thorough review and potential adjustment of the pattern matching algorithm employed by the processor. Furthermore, comprehensive testing is essential to validate the updated processor's efficacy across diverse log scenarios, ensuring that it effectively safeguards credit card information while preserving the integrity of non-PII fields.
Steps to Reproduce:
- Configure the obfuscate processor within the OSI pipeline, utilizing the predefined pattern "%{CREDIT_CARD_NUMBER}".
- Analyze logs containing a mixture of credit card numbers and non-PII fields.
- Observe whether non-PII fields are erroneously masked alongside credit card numbers, impeding the troubleshooting process for application teams.
Example configuration
- obfuscate:
    source: 'data'
    patterns:
      - '%{CREDIT_CARD_NUMBER}'
    action:
      mask:
        mask_character: "&"
        mask_character_length: 10
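To make the reported symptom concrete, here is a minimal sketch in Python rather than the processor's own Java. It assumes the processor replaces each regex match with a fixed-length run of the mask character, and it uses the current pattern quoted later in this issue. The log line, the `trackingId` value, and the `mask` helper are hypothetical and only illustrate how an all-digit, hyphen-separated identifier can be caught by the pattern.

```python
import re

# Python rendering of the current CREDIT_CARD_NUMBER pattern quoted in this issue.
CREDIT_CARD_NUMBER = re.compile(r"(\d[ -]*?){13,16}")


def mask(text: str, mask_character: str = "&", mask_character_length: int = 10) -> str:
    """Replace every pattern match with a fixed-length run of the mask character.

    Hypothetical helper: an approximation of the mask action, not the processor's code.
    """
    replacement = mask_character * mask_character_length
    return CREDIT_CARD_NUMBER.sub(lambda _match: replacement, text)


# Hypothetical log fragment: the tracking ID is not a card number, but it
# contains runs of 13 to 16 digits separated only by hyphens.
line = "trackingId=12345678-1234-1234-1234-123456789012 card=4111111111111111"
print(mask(line))
# prints: trackingId=&&&&&&&&&&-&&&&&&&&&& card=&&&&&&&&&&
```

In this sketch the identifier is masked in two pieces because the pattern stops after 16 digits and then matches again on the remaining digits.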
Environment (please complete the following information):
- OS: Amazon EC2 - Linux/UNIX
- Version: AML 2.0
Additional context
Pattern:
https://github.com/opensearch-project/data-prepper/blob/b7c63bc102d27cd3856e5cbafe1dac89775367f4/data-prepper-plugins/obfuscate-processor/src/main/java/org/opensearch/dataprepper/plugins/processor/obfuscation/CommonPattern.java#L12
Hello @dlvenable,
Just wanted to check: would modifying the current pattern "(\\d[ -]*?){13,16}" to "\\b(?:\\d[ -]*?){13,16}\\b" help in this particular scenario?
I tested the scenario on my end and observed the following:
Using Pattern - (\\d[ -]*?){13,16}
| Input Data | Output Data |
|---|---|
| fd55555069-e7a9-11ee4111111111111111 | fd55555069-e7a9-11ee########## |
| 4111111111111111 | ########## |
| fd55555069-e7a9-11ee-91 | fd55555069-e7a9-11ee-91 |
Using Pattern - \\b(?:\\d[ -]*?){13,16}\\b
| Input Data | Output Data |
|---|---|
| fd55555069-e7a9-11ee4111111111111111 | fd55555069-e7a9-11ee4111111111111111 |
| 4111111111111111 | ########## |
| fd55555069-e7a9-11ee-91 | fd55555069-e7a9-11ee-91 |
So, based on the above, I feel that we can update the CREDIT_CARD_NUMBER pattern from (\\d[ -]*?){13,16} to \\b(?:\\d[ -]*?){13,16}\\b.
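As a rough cross-check of the two tables above, the comparison can be sketched with Python's `re` module instead of the processor itself. The assumptions here are that Python's `\d` and `\b` behave like Java's for these ASCII inputs, and that each match is replaced by ten `#` characters. The `obfuscate` helper is hypothetical.

```python
import re

# The two patterns under discussion, written as Python raw strings
# (the Java source doubles the backslashes).
CURRENT = re.compile(r"(\d[ -]*?){13,16}")
PROPOSED = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def obfuscate(pattern: re.Pattern, text: str, mask: str = "#" * 10) -> str:
    """Replace every match of `pattern` in `text` with a fixed mask (hypothetical helper)."""
    return pattern.sub(lambda _match: mask, text)


samples = [
    "fd55555069-e7a9-11ee4111111111111111",
    "4111111111111111",
    "fd55555069-e7a9-11ee-91",
]

for sample in samples:
    print(sample, "|", obfuscate(CURRENT, sample), "|", obfuscate(PROPOSED, sample))
```

The only row that differs between the two patterns is the first one, where the digits directly follow other word characters.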
@dlvenable - any comments on this?
@Utkarsh-Aga, thank you for looking into this.
It seems the root of your solution is to add the word boundary (\b). But, what if there is a concatenation?
e.g.
visa4111111111111111
or
creditcard4111111111111111
I believe this would not match.
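A quick way to check this concern is the sketch below, again using Python's `re` as a stand-in for the Java regex engine (an assumption; for these ASCII inputs `\b` behaves the same way). There is no word boundary between a letter and a digit, so the bounded pattern finds nothing in the concatenated inputs.

```python
import re

CURRENT = re.compile(r"(\d[ -]*?){13,16}")
PROPOSED = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

concatenated = ["visa4111111111111111", "creditcard4111111111111111"]

for text in concatenated:
    # The current pattern finds the card number; the bounded one does not,
    # because letters and digits are both word characters.
    print(text, "| current:", bool(CURRENT.search(text)),
          "| proposed:", bool(PROPOSED.search(text)))

# With a separator between the label and the number, both patterns match.
spaced = "visa 4111111111111111"
print(spaced, "| proposed:", bool(PROPOSED.search(spaced)))
```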
One option would be to add a configuration in the obfuscate processor itself to allow for word boundaries (e.g. single_word_only). Then any pattern could have this setting.
- obfuscate:
    source: "log"
    target: "new_log"
    single_word_only: true
    patterns:
      - '%{CREDIT_CARD_NUMBER}'
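One way such an option could work, sketched in Python under stated assumptions: `single_word_only` is only the name proposed in this comment, not an existing setting, and `compile_pattern` is a hypothetical helper. The idea is to leave the predefined pattern unchanged and wrap it in word boundaries only when the flag is set.

```python
import re

# Current predefined pattern, as quoted earlier in this issue.
CREDIT_CARD_NUMBER = r"(\d[ -]*?){13,16}"


def compile_pattern(regex: str, single_word_only: bool = False) -> re.Pattern:
    """Compile `regex`, optionally requiring word boundaries on both sides.

    Hypothetical helper illustrating the proposed `single_word_only` flag.
    """
    if single_word_only:
        # The non-capturing group keeps the boundaries outside any alternation
        # in the wrapped pattern.
        regex = r"\b(?:" + regex + r")\b"
    return re.compile(regex)


loose = compile_pattern(CREDIT_CARD_NUMBER)
strict = compile_pattern(CREDIT_CARD_NUMBER, single_word_only=True)

for text in ("fd55555069-e7a9-11ee4111111111111111", "card 4111111111111111"):
    print(text, "| loose:", bool(loose.search(text)), "| strict:", bool(strict.search(text)))
```

Because the wrapping happens at compile time, the same flag would apply to any pattern, predefined or user-supplied, and pipelines that rely on matching concatenated values such as `visa4111111111111111` could leave it off.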