
[BUG] Incorrect Behavior of Obfuscate Processor with Predefined Pattern "%{CREDIT_CARD_NUMBER}"

Open anudasari20 opened this issue 1 year ago • 4 comments

Describe the bug

When the obfuscate processor in an OSI pipeline is configured with the predefined pattern "%{CREDIT_CARD_NUMBER}", it is expected to mask only credit card information in logs and leave non-personally identifiable information (non-PII) fields untouched. In our current environment, however, the processor also masks non-PII fields such as trackingId and sdsStayGuid. This complicates troubleshooting for application teams because critical data points become obscured.

Attaching some screenshots where the data has been masked: [image] [image]

Expected behavior

The patterns configuration option should give users a predefined set of obfuscation patterns for common fields that work without errors. Specifically, with the predefined pattern "%{CREDIT_CARD_NUMBER}", the obfuscate processor should mask only credit card values within logs and should not obscure other field values that merely resemble credit card patterns.

The trackingId values should not be masked the way they are in this screenshot: [image]

Resolution: To rectify this issue, the implementation of the obfuscate processor requires refinement. The processor should be updated to accurately discern and mask solely credit card numbers within logs, adhering strictly to the predefined "%{CREDIT_CARD_NUMBER}" pattern. This necessitates a thorough review and potential adjustment of the pattern matching algorithm employed by the processor. Furthermore, comprehensive testing is essential to validate the updated processor's efficacy across diverse log scenarios, ensuring that it effectively safeguards credit card information while preserving the integrity of non-PII fields.

Steps to Reproduce:

  1. Configure the obfuscate processor within the OSI pipeline, utilizing the predefined pattern "%{CREDIT_CARD_NUMBER}".
  2. Analyze logs containing a mixture of credit card numbers and non-PII fields.
  3. Observe whether non-PII fields are erroneously masked alongside credit card numbers, impeding the troubleshooting process for application teams.

Example configuration

- obfuscate:
        source: 'data'
        patterns:
          - '%{CREDIT_CARD_NUMBER}'
        action:
          mask:
            mask_character: "&"
            mask_character_length: 10

Environment (please complete the following information):

  • OS: Amazon EC2 - Linux/UNIX
  • Version: AML 2.0

anudasari20 • Mar 26 '24 22:03

Pattern:

https://github.com/opensearch-project/data-prepper/blob/b7c63bc102d27cd3856e5cbafe1dac89775367f4/data-prepper-plugins/obfuscate-processor/src/main/java/org/opensearch/dataprepper/plugins/processor/obfuscation/CommonPattern.java#L12

dlvenable • Apr 02 '24 19:04

Hello @dlvenable, just wanted to check: would modifying the current pattern "(\\d[ -]*?){13,16}" to "\\b(?:\\d[ -]*?){13,16}\\b" help in this particular scenario?

Utkarsh-Aga • Apr 08 '24 06:04

I tested the scenario on my end and observed the following:

Using Pattern - (\\d[ -]*?){13,16}

Input Data                              Output Data
fd55555069-e7a9-11ee4111111111111111    fd55555069-e7a9-11ee##########
4111111111111111                        ##########
fd55555069-e7a9-11ee-91                 fd55555069-e7a9-11ee-91

Using Pattern - \\b(?:\\d[ -]*?){13,16}\\b

Input Data                              Output Data
fd55555069-e7a9-11ee4111111111111111    fd55555069-e7a9-11ee4111111111111111
4111111111111111                        ##########
fd55555069-e7a9-11ee-91                 fd55555069-e7a9-11ee-91

So, based on the above, I feel that we can update the CREDIT_CARD_NUMBER pattern from (\\d[ -]*?){13,16} to \\b(?:\\d[ -]*?){13,16}\\b.
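The comparison above can be reproduced outside Data Prepper with plain java.util.regex. The following is a minimal sketch, not the processor's implementation: the class name CreditCardPatternDemo and the mask helper are hypothetical, and the fixed ten-character "#" replacement is an assumption chosen to mirror the output shown in the tables.

```java
import java.util.regex.Pattern;

class CreditCardPatternDemo {
    // Pattern quoted in this thread as the current CREDIT_CARD_NUMBER definition.
    static final Pattern CURRENT = Pattern.compile("(\\d[ -]*?){13,16}");

    // Proposed variant: same body, non-capturing, wrapped in word boundaries.
    static final Pattern PROPOSED = Pattern.compile("\\b(?:\\d[ -]*?){13,16}\\b");

    // Hypothetical helper: replaces every match with a fixed ten-character mask
    // (an assumption that mirrors the tables, not Data Prepper's code).
    static String mask(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("##########");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "fd55555069-e7a9-11ee4111111111111111",
            "4111111111111111",
            "fd55555069-e7a9-11ee-91"
        };
        for (String input : inputs) {
            // Columns: input, result with CURRENT, result with PROPOSED.
            System.out.println(input + " | " + mask(CURRENT, input) + " | " + mask(PROPOSED, input));
        }
    }
}
```

Running main prints one row per input, with the current pattern's result in the second column and the proposed pattern's result in the third, which makes it easy to add further sample values before changing the predefined pattern.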

@dlvenable - Any comments on this ?

Utkarsh-Aga • Apr 16 '24 09:04

@Utkarsh-Aga, thank you for looking into this.

It seems the root of your solution is to add the word boundary (\b). But, what if there is a concatenation?

e.g.

visa4111111111111111

or

creditcard4111111111111111

I believe this would not match.
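The concatenation concern can be checked directly with java.util.regex. This is a small sketch with hypothetical class and method names. Because a letter and a digit are both word characters, there is no \b position between "visa" and the first digit, so the bounded pattern cannot start a match there.

```java
import java.util.regex.Pattern;

class WordBoundaryConcatenationDemo {
    // Word-boundary variant proposed earlier in this thread.
    static final Pattern BOUNDED = Pattern.compile("\\b(?:\\d[ -]*?){13,16}\\b");

    // Pattern quoted in this thread as the current definition.
    static final Pattern UNBOUNDED = Pattern.compile("(\\d[ -]*?){13,16}");

    // Hypothetical helper: true if the pattern matches anywhere in the input.
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "visa4111111111111111",
            "creditcard4111111111111111",
            "visa 4111111111111111"
        };
        for (String input : inputs) {
            System.out.println(input
                + " bounded=" + found(BOUNDED, input)
                + " unbounded=" + found(UNBOUNDED, input));
        }
    }
}
```

The third input, with a space before the number, is included for contrast: there the bounded pattern does match, since the space supplies a word boundary.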

One option would be to add a configuration in the obfuscate processor itself to allow for word boundaries (e.g. single_word_only). Then any pattern could have this setting.

- obfuscate:
        source: "log"
        target: "new_log"
        single_word_only: true
        patterns:
          - '%{CREDIT_CARD_NUMBER}'
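single_word_only is only a proposal in this thread, so the following is a hypothetical sketch of how such a flag might be applied: when the flag is set, whichever pattern is configured gets wrapped in \b anchors before compilation. The class name, the compile helper, and its parameters are assumptions for illustration and are not Data Prepper's API.

```java
import java.util.regex.Pattern;

class SingleWordOnlySketch {
    // Hypothetical helper: compiles a configured regex, optionally requiring
    // a word boundary on both sides of the whole match.
    static Pattern compile(String regex, boolean singleWordOnly) {
        String effective = singleWordOnly ? "\\b(?:" + regex + ")\\b" : regex;
        return Pattern.compile(effective);
    }

    public static void main(String[] args) {
        // Pattern quoted in this thread as the current CREDIT_CARD_NUMBER definition.
        String creditCard = "(\\d[ -]*?){13,16}";
        String mask = "##########"; // assumed fixed-length mask, for illustration only

        String[] inputs = {"card 4111111111111111 ok", "visa4111111111111111"};
        for (String input : inputs) {
            String strict = compile(creditCard, true).matcher(input).replaceAll(mask);
            String loose = compile(creditCard, false).matcher(input).replaceAll(mask);
            System.out.println(input + " -> strict: " + strict + ", loose: " + loose);
        }
    }
}
```

Wrapping in a non-capturing group keeps the boundaries applied to the whole pattern even if it contains alternation, which is why the sketch does not simply concatenate \b onto each end.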

dlvenable • Apr 24 '24 16:04