phileas Reduce confidence when credit card spans are bordered by dashes

Fixes for #120 where credit cards are identified in UUID strings:

Portions of a Java UUID masked as a credit card number (input, then filtered output):
{ "query":" { quote(account_token:\"47223179-9330-4259-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}
{ "query":" { quote(account_token:\"******************-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}

Prior to this PR, Phileas calculated a confidence of 0.9 for the match above. With this PR, it calculates a confidence of 0.6 instead, so when a condition such as confidence > 0.7 is set, the span is not redacted.

Aug 16 '24 00:08 robfromboulder

Hi @jzonthemtn -- apologies for the novel but thought a few notes would help

Please start review with test cases, since a miss there affects everything 😀

I picked the reduced confidence values of 0.5 and 0.6 after looking through some of the other filter implementations and making a best guess.

The one tricky bit here is the order of operations -- using findSpans to get the spans also evaluates any conditions, but then I'm modifying confidence after findSpans has run. I used a little glue code here to fix the applied field since that was good enough for the test cases, but I'm not sure how you feel about this level of hackery.

If findSpans had an optional visitor callback, allowing the custom confidence value to be computed at the right moment, that might be ideal. A callback interface with (String input, Span span) would be sufficient to return control before the conditions are evaluated (and only once). Or something like that?
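A minimal sketch of what such a callback could look like. Everything here is hypothetical: `SpanVisitor`, `SimpleSpan`, and this `findSpans` signature are stand-ins for illustration, not the Phileas API, and the condition is reduced to a single minimum-confidence threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these types are the real Phileas classes.
final class SpanVisitorSketch {

    // Minimal stand-in for a span: character offsets plus a confidence.
    static final class SimpleSpan {
        final int start;
        final int end;
        final double confidence;

        SimpleSpan(int start, int end, double confidence) {
            this.start = start;
            this.end = end;
            this.confidence = confidence;
        }
    }

    // Hypothetical callback: returns the confidence to use for the span.
    @FunctionalInterface
    interface SpanVisitor {
        double adjustConfidence(String input, SimpleSpan span);
    }

    // Finds matches, lets the visitor adjust the confidence exactly once per span,
    // and only then evaluates the (simplified) confidence condition.
    static List<SimpleSpan> findSpans(Pattern pattern, String input, double baseConfidence,
                                      double minConfidence, SpanVisitor visitor) {
        final List<SimpleSpan> spans = new ArrayList<>();
        final Matcher m = pattern.matcher(input);
        while (m.find()) {
            SimpleSpan span = new SimpleSpan(m.start(), m.end(), baseConfidence);
            if (visitor != null) {
                span = new SimpleSpan(span.start, span.end, visitor.adjustConfidence(input, span));
            }
            if (span.confidence > minConfidence) {
                spans.add(span);
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        final Pattern creditCard = Pattern.compile("\\b(?:\\d[ -]*?){13,16}\\b");
        final String uuid = "47223179-9330-4259-b66c-f2db26efb20c";
        // Lower the confidence when the span is immediately followed by a dash.
        final SpanVisitor visitor = (input, span) ->
                span.end < input.length() && input.charAt(span.end) == '-' ? 0.6 : span.confidence;
        System.out.println(findSpans(creditCard, uuid, 0.9, 0.7, null).size());    // prints 1
        System.out.println(findSpans(creditCard, uuid, 0.9, 0.7, visitor).size()); // prints 0
    }
}
```

Because the visitor runs before the threshold check, no glue code is needed afterwards to correct which spans were applied.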

Anyway, I'm jazzed that the core changes are smaller than I expected, and would be even smaller if not for this hiccup around ordering 🤞

ps. Obviously any kind of feedback is welcome, no bits are too small to get wrong

Aug 16 '24 00:08 robfromboulder

@RobDickinson Awesome, thanks!

Agree about the callback. That would be a good way to do it.

The list of FilterPattern could help here but I think it will include the - in what's redacted and that might not be ideal. With the filter patterns, you can give it multiple regexes and then the one that matches the longest span will be used and the others will be discarded.

final Pattern creditCardPattern = Pattern.compile("\\b(?:\\d[ -]*?){13,16}\\b", Pattern.CASE_INSENSITIVE);
final FilterPattern creditcard1 = new FilterPattern.FilterPatternBuilder(creditCardPattern, 0.90).build();

final Pattern creditCardPatternWithPrecedingDash = Pattern.compile("-\\b(?:\\d[ -]*?){13,16}\\b", Pattern.CASE_INSENSITIVE);
final FilterPattern creditcardWithPrecedingDash = new FilterPattern.FilterPatternBuilder(creditCardPatternWithPrecedingDash, 0.60).build();

final Pattern creditCardPatternWithTrailingDash = Pattern.compile("\\b(?:\\d[ -]*?){13,16}\\b-", Pattern.CASE_INSENSITIVE);
final FilterPattern creditcardWithTrailingDash = new FilterPattern.FilterPatternBuilder(creditCardPatternWithTrailingDash, 0.60).build();

final Pattern creditCardPatternWithDashes = Pattern.compile("-\\b(?:\\d[ -]*?){13,16}\\b-", Pattern.CASE_INSENSITIVE);
final FilterPattern creditCardWithDashes = new FilterPattern.FilterPatternBuilder(creditCardPatternWithDashes, 0.50).build();

...

this.analyzer = new Analyzer(contextualTerms, creditcard1, creditcardWithPrecedingDash, creditcardWithTrailingDash, creditCardWithDashes);

But this will cause -374245455400126- to be redacted as XXXXXXXXXXXX instead of just -XXXXXXXXXXXX-. I assume that's not ideal for your use-case?
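To illustrate the boundary change, here is a small sketch using only `java.util.regex`. The `longestMatch` helper is hypothetical and only approximates the "longest span wins" behavior described above; it is not how the Analyzer is implemented.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: approximates "the longest matching span wins" with plain regexes.
final class LongestMatchSketch {

    static final String CARD = "(?:\\d[ -]*?){13,16}";
    static final Pattern PLAIN = Pattern.compile("\\b" + CARD + "\\b");
    static final Pattern PRECEDING_DASH = Pattern.compile("-\\b" + CARD + "\\b");
    static final Pattern TRAILING_DASH = Pattern.compile("\\b" + CARD + "\\b-");
    static final Pattern BOTH_DASHES = Pattern.compile("-\\b" + CARD + "\\b-");

    // Returns the longest first match among the candidate patterns, or null if none match.
    static String longestMatch(List<Pattern> patterns, String input) {
        String best = null;
        for (Pattern p : patterns) {
            final Matcher m = p.matcher(input);
            if (m.find() && (best == null || m.group().length() > best.length())) {
                best = m.group();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        final String input = "-374245455400126-";
        // Only the plain pattern: the dashes stay outside the span.
        System.out.println(longestMatch(List.of(PLAIN), input));
        // All four patterns: the longest span now includes both dashes.
        System.out.println(longestMatch(
                List.of(PLAIN, PRECEDING_DASH, TRAILING_DASH, BOTH_DASHES), input));
    }
}
```

With all four patterns the winning span covers the dashes too, so they would be replaced along with the digits.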

Aug 16 '24 12:08 jzonthemtn

> But this will cause -374245455400126- to be redacted as XXXXXXXXXXXX instead of just -XXXXXXXXXXXX-. I assume that's not ideal for your use-case?

Correct @jzonthemtn , I don't think that's ideal but check my logic :-)

We're specifically looking to reduce confidence if the credit card match appears in a longer UUID string, like this one: 47223179-9330-4259-b66c-f2db26efb20c. Our requirement is to only redact high-confidence matches, so we'd like to use a condition like confidence > 0.7 and leave those UUIDs intact.

I had exactly the same instinct to change up the regexes, but it seems like the content of the identified span in these cases is fine. Looking one character before the span and one character after the span to influence the confidence of the span seems more natural than redrawing the boundary of the span (and only having one confidence level of 0.9).
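Roughly, the idea looks like the sketch below. This is not the code from the PR: the class and method names are made up for illustration, and the mapping of 0.5 (dash on both sides) and 0.6 (dash on one side) is an assumption borrowed from the snippet earlier in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the span keeps its boundaries, only the confidence changes.
final class NeighborConfidenceSketch {

    static final Pattern CREDIT_CARD = Pattern.compile("\\b(?:\\d[ -]*?){13,16}\\b");

    // Looks one character before and one character after the span.
    static double confidence(String input, int start, int end) {
        final boolean dashBefore = start > 0 && input.charAt(start - 1) == '-';
        final boolean dashAfter = end < input.length() && input.charAt(end) == '-';
        if (dashBefore && dashAfter) {
            return 0.5; // assumed value for dashes on both sides
        }
        if (dashBefore || dashAfter) {
            return 0.6; // assumed value for a dash on one side
        }
        return 0.9;
    }

    // Masks only the spans whose confidence exceeds the threshold.
    static String redact(String input, double minConfidence) {
        final StringBuilder out = new StringBuilder(input);
        final Matcher m = CREDIT_CARD.matcher(input);
        while (m.find()) {
            if (confidence(input, m.start(), m.end()) > minConfidence) {
                for (int i = m.start(); i < m.end(); i++) {
                    out.setCharAt(i, '*');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        final String uuid = "47223179-9330-4259-b66c-f2db26efb20c";
        System.out.println(redact(uuid, 0.7)); // UUID left intact (confidence 0.6)
        System.out.println(redact(uuid, 0.0)); // leading portion masked
        System.out.println(redact("card 4111 1111 1111 1111 ok", 0.7)); // card masked (confidence 0.9)
    }
}
```

The span text is identical in every case; only the confidence differs, so a caller that wants to redact everything can still do so by dropping the condition.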

ps. I realize that not all filters are guaranteed to act alike -- but the phone number filter appears to reduce confidence based on characters outside the span. That's been helpful for reducing "false positives" (low-confidence matches) on phone numbers, so I was inspired to try to do something similar for credit cards 😀

Aug 16 '24 18:08 robfromboulder

@jzonthemtn the only other variation I can think of would be to use \b except dash instead of \b as the span boundary...but seems like that would underfit when the use-case is redacting all matches (even low-confidence ones). Seems better to overfit on the matches and then apply a confidence condition if needed.
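For comparison, a sketch of that variation. The lookaround-based pattern is just one way to write "\b except dash" and is an assumption, not something taken from Phileas.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: compares \b with a boundary that also rejects adjacent dashes.
final class DashBoundarySketch {

    static final Pattern WORD_BOUNDARY =
            Pattern.compile("\\b(?:\\d[ -]*?){13,16}\\b");

    // The match may not be preceded or followed by a word character or a dash.
    static final Pattern NO_DASH_BOUNDARY =
            Pattern.compile("(?<![\\w-])(?:\\d[ -]*?){13,16}(?![\\w-])");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        final String uuid = "47223179-9330-4259-b66c-f2db26efb20c";
        System.out.println(matches(WORD_BOUNDARY, uuid));                     // true
        System.out.println(matches(NO_DASH_BOUNDARY, uuid));                  // false
        System.out.println(matches(NO_DASH_BOUNDARY, "4111-1111-1111-1111")); // true
        System.out.println(matches(NO_DASH_BOUNDARY, "-374245455400126-"));   // false
    }
}
```

With the stricter boundary the dash-bordered cases produce no span at all, so there is nothing for a "redact everything" policy to act on.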

If there's another experiment that I could try with my local datasets, I'm happy to do so 😀

Aug 20 '24 22:08 robfromboulder

@RobDickinson Take a look at https://github.com/philterd/phileas/compare/129-credit-card-dashes?expand=1 and see what you think. I added a ConfidenceModifier class that lets you set/modify the confidence based on text before, after, or surrounding a span.

I think this would be helpful to eventually be exposed through the filter policy to let users define their own. I'd also like for it to accept a Condition at some point and let it evaluate that.

I also think this could be useful for more than just credit card numbers, so I wanted to see if it could be made more reusable. I think your idea about a callback to get the confidence is likely a more elegant solution but might be further off.

Aug 25 '24 21:08 jzonthemtn

Hey @jzonthemtn I like the approach in 129-credit-card-dashes -- especially because this can be made policy-driven (even if not at first). The callback idea is flexible but not very governable, and I think your approach is better opinionated.

I'm happy just to see the test cases from this PR get absorbed -- I didn't expect any more than that to get merged with the obvious rough hacky edges 😀

(closing without merging)

Aug 28 '24 19:08 robfromboulder

PR #142 merged.

Aug 29 '24 00:08 jzonthemtn