feature request: Mask sensitive columns

Open stephenpardy opened this issue 8 months ago • 3 comments

Issue

Users of datacompy sometimes have sensitive columns in their data (such as account IDs or other join keys). The comparison report displays these columns as-is, which can leak this information if the report is not handled carefully. Users currently need to mask the sensitive information themselves, either before using datacompy or before sending the report.

Solution

Allow users to pass in a list of column names and mask those columns' values before outputting the comparison report, e.g.:

| ACCOUNT_ID | BALANCE |
| --- | --- |
| 123 | 100.00 |
| 456 | 200.00 |
| 789 | 50.00 |

Becomes:

| ACCOUNT_ID | BALANCE |
| --- | --- |
| ***** | 100.00 |
| ***** | 200.00 |
| ***** | 50.00 |
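A minimal sketch of the masking step, using plain Python rows rather than datacompy itself. The `mask_columns` helper and its parameters are hypothetical names for illustration, not part of the datacompy API; a real implementation would presumably apply the same idea to the report's sample rows.

```python
def mask_columns(rows, sensitive, mask="*****"):
    """Return copies of `rows` with every sensitive column replaced by `mask`.

    `rows` is a list of dicts (column name -> value). The input is not
    modified. Hypothetical helper, not a datacompy function.
    """
    hidden = set(sensitive)
    return [
        {col: (mask if col in hidden else val) for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"ACCOUNT_ID": 123, "BALANCE": "100.00"},
    {"ACCOUNT_ID": 456, "BALANCE": "200.00"},
    {"ACCOUNT_ID": 789, "BALANCE": "50.00"},
]

masked = mask_columns(rows, ["ACCOUNT_ID"])
for row in masked:
    print(row)
```

Because every masked value collapses to the same string, rows can no longer be told apart by the masked column, which is the trade-off the hashing alternative addresses.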

Alternatives

An alternative to masking is to hash the values with a secure hashing algorithm before performing the comparison. Values that match will hash to the same value, so equality is still visible in the report.
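The hashing alternative could look roughly like the sketch below, again on plain Python rows. It assumes a keyed hash (HMAC-SHA256 from the standard library) rather than a bare digest, since short values such as account IDs are easy to guess by brute force when no secret key is involved. The helper names and the example key are hypothetical and not part of datacompy.

```python
import hashlib
import hmac


def hash_value(value, key):
    """Return the HMAC-SHA256 hex digest of the value's string form."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def hash_columns(rows, sensitive, key):
    """Return copies of `rows` with sensitive columns replaced by keyed hashes.

    Hypothetical helper, not a datacompy function. Both datasets must be
    hashed with the same key for matching values to stay comparable.
    """
    hidden = set(sensitive)
    return [
        {col: (hash_value(val, key) if col in hidden else val)
         for col, val in row.items()}
        for row in rows
    ]


key = b"example-secret"  # assumption: in practice, load this from a secret store

left = hash_columns([{"ACCOUNT_ID": 123, "BALANCE": "100.00"}], ["ACCOUNT_ID"], key)
right = hash_columns([{"ACCOUNT_ID": 123, "BALANCE": "105.00"}], ["ACCOUNT_ID"], key)

# Equal inputs give equal digests, so the join key still lines up,
# while the BALANCE mismatch stays visible in the clear.
print(left[0]["ACCOUNT_ID"] == right[0]["ACCOUNT_ID"])
```

Note that this sketch hashes `str(value)`, so the integer `123` and the string `"123"` produce the same digest; whether that is desirable depends on how strictly types should be compared.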

stephenpardy avatar May 09 '25 15:05 stephenpardy

@stephenpardy I've been thinking about this a bit more. Are we just thinking about the join columns here, or any columns? It feels a bit counterintuitive (at least to me): when you have a mismatch, you would want to be able to see it. If the value is masked, wouldn't that impede a user's investigation?

fdosani avatar Jul 08 '25 18:07 fdosani

@stephenpardy I think the hashing would be the way to go. The data would be masked, but you should still be able to see equivalent values in it. I am looking into adding an optional list of columns to mask, because I think that would be helpful.

@fdosani I think the hashing would resolve your concern about this being counterintuitive.

TreWSte avatar Jul 22 '25 15:07 TreWSte

I am working on this PR for it (https://github.com/capitalone/datacompy/pull/434). I still need to work on it before it gets merged, but please let me know if this idea fulfills the goal or not.

TreWSte avatar Aug 15 '25 18:08 TreWSte