feature request: Mask sensitive columns
Issue
Users of datacompy sometimes have sensitive columns in their data (such as account IDs or other join keys). The comparison report displays these columns as-is, which can leak this information if the report is not handled carefully. Users currently need to mask the sensitive information themselves, either before using datacompy or before sending the report.
Solution
Allow users to pass in a list of column names and mask those columns' values before outputting the comparison report, e.g.:
| ACCOUNT_ID | BALANCE |
| --- | --- |
| 123 | 100.00 |
| 456 | 200.00 |
| 789 | 50.00 |
Becomes:
| ACCOUNT_ID | BALANCE |
| --- | --- |
| ***** | 100.00 |
| ***** | 200.00 |
| ***** | 50.00 |
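As a rough illustration of the masking step, here is a minimal sketch in plain Python. It is not datacompy's API: the function name `mask_columns`, its parameters, and the use of a list of dicts in place of a DataFrame are all assumptions made to keep the example dependency-free.

```python
def mask_columns(rows, sensitive_columns, mask="*****"):
    """Return a copy of `rows` with the listed columns replaced by a fixed mask.

    `rows` is a list of dicts (a hypothetical stand-in for report rows).
    The input rows are not modified.
    """
    sensitive = set(sensitive_columns)
    return [
        {col: (mask if col in sensitive else value) for col, value in row.items()}
        for row in rows
    ]


# Sample rows mirroring the table above.
report_rows = [
    {"ACCOUNT_ID": 123, "BALANCE": "100.00"},
    {"ACCOUNT_ID": 456, "BALANCE": "200.00"},
    {"ACCOUNT_ID": 789, "BALANCE": "50.00"},
]

masked_rows = mask_columns(report_rows, ["ACCOUNT_ID"])
for row in masked_rows:
    print(row)
```

Note that every masked value looks the same, so a reader of the report can no longer tell which rows share an account ID.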
Alternatives
An alternative to masking is to hash the values with a secure hashing algorithm before performing the comparison. Values that match will produce the same hash, so matches and mismatches remain visible.
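The hashing alternative could look roughly like the sketch below. It uses HMAC-SHA256 from the Python standard library as one possible choice of secure hash; the keyed variant is an assumption on my part (an unkeyed hash of short IDs can be guessed by brute force), and `hash_columns`, `SECRET_KEY`, and the list-of-dicts input are hypothetical names, not part of datacompy.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would come from a secret store.
SECRET_KEY = b"example-key-not-for-production"


def hash_value(value, key=SECRET_KEY):
    """Return the HMAC-SHA256 hex digest of a value's string form."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def hash_columns(rows, sensitive_columns, key=SECRET_KEY):
    """Return a copy of `rows` with the listed columns replaced by their hashes."""
    sensitive = set(sensitive_columns)
    return [
        {col: (hash_value(v, key) if col in sensitive else v) for col, v in row.items()}
        for row in rows
    ]


# Two hypothetical datasets to compare; only account 123 appears in both.
rows_a = [{"ACCOUNT_ID": 123, "BALANCE": "100.00"}, {"ACCOUNT_ID": 456, "BALANCE": "200.00"}]
rows_b = [{"ACCOUNT_ID": 123, "BALANCE": "100.00"}, {"ACCOUNT_ID": 789, "BALANCE": "50.00"}]

hashed_a = hash_columns(rows_a, ["ACCOUNT_ID"])
hashed_b = hash_columns(rows_b, ["ACCOUNT_ID"])

# Equal inputs hash to equal digests, so the hashed column can still act as a join key.
print(hashed_a[0]["ACCOUNT_ID"] == hashed_b[0]["ACCOUNT_ID"])
```

Both datasets must be hashed with the same key, otherwise matching values will no longer line up.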
@stephenpardy I've been thinking about this a bit more. Are we only thinking about the join columns here, or any columns? It feels a bit counterintuitive (at least to me): when you have a mismatch, you would want to be able to see it. If the value is masked, wouldn't that impede a user's investigation?
@stephenpardy I think the hashing would be the way to go. The data would be masked, but you should still be able to see equivalent values in it. I am looking into adding an optional list of columns to mask, because I think that would be helpful.
@fdosani I think the hashing would resolve your concern about masking being counterintuitive.
I am working on this PR for it (https://github.com/capitalone/datacompy/pull/434). I still need to work on it before it gets merged, but please let me know if this idea fulfills the goal or not.