
Improve data redaction for PII/sensitive data

Open benkilimnik opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe.

Pixie supports a PIIRestricted data access mode that replaces instances of PII with <REDACTED_$TYPE> using redact_pii_best_effort. Currently only a limited number of PII types are redacted (IPs, emails, MAC addresses, IMEI, credit cards) and accuracy may vary.
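As a rough illustration of what pattern-based, best-effort redaction looks like, here is a minimal Python sketch. It is not Pixie's implementation of redact_pii_best_effort: the function name redact_sketch, the regular expressions, and the exact tag names are assumptions made for this example, and it covers only three of the listed types.

```python
import re

# Hypothetical patterns for illustration only; Pixie's real rules may differ.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("MAC_ADDR", re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")),
    ("IPV4", re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")),
]


def redact_sketch(text):
    """Replace each pattern match with a <REDACTED_$TYPE> tag."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"<REDACTED_{label}>", text)
    return text


print(redact_sketch("user=alice@example.com ip=10.0.0.1"))
```

Pure pattern matching like this misses free-form entities such as names, which is the gap this issue is about.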

Describe the solution you'd like

Add support for more data entities (such as names, bank account numbers, IDs) and improve the accuracy of row-based redaction (e.g. via machine learning, word-banks, keyword identification and/or schema learning).

Describe alternatives you've considered

Enable users of Pixie to run custom ML (TensorFlow) models for fine-grained data redaction.

Additional Context

Pixie supports two other data access modes:

  • full: no data is redacted from the user during script execution.
  • restricted: all rows in columns that may potentially contain sensitive data are redacted, regardless of whether they actually contain PII. By redacting the entire column, Pixie may remove useful information.
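The difference between the three modes can be sketched in Python as follows. This is only a model of the behavior described above, not Pixie code: AccessMode, apply_mode, the may_contain_pii flag, the redact_value callback, and the bare <REDACTED> placeholder for whole-column redaction are all hypothetical names chosen for the example.

```python
from enum import Enum


class AccessMode(Enum):
    FULL = "full"
    RESTRICTED = "restricted"
    PII_RESTRICTED = "pii_restricted"


def apply_mode(mode, column, may_contain_pii, redact_value):
    """Return the column values as a user would see them under `mode`."""
    # Full access, or a column not flagged as sensitive: nothing is redacted.
    if mode is AccessMode.FULL or not may_contain_pii:
        return list(column)
    # Restricted: every row is blanked, whether or not it holds PII.
    if mode is AccessMode.RESTRICTED:
        return ["<REDACTED>"] * len(column)
    # PII-restricted: each value is redacted individually, best effort.
    return [redact_value(value) for value in column]


# Assumed stand-in for a per-value redactor.
def mask_email(value):
    return value.replace("a@b.co", "<REDACTED_EMAIL>")


rows = ["GET /health", "to=a@b.co"]
print(apply_mode(AccessMode.RESTRICTED, rows, True, mask_email))
print(apply_mode(AccessMode.PII_RESTRICTED, rows, True, mask_email))
```

In this model the restricted mode discards the harmless "GET /health" row along with the email, while the PII-restricted mode keeps it.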

benkilimnik, Jun 02 '22 20:06

Progress:

  • [x] Support redaction of Social Security Numbers and International Bank Account Numbers (IBAN) in redact_pii_best_effort
  • [x] Add Privy, a command line tool for generating synthetic protocol traces similar to those that Pixie collects from pods in a Kubernetes cluster. This data may be used for demo purposes, to train PII detection models, and to evaluate existing PII identification systems.
  • [x] Train and evaluate a binary PII classification model using data generated by Privy. View Demo
  • [x] Benchmark existing Named Entity Recognition models on data generated by Privy (Presidio, SpaCy).
  • [x] Train custom token-wise PII detection model using SpaCy
  • [x] Publish dataset, results, and models to Pixie blog
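For structured identifiers such as IBANs, one common way to cut false positives is to verify the checksum of a candidate before redacting it. The sketch below applies the standard IBAN mod-97 check in Python. Whether redact_pii_best_effort performs this check is not stated in this thread, and the function name iban_checksum_ok is an assumption for the example.

```python
import re


def iban_checksum_ok(candidate):
    """Return True if `candidate` passes the IBAN mod-97 check."""
    s = re.sub(r"\s+", "", candidate).upper()
    # Country code, two check digits, then 11 to 30 alphanumerics.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 33"))
```

This checks only the overall shape and checksum; it does not validate country-specific lengths.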

Future work:

  • [ ] Deploy a custom transformer-based PII detection model to Pixie via SpaCy cpp

benkilimnik, Aug 20 '22 01:08

This is complete (other than the blog, but we can track that separately).

zasgar, Sep 23 '22 00:09