Batch anonymize CSV files via HTTP endpoint

Open dsfrederic opened this issue 2 years ago • 1 comments

Hi @omri374,
I'm looking for ways to handle PII before it enters our data platform hosted in Azure. If this is not the proper way to ask such questions, let me know!

Use case: We're trying to de-identify and anonymize PII that's coming from a local database. This flow should mainly be hosted in an Azure Data Factory self-hosted integration runtime.

Flow:

  1. Get data from source and write to local folder /RAW/file1.csv
  2. Get file paths of files to process
  3. Loop over all files that need to be processed
    1. Send csv to local http docker endpoint (docker container)
    2. (PRESIDIO) Anonymize columns X, Y, Z completely and look for PII in column A
    3. Write anonymized file to /STAGING/file1.csv
  4. Copy local folder /STAGING/file1.csv to Azure Data Lake storage

In most cases we should hash the complete column contents; in other cases the PII is part of a longer text string.

SOURCE

| Id | Name | Description | Additional field |
| --- | --- | --- | --- |
| 1 | Jane | This line is about Jane | Nothing secret here |
| 2 | Josh | This line is about Jane | Nothing to do here |

TARGET RESULT

| Id | Name | Description | Additional field |
| --- | --- | --- | --- |
| HASH41425 | HASH65243 | This line is about NAME | Nothing secret here |
| HASH95862 | HASH19765 | This line is about NAME | Nothing to do here |
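
Hashing whole columns, as in the Id and Name columns above, does not need PII detection at all. A minimal sketch using only the Python standard library follows. The function names `hash_value` and `hash_columns`, the choice of salted SHA-256, and the comma-delimited input are assumptions made for illustration; the `HASH41425`-style values in the table are placeholders, not real digests.

```python
import csv
import hashlib
import io


def hash_value(value: str, salt: str = "") -> str:
    """Return the hex SHA-256 digest of salt + value (illustrative choice of hash)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def hash_columns(csv_text: str, columns: list[str], salt: str = "") -> str:
    """Replace every cell in the given columns with its hash; leave other columns as-is."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in columns:
            row[col] = hash_value(row[col], salt)
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "Id,Name,Description,Additional field\n"
        "1,Jane,This line is about Jane,Nothing secret here\n"
    )
    print(hash_columns(sample, ["Id", "Name"], salt="example-salt"))
```

Short, guessable values such as sequential ids are easy to brute-force when hashed without a secret salt, so the salt should be treated as a secret in a real pipeline.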

I went through the documentation already and I'm not sure whether the current state of Presidio can support this use case entirely. I've come across this discussion.

Things I'm uncertain about:

  • Can Presidio handle CSV files already?
  • Is it possible to tell Presidio to hash column X entirely?
  • Is the API able to batch anonymize? Based on the API reference it seems it's not.

dsfrederic avatar Apr 20 '22 07:04 dsfrederic

Hi @dsfrederic,

Analyzing and anonymizing structured/semi-structured data is not yet supported in Presidio. In the meantime, we are collecting some feedback from users and looking for community contributions in this space.

There is no integrated support for anonymizing an entire column, but a naive approach would be to call the anonymizer on each cell.
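
The naive per-cell approach could look roughly like the sketch below. The helpers `anonymize_column` and `make_presidio_anonymizer` are hypothetical names introduced here; only `AnalyzerEngine.analyze` and `AnonymizerEngine.anonymize` are Presidio calls, and they assume the `presidio-analyzer` and `presidio-anonymizer` packages plus a spaCy language model are installed. The engines are created once and reused, since building them per cell would be slow.

```python
from typing import Callable


def anonymize_column(
    rows: list[dict[str, str]],
    column: str,
    anonymize_text: Callable[[str], str],
) -> list[dict[str, str]]:
    """Return new rows with `anonymize_text` applied to every cell of one column."""
    return [{**row, column: anonymize_text(row[column])} for row in rows]


def make_presidio_anonymizer(language: str = "en") -> Callable[[str], str]:
    """Build a text -> anonymized text function backed by Presidio (imported lazily)."""
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    def anonymize_text(text: str) -> str:
        # Detect PII entities in the cell, then replace them in the text.
        results = analyzer.analyze(text=text, language=language)
        return anonymizer.anonymize(text=text, analyzer_results=results).text

    return anonymize_text
```

Usage would be along the lines of `anonymize_column(rows, "Description", make_presidio_anonymizer())`, where `rows` comes from `csv.DictReader`. Any other callable can be passed in its place, which keeps the column logic testable without Presidio.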

This is definitely something on our roadmap, but until we get to it, we would be happy to help with specific questions around implementation.

omri374 avatar Apr 22 '22 15:04 omri374

A batch analyzer is now available in Presidio: https://microsoft.github.io/presidio/samples/python/batch_processing/. A specific sample for CSV files can be found here: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_csv.py
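
As a rough sketch of how the batch engines fit the CSV case: the file is read into a dict of column name to list of values, passed to `BatchAnalyzerEngine.analyze_dict`, and the results are handed to `BatchAnonymizerEngine.anonymize_dict`. The helper names `csv_to_columns` and `anonymize_columns` are hypothetical, and the Presidio imports assume a version that ships the batch engines; the linked sample is the authoritative reference.

```python
import csv
import io
from typing import Optional


def csv_to_columns(csv_text: str) -> dict[str, list[str]]:
    """Convert CSV text into {column name: [cell values]}, the shape analyze_dict takes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: dict[str, list[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            # Missing trailing cells come back as None; store them as empty strings.
            columns[name].append(row.get(name) or "")
    return columns


def anonymize_columns(
    columns: dict[str, list[str]],
    keys_to_skip: Optional[list[str]] = None,
    language: str = "en",
) -> dict:
    """Run Presidio batch analysis and anonymization over the columns (lazy imports)."""
    from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine
    from presidio_anonymizer import BatchAnonymizerEngine

    batch_analyzer = BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine())
    # keys_to_skip names columns that should pass through without analysis.
    results = batch_analyzer.analyze_dict(
        columns, language=language, keys_to_skip=keys_to_skip
    )
    return BatchAnonymizerEngine().anonymize_dict(results)
```

Columns that must be hashed entirely, rather than scanned for PII, would still need separate handling before or after this step.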

navalev avatar Dec 28 '22 13:12 navalev