presidio Batch anonymize CSV files via HTTP endpoint

Hi @omri374,
I'm looking for ways to handle PII before it enters our data platform hosted in Azure. If this is not the proper way to ask such questions, let me know!

Use case: We're trying to de-identify and anonymize PII that's coming from a local database. This flow should mainly run on an Azure Data Factory self-hosted integration runtime.

Flow:

1. Get data from source and write to local folder /RAW/file1.csv
2. Get file paths of files to process
3. Loop over all files that need to be processed
   1. Send CSV to local HTTP Docker endpoint (Docker container)
   2. (PRESIDIO) Anonymize columns X, Y, Z completely and look for PII in column A
   3. Write anonymized file to /STAGING/file1.csv
4. Copy local folder /STAGING/file1.csv to Azure Data Lake storage

In most cases we should hash the complete column contents; in other cases the PII is part of a longer text string.

SOURCE

Id	Name	Description	Additional field
1	Jane	This line is about Jane	Nothing secret here
2	Josh	This line is about Jane	Nothing to do here

TARGET RESULT

Id	Name	Description	Additional field
HASH41425	HASH65243	This line is about NAME	Nothing secret here
HASH95862	HASH19765	This line is about NAME	Nothing to do here
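
The "hash the complete column" part of the target result needs no PII detection at all, so it can be sketched with the Python standard library alone. The following is a minimal sketch, not a Presidio feature: the function names `hash_value` and `hash_columns`, the optional salt, and the `HASH` prefix followed by 12 hex characters of a SHA-256 digest are all assumptions chosen for illustration, and the sample data is the tab-separated SOURCE table above.

```python
import csv
import hashlib
import io


def hash_value(value: str, salt: str = "") -> str:
    """Return a deterministic token for a cell (hypothetical format: HASH + 12 hex chars)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "HASH" + digest[:12]


def hash_columns(src, dst, columns, delimiter="\t", salt=""):
    """Copy delimited data from src to dst, hashing every cell in the given columns."""
    reader = csv.DictReader(src, delimiter=delimiter)
    writer = csv.DictWriter(
        dst, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        for col in columns:
            row[col] = hash_value(row[col], salt)
        writer.writerow(row)


# Sample data taken from the SOURCE table in this post.
SOURCE = (
    "Id\tName\tDescription\tAdditional field\n"
    "1\tJane\tThis line is about Jane\tNothing secret here\n"
    "2\tJosh\tThis line is about Jane\tNothing to do here\n"
)

out = io.StringIO()
hash_columns(io.StringIO(SOURCE), out, columns=["Id", "Name"])
hashed_rows = list(csv.DictReader(io.StringIO(out.getvalue()), delimiter="\t"))
print(out.getvalue())
```

Because the hash is deterministic, the same input value always maps to the same token, which keeps joins across files possible. The free-text Description column is left untouched here, since finding names inside it is a detection problem rather than a hashing one.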

I went through the documentation already and I'm not sure whether the current state of Presidio can support this use case entirely. I've come across this discussion.

Things I'm uncertain about:

Can Presidio handle CSV files already?
Is it possible to tell Presidio to hash column X entirely?
Is the API able to batch anonymize? Based on the API reference it seems it's not.

Apr 20 '22 07:04 dsfrederic

Hi @dsfrederic,

Analyzing and anonymizing structured/semi-structured data is not yet supported in Presidio. In the meantime, we are collecting some feedback from users and looking for community contributions in this space.

There is no integrated support for anonymizing an entire column, but a naive approach would be to call the anonymizer on each cell.
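
The naive per-cell approach could look roughly like the sketch below. It is only an illustration under stated assumptions: `anonymize_cells` and `toy_anonymize` are hypothetical names, and the regex-based stand-in exists solely so the example runs without Presidio installed. The commented-out function shows how the stand-in might be swapped for Presidio's `AnalyzerEngine` and `AnonymizerEngine`; that part is not executed here.

```python
import csv
import io
import re
from typing import Callable, Dict, Iterable, Iterator, List


def anonymize_cells(
    rows: Iterable[Dict[str, str]],
    text_columns: List[str],
    anonymize_text: Callable[[str], str],
) -> Iterator[Dict[str, str]]:
    """Yield copies of rows with anonymize_text applied to each cell in text_columns."""
    for row in rows:
        cleaned = dict(row)
        for col in text_columns:
            cleaned[col] = anonymize_text(cleaned[col])
        yield cleaned


# Stand-in detector for this example only: replaces two hard-coded names.
KNOWN_NAMES = re.compile(r"\b(Jane|Josh)\b")


def toy_anonymize(text: str) -> str:
    return KNOWN_NAMES.sub("NAME", text)


# With Presidio installed, a drop-in replacement might look like this (untested here):
#
# from presidio_analyzer import AnalyzerEngine
# from presidio_anonymizer import AnonymizerEngine
# analyzer, anonymizer = AnalyzerEngine(), AnonymizerEngine()
#
# def presidio_anonymize(text: str) -> str:
#     results = analyzer.analyze(text=text, language="en")
#     return anonymizer.anonymize(text=text, analyzer_results=results).text

SOURCE = (
    "Id\tName\tDescription\tAdditional field\n"
    "1\tJane\tThis line is about Jane\tNothing secret here\n"
    "2\tJosh\tThis line is about Jane\tNothing to do here\n"
)

rows = csv.DictReader(io.StringIO(SOURCE), delimiter="\t")
cleaned_rows = list(anonymize_cells(rows, ["Description"], toy_anonymize))
for row in cleaned_rows:
    print(row["Description"])
```

Calling the engines once per cell is simple but slow on large files, since every cell goes through the full NLP pipeline separately. With Presidio's default anonymizer settings, detected names would be replaced by a placeholder such as `<PERSON>` rather than `NAME`.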

This is definitely something on our roadmap, but until we get to it, we would be happy to help with specific questions around implementation.

Apr 22 '22 15:04 omri374

A batch analyzer is now available in Presidio: https://microsoft.github.io/presidio/samples/python/batch_processing/. A specific sample for CSV files can be found here: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_csv.py

Dec 28 '22 13:12 navalev
