presidio
Batch anonymize CSV files via HTTP endpoint
Hi @omri374,
I'm looking for ways to handle PII before it enters our data platform hosted in Azure. If this is not the proper way to ask such questions, let me know!
Use case: We're trying to de-identify and anonymize PII that's coming from a local database. This flow should mainly be hosted in an Azure Data Factory self-hosted integration runtime.
Flow:
- Get data from source and write to local folder `/RAW/file1.csv`
- Get file paths of files to process
- Loop over all files that need to be processed
- Send the CSV to a local HTTP endpoint (Docker container)
- (PRESIDIO) Anonymize columns X, Y, Z completely and look for PII in column A
- Write anonymized file to `/STAGING/file1.csv`
- Copy local folder `/STAGING/file1.csv` to Azure Data Lake storage
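
To make the local part of this flow concrete, here is a rough sketch of the "loop over files, transform, write to staging" steps using only the Python standard library. The function name `process_folder` and the `transform_row` callback are hypothetical names for illustration; the actual anonymization logic would be plugged in as `transform_row`, and the copy to Azure Data Lake storage is left to Data Factory.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, str]


def process_folder(
    raw_dir: Path,
    staging_dir: Path,
    transform_row: Callable[[Row], Row],
) -> List[Path]:
    """Apply transform_row to every row of every CSV in raw_dir and write
    the result under the same file name in staging_dir."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for src in sorted(raw_dir.glob("*.csv")):
        dst = staging_dir / src.name
        with src.open(newline="", encoding="utf-8") as fin, \
                dst.open("w", newline="", encoding="utf-8") as fout:
            reader = csv.DictReader(fin)
            # An empty file has no header; fall back to an empty column list.
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames or [])
            writer.writeheader()
            for row in reader:
                writer.writerow(transform_row(row))
        written.append(dst)
    return written
```

Something like `process_folder(Path("/RAW"), Path("/STAGING"), my_transform)` would then mirror the `/RAW/file1.csv` to `/STAGING/file1.csv` step, with `my_transform` standing in for whatever per-row anonymization is chosen.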
In most cases we should hash the complete column contents; in other cases the PII is part of a longer text string.
SOURCE
Id | Name | Description | Additional field |
---|---|---|---|
1 | Jane | This line is about Jane | Nothing secret here |
2 | Josh | This line is about Jane | Nothing to do here |
TARGET RESULT
Id | Name | Description | Additional field |
---|---|---|---|
HASH41425 | HASH65243 | This line is about NAME | Nothing secret here |
HASH95862 | HASH19765 | This line is about NAME | Nothing to do here |
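
For the "hash the complete column" part of the target result, a minimal sketch without Presidio could look like the following. It assumes a salted SHA-256 hex digest is acceptable (the `HASH41425`-style values in the table are only placeholders), and `hash_value` / `hash_columns` are hypothetical helper names.

```python
import hashlib
from typing import Dict, Iterable


def hash_value(value: str, salt: str = "") -> str:
    """Return the SHA-256 hex digest of salt + value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def hash_columns(
    row: Dict[str, str], columns: Iterable[str], salt: str = ""
) -> Dict[str, str]:
    """Return a copy of row with the listed columns replaced by their hash.
    Columns that are not listed (e.g. free text) are left untouched."""
    hashed = dict(row)
    for col in columns:
        if col in hashed:
            hashed[col] = hash_value(hashed[col], salt)
    return hashed
```

Because the same input always yields the same digest, hashed `Id` values can still be used for joins downstream. The free-text `Description` column is not handled here; it needs actual PII detection.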
I went through the documentation already and I'm not sure whether the current state of Presidio can support this use case entirely. I've come across this discussion.
Things I'm uncertain about:
- Can Presidio handle CSV files already?
- Is it possible to tell Presidio to hash column X entirely?
- Is the API able to batch anonymize? Based on the API reference it seems it's not.
Hi @dsfrederic,
Analyzing and anonymizing structured/semi-structured data is not yet supported in Presidio. In the meantime, we are collecting some feedback from users and looking for community contributions in this space.
There is no integrated support for anonymizing an entire column, but a naive approach would be to call the anonymizer on each cell.
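
As a rough illustration of that naive per-cell approach, something along these lines should work. It assumes `presidio-analyzer` and `presidio-anonymizer` are installed together with an English spaCy model; `anonymize_cells` and `make_presidio_anonymizer` are hypothetical helper names, not part of Presidio.

```python
from typing import Callable, Dict, Iterable


def anonymize_cells(
    row: Dict[str, str],
    text_columns: Iterable[str],
    anonymize_text: Callable[[str], str],
) -> Dict[str, str]:
    """Return a copy of row where each non-empty cell in text_columns
    has been passed through anonymize_text."""
    out = dict(row)
    for col in text_columns:
        if out.get(col):
            out[col] = anonymize_text(out[col])
    return out


def make_presidio_anonymizer(language: str = "en") -> Callable[[str], str]:
    """Build a text -> anonymized text function backed by Presidio."""
    # Imported lazily so anonymize_cells can be used without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    def anonymize_text(text: str) -> str:
        # Detect PII entities, then replace them using the default operator.
        results = analyzer.analyze(text=text, language=language)
        return anonymizer.anonymize(text=text, analyzer_results=results).text

    return anonymize_text
```

With the default operator, detected entities are replaced by a placeholder such as `<PERSON>`. Calling the engines once per cell is slow on large files, so build them once and reuse them across rows.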
This is definitely something on our roadmap, but until we get to it, we would be happy to help with specific questions around implementation.
A batch analyzer is now available in Presidio: https://microsoft.github.io/presidio/samples/python/batch_processing/. A specific sample for CSV files can be found here: https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_csv.py
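
A minimal sketch of how the batch engines might be applied to CSV rows, loosely following the linked batch-processing sample. It assumes a Presidio version that ships `BatchAnalyzerEngine` and `BatchAnonymizerEngine`; `rows_to_columns` and `batch_anonymize` are hypothetical helper names, and the class and method names should be checked against the installed version.

```python
from typing import Dict, List


def rows_to_columns(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Turn a list of row dicts (e.g. from csv.DictReader) into a
    column name -> list of values mapping. Assumes every row has the same keys."""
    columns: Dict[str, List[str]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def batch_anonymize(columns: Dict[str, List[str]], language: str = "en"):
    """Analyze and anonymize all columns in one pass with the batch engines."""
    # Imported lazily so rows_to_columns can be used without Presidio installed.
    from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine
    from presidio_anonymizer import BatchAnonymizerEngine

    batch_analyzer = BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine())
    batch_anonymizer = BatchAnonymizerEngine()

    # analyze_dict yields one result per column; materialize before anonymizing.
    analyzer_results = list(batch_analyzer.analyze_dict(columns, language=language))
    return batch_anonymizer.anonymize_dict(analyzer_results)
```

The returned mapping has the same column layout as the input, so it can be written back out as a CSV or loaded into a DataFrame.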