
Parallelised or distributed version

Open · Scarlethue opened this issue 10 years ago · 3 comments

I am looking at searching and anonymising data with a large number of records (at least 10m). One of the use cases is horizontally integrating results from multiple locations without sharing the raw data. While the FLASH implementation is very fast, it does not currently appear to be parallelised for large local datasets or distributable across partitioned datasets.

Scarlethue · Nov 17 '14 07:11

(1) Anonymizing distributed datasets in a privacy-preserving manner

You might want to take a look at the approach that we developed based on ARX:

Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Klaus A. Kuhn. A Flexible Approach to Distributed Data Anonymization. Journal of Biomedical Informatics, December 2013. http://dx.doi.org/10.1016/j.jbi.2013.12.002 (* Both authors contributed equally to this work.)

In this paper you will also find an overview of other potential solutions to this problem.

(2) Parallelizing ARX itself

We do have a private fork that prototypically parallelizes ARX to better exploit modern multi-core architectures. We might add this functionality to ARX in a future release, depending on demand. We currently have no plans to develop a version of ARX that supports scale-out across a cluster. You should be able to run ARX on datasets consisting of tens of millions of records on current server hardware. If you experience any limitations, please let us know.
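For context, a minimal sketch of single-machine anonymization with ARX's Java API. The data, hierarchy, and 2-anonymity/suppression settings are illustrative assumptions, not recommendations; method names follow recent ARX releases (older releases used `addCriterion`/`setMaxOutliers` instead of `addPrivacyModel`/`setSuppressionLimit`):

```java
import java.util.Iterator;

import org.deidentifier.arx.ARXAnonymizer;
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.ARXResult;
import org.deidentifier.arx.AttributeType;
import org.deidentifier.arx.AttributeType.Hierarchy;
import org.deidentifier.arx.AttributeType.Hierarchy.DefaultHierarchy;
import org.deidentifier.arx.Data;
import org.deidentifier.arx.Data.DefaultData;
import org.deidentifier.arx.criteria.KAnonymity;

public class ArxSketch {
    public static void main(String[] args) throws Exception {
        // Tiny in-memory table for illustration; for large inputs,
        // Data.create(path, charset, separator) streams a CSV file instead.
        DefaultData data = Data.create();
        data.add("age", "zipcode");
        data.add("34", "81667");
        data.add("45", "81675");
        data.add("66", "81925");

        // Generalization hierarchy for the quasi-identifier "zipcode".
        DefaultHierarchy zipcode = Hierarchy.create();
        zipcode.add("81667", "8166*", "816**", "*****");
        zipcode.add("81675", "8167*", "816**", "*****");
        zipcode.add("81925", "8192*", "819**", "*****");

        data.getDefinition().setAttributeType("age", AttributeType.INSENSITIVE_ATTRIBUTE);
        data.getDefinition().setAttributeType("zipcode", zipcode);

        // 2-anonymity with up to 2% record suppression (illustrative values).
        ARXConfiguration config = ARXConfiguration.create();
        config.addPrivacyModel(new KAnonymity(2));
        config.setSuppressionLimit(0.02d);

        ARXResult result = new ARXAnonymizer().anonymize(data, config);

        // Print the transformed rows (the first element is the header row).
        Iterator<String[]> it = result.getOutput(false).iterator();
        while (it.hasNext()) {
            System.out.println(String.join(",", it.next()));
        }
    }
}
```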

prasser · Nov 24 '14 12:11

@prasser I am trying to use ARX in my existing Spark data ingestion pipelines and am looking for guidance. I was originally planning to extend the dataframe, convert it into an ARX Data object, and run the anonymizer, but I am not sure whether this approach would work for large datasets.

lordlinus · Jan 09 '20 04:01

I'm not very familiar with Spark, so it's hard for me to help without further details. In general, you need to create horizontal partitions, process the partitions independently, and merge the results. If a dataframe allows you to implement this, then it is the right way to go.
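To make the partition-and-merge idea concrete, a minimal sketch using Spark's Java API: each partition is converted into an ARX Data object and anonymized independently, and the generalized rows are emitted back. The file name, two-column schema, and 2-anonymity model are illustrative assumptions, and hierarchy configuration is omitted for brevity (it would follow the single-machine sketch above):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import org.deidentifier.arx.ARXAnonymizer;
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.ARXResult;
import org.deidentifier.arx.Data;
import org.deidentifier.arx.Data.DefaultData;
import org.deidentifier.arx.criteria.KAnonymity;

public class SparkPartitionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("arx-partitions").getOrCreate();

        // Hypothetical two-column input; adjust the schema to your data.
        Dataset<Row> df = spark.read().option("header", "true").csv("records.csv");

        JavaRDD<String[]> anonymized = df.javaRDD().mapPartitions(
            (FlatMapFunction<Iterator<Row>, String[]>) rows -> {
                // Build one ARX Data object per horizontal partition.
                DefaultData data = Data.create();
                data.add("age", "zipcode");
                while (rows.hasNext()) {
                    Row r = rows.next();
                    data.add(r.getString(0), r.getString(1));
                }

                // Attribute types and hierarchies for the quasi-identifiers
                // must be configured exactly as in the sketch above; omitted here.
                ARXConfiguration config = ARXConfiguration.create();
                config.addPrivacyModel(new KAnonymity(2));
                ARXResult result = new ARXAnonymizer().anonymize(data, config);

                List<String[]> out = new ArrayList<>();
                Iterator<String[]> it = result.getOutput(false).iterator();
                if (it.hasNext()) it.next(); // drop the per-partition header row
                it.forEachRemaining(out::add);
                return out.iterator();
            });

        anonymized.take(10).forEach(row -> System.out.println(String.join(",", row)));
    }
}
```

Note that enforcing a privacy model on each partition independently does not automatically yield the same guarantee on the merged output for every partitioning scheme; the Kohlmayer/Prasser paper cited above discusses protocols for distributed anonymization in detail.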

prasser · Jan 20 '20 13:01