Parallelised or distributed version
I am looking at searching and anonymising data with a large number of records (at least 10 million). One of the use cases is horizontally integrating results from multiple locations without sharing the raw data. While the Flash implementation is very fast, it currently does not appear to be parallelised for large local datasets or distributable across partitioned datasets.
(1) Anonymizing distributed datasets in a privacy-preserving manner
You might want to take a look at the approach that we developed based on ARX:
Florian Kohlmayer*, Fabian Prasser*, Claudia Eckert, Klaus A. Kuhn. A Flexible Approach to Distributed Data Anonymization. Journal of Biomedical Informatics, December 2013. http://dx.doi.org/10.1016/j.jbi.2013.12.002 (* Both authors contributed equally to this work.)
In this paper you will also find an overview of other potential solutions to this problem.
(2) Parallelizing ARX itself
We do have a private fork that prototypically parallelizes ARX to better exploit modern multi-core architectures. We might add this functionality to ARX in a future release, depending on demand. We currently have no plans to develop a version of ARX that supports scale-out in a cluster. You should be able to run ARX on datasets consisting of tens of millions of records on current server hardware. If you experience any limitations, please let us know.
@prasser I am trying to use ARX in my existing Spark data ingestion pipelines and am looking for guidance. I was originally thinking of extending the dataframe, converting it into an ARX Data object and running the anonymizer, but I am not sure whether this approach would work for large datasets.
I'm not very familiar with Spark, so it's hard for me to help without further details. In general, you need to create horizontal partitions, process the partitions independently, and merge the results. If a dataframe allows you to implement this, then it is the right way to go.
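As a rough, untested sketch of that partition-and-merge idea: the code below converts the rows of each Spark partition into an ARX Data object, runs the anonymizer on it, and emits the transformed records. The column names ("gender", "zipcode"), the toy generalization hierarchy, and the k-anonymity parameters are placeholders, not part of anything discussed above; API details may differ between ARX/Spark versions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.deidentifier.arx.ARXAnonymizer;
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.ARXResult;
import org.deidentifier.arx.AttributeType;
import org.deidentifier.arx.AttributeType.Hierarchy.DefaultHierarchy;
import org.deidentifier.arx.Data;
import org.deidentifier.arx.DataHandle;
import org.deidentifier.arx.criteria.KAnonymity;

public class PartitionedAnonymization {

    // Anonymizes one horizontal partition. The same configuration (attribute
    // types, hierarchies, privacy model) must be applied to every partition,
    // otherwise the partial results cannot be merged meaningfully.
    public static Iterator<String[]> anonymizePartition(Iterator<Row> rows) throws Exception {

        // Build an ARX Data object from the partition's rows
        Data.DefaultData data = Data.create();
        data.add("gender", "zipcode"); // header; placeholder column names
        while (rows.hasNext()) {
            Row r = rows.next();
            data.add(r.getString(0), r.getString(1));
        }

        // Illustrative generalization hierarchy; in practice it must cover
        // every value that can occur in any partition
        DefaultHierarchy zipcode = AttributeType.Hierarchy.create();
        zipcode.add("81667", "8166*", "816**", "*****");
        zipcode.add("81675", "8167*", "816**", "*****");

        data.getDefinition().setAttributeType("gender", AttributeType.INSENSITIVE_ATTRIBUTE);
        data.getDefinition().setAttributeType("zipcode", zipcode);

        // Example privacy model: 5-anonymity with a 2% suppression limit
        ARXConfiguration config = ARXConfiguration.create();
        config.addPrivacyModel(new KAnonymity(5));
        config.setSuppressionLimit(0.02d);

        ARXResult result = new ARXAnonymizer().anonymize(data, config);
        DataHandle output = result.getOutput();

        // Collect the transformed records, skipping the header row
        List<String[]> anonymized = new ArrayList<>();
        Iterator<String[]> it = output.iterator();
        if (it.hasNext()) it.next(); // header
        while (it.hasNext()) anonymized.add(it.next());
        return anonymized.iterator();
    }

    // Applies the per-partition anonymization to a Spark DataFrame
    public static JavaRDD<String[]> anonymize(Dataset<Row> df) {
        return df.javaRDD().mapPartitions(PartitionedAnonymization::anonymizePartition);
    }
}
```

One caveat on the design: anonymizing each partition independently only enforces the privacy model within that partition; whether the merged output satisfies the desired guarantee globally depends on the protocol used, which is the problem the distributed approach in the paper referenced above addresses.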