PipelineDP

Auto-tuning contribution bounding parameters

Open dvadym opened this issue 2 years ago • 2 comments

Context

DPEngine.aggregate is the API function that performs DP aggregation. Its parameters include those related to contribution bounding (max_partitions_contributed, max_contributions_per_partition).

If a privacy unit contributes more than max_partitions_contributed or max_contributions_per_partition allows, its contributions are sub-sampled (the definition of a privacy unit and more on contribution bounding is here). So the smaller these parameters are, the larger the share of input data that is dropped, but the less DP noise is added. Choosing them is therefore a trade-off, and the optimal (in some sense) values are data dependent.
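To make the trade-off concrete, here is a minimal, non-private sketch in plain Python. It is not PipelineDP code: the helper names `bound_contributions` and `laplace_scale_for_count` are hypothetical, and it assumes a COUNT metric with Laplace noise, where the noise scale is the L1 sensitivity (max_partitions_contributed * max_contributions_per_partition) divided by epsilon.

```python
import random
from collections import defaultdict


def bound_contributions(rows, max_partitions_contributed,
                        max_contributions_per_partition, seed=0):
    """Sub-samples rows of (privacy_unit, partition_key) to respect both bounds.

    Hypothetical helper for illustration only, not the PipelineDP implementation.
    """
    rng = random.Random(seed)
    # Group rows as privacy_unit -> partition_key -> list of rows.
    by_unit = defaultdict(lambda: defaultdict(list))
    for unit, partition in rows:
        by_unit[unit][partition].append((unit, partition))

    kept = []
    for partitions in by_unit.values():
        keys = list(partitions)
        # Cross-partition bounding: keep at most max_partitions_contributed keys.
        if len(keys) > max_partitions_contributed:
            keys = rng.sample(keys, max_partitions_contributed)
        for key in keys:
            contributions = partitions[key]
            # Per-partition bounding: keep at most
            # max_contributions_per_partition rows for this key.
            if len(contributions) > max_contributions_per_partition:
                contributions = rng.sample(contributions,
                                           max_contributions_per_partition)
            kept.extend(contributions)
    return kept


def laplace_scale_for_count(max_partitions_contributed,
                            max_contributions_per_partition, epsilon):
    """Laplace noise scale for a DP count: L1 sensitivity / epsilon (assumed setup)."""
    return max_partitions_contributed * max_contributions_per_partition / epsilon


# Toy data: unit "a" has 2 rows in each of 3 partitions, unit "b" has 1 row.
rows = [("a", p) for p in ("x", "y", "z") for _ in range(2)] + [("b", "x")]

for l0, linf in [(1, 1), (3, 2)]:
    kept = bound_contributions(rows, l0, linf)
    dropped_ratio = 1 - len(kept) / len(rows)
    scale = laplace_scale_for_count(l0, linf, epsilon=1.0)
    print(f"l0={l0} linf={linf} dropped={dropped_ratio:.2f} noise_scale={scale}")
```

Tight bounds drop most of the heavy contributor's rows but keep the noise scale small; loose bounds keep every row at the cost of a larger noise scale.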

Goals

To introduce a differentially private way of estimating "good" max_partitions_contributed and max_contributions_per_partition, i.e. values that allow reasonable accuracy to be achieved.

Note: this is a research-type problem; it will require experimenting and knowledge of Differential Privacy. There is no clear best solution, and there are many possible approaches.

Some ideas:

  1. max_partitions_contributed and max_contributions_per_partition might be chosen independently (that is probably at least a good start).
  2. Choosing max_partitions_contributed and max_contributions_per_partition might be treated as a selection problem, and for example the exponential mechanism might be used.
  3. Many different score functions could be used: for example, some trade-off between dropped data and DP noise, or keeping the data dropped because of sampling around some fixed ratio (say 1%).
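As a rough illustration of ideas 2 and 3, the sketch below selects max_partitions_contributed from a fixed candidate list with the exponential mechanism. Everything in it is an assumption made for illustration, not a proposed design: the function names are hypothetical, the input is the number of partitions each privacy unit contributes to, and the score is the negated distance between the number of privacy units exceeding a candidate and a target fraction of all units. That score is a simplification of "dropped data around a fixed ratio"; it is used here because adding or removing one privacy unit changes it by at most 1, so the sensitivity is 1.

```python
import math
import random


def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Samples a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    # Subtracting the max score keeps the exponent non-positive; it does not
    # change the sampling distribution.
    max_score = max(scores)
    weights = [
        math.exp(epsilon * (s - max_score) / (2 * sensitivity)) for s in scores
    ]
    return rng.choices(candidates, weights=weights, k=1)[0]


def choose_max_partitions_contributed(partitions_per_unit, candidates, epsilon,
                                      target_fraction=0.01, seed=None):
    """Hypothetical selector: picks a bound so that roughly target_fraction of
    privacy units exceed it (assumed score function, sensitivity 1)."""
    rng = random.Random(seed)
    n_units = len(partitions_per_unit)
    scores = []
    for k in candidates:
        units_over = sum(1 for c in partitions_per_unit if c > k)
        scores.append(-abs(units_over - target_fraction * n_units))
    return exponential_mechanism(candidates, scores, epsilon, 1.0, rng)


# Toy input: number of partitions each of 100 privacy units contributes to.
partitions_per_unit = [1] * 90 + [3] * 5 + [7] * 4 + [50]
candidates = [1, 2, 5, 10, 50]

chosen = choose_max_partitions_contributed(
    partitions_per_unit, candidates, epsilon=1.0, target_fraction=0.05, seed=0)
print("chosen max_partitions_contributed:", chosen)
```

With a small epsilon the choice is close to uniform over the candidates, and with a large epsilon it concentrates on the best-scoring one. The same pattern could be run separately for max_contributions_per_partition, in line with idea 1, with the privacy budget split between the two selections.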

I'm ready to help in discussions and experimenting.

dvadym avatar Apr 13 '22 09:04 dvadym

I would like to work on this.

RamSaw avatar May 13 '22 08:05 RamSaw

Thank you!

dvadym avatar May 13 '22 10:05 dvadym