
Vector summation DP aggregation

Open dvadym opened this issue 2 years ago • 7 comments

Context

DPEngine.aggregate performs DP aggregations of scalar values (sum, count, mean, etc.). The set of computed metrics is controlled by the metrics field of the aggregate_params argument. The result of this function is a collection of (partition_key, named_tuple_with_requested_metrics) pairs.

Note: More details on the terminology are here.

Goals

Support of vector_sum in DPEngine.aggregate

The goal is to implement full support of vector_sum in DPEngine.aggregate, i.e. the values to aggregate are arrays of the same size, and the output is (partition_key, named_tuple["array_sum": sum_of_vectors_per_partition_key]).

References:

  1. All metrics are aggregated with combiners (e.g. SumCombiner)
  2. There is already a low-level function that applies the Laplace/Gaussian mechanism to NumPy arrays

This task can be split into 2 parts:

  1. Implementing VectorSumCombiner, which performs the aggregation
  2. Plumbing VectorSumCombiner into DPEngine.aggregate
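To make part 1 more concrete, here is a minimal, self-contained sketch of what a vector-sum combiner could look like. It is not PipelineDP's actual class: the name VectorSumCombinerSketch, the constructor arguments, per-coordinate (L-infinity) clipping, and passing the Laplace scale in directly are all assumptions made for illustration. Only the three-step combiner shape (create accumulator, merge, compute metrics) and the np.ndarray accumulator follow the discussion in this issue.

```python
import numpy as np


class VectorSumCombinerSketch:
    """Hypothetical combiner whose accumulator is an np.ndarray.

    Not part of PipelineDP; all names and arguments are illustrative.
    """

    def __init__(self, vector_size, max_norm, noise_scale, rng=None):
        self._vector_size = vector_size
        self._max_norm = max_norm        # assumed L-infinity clipping bound
        self._noise_scale = noise_scale  # assumed precomputed Laplace scale
        self._rng = rng if rng is not None else np.random.default_rng()

    def create_accumulator(self, values):
        """Sums the clipped vectors of one privacy unit in one partition."""
        acc = np.zeros(self._vector_size)
        for v in values:
            v = np.asarray(v, dtype=float)
            if v.shape != (self._vector_size,):
                raise ValueError(f"expected shape ({self._vector_size},), "
                                 f"got {v.shape}")
            # Clip every coordinate into [-max_norm, max_norm].
            acc += np.clip(v, -self._max_norm, self._max_norm)
        return acc

    def merge_accumulators(self, acc1, acc2):
        """Element-wise sum of two accumulators."""
        return acc1 + acc2

    def compute_metrics(self, acc):
        """Adds independent Laplace noise to every coordinate."""
        noise = self._rng.laplace(0.0, self._noise_scale, size=acc.shape)
        return {"array_sum": acc + noise}
```

In the real implementation the noise step would presumably go through the existing low-level function mentioned in the references rather than calling the random generator directly, and the noise scale would be derived from the privacy budget.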

Expose vector_sum computation to high-level Beam and Spark APIs.

High-level Beam and Spark APIs are represented by the PrivatePCollection and PrivateRDD classes and the transformations on them. All DP computations are performed in DPEngine. PrivatePCollection and PrivateRDD keep data in an internal collection (PCollection or RDD, respectively). They guarantee that only data that has been aggregated in a DP manner, using no more than the specified privacy budget, can be extracted. Private Beam and Private Spark transformations are wrappers around DPEngine.aggregate. There are transformations for COUNT, MEAN, etc.

The variance transformation can be used as a good example:

  1. Beam implementation, tests.
  2. Spark implementation, tests

dvadym avatar Apr 18 '22 12:04 dvadym

I can take a look at this one

rialg avatar May 10 '22 09:05 rialg

Sure, go ahead! Thanks!

dvadym avatar May 10 '22 10:05 dvadym

IIUC, the VectorSumCombiner class will be similar to SumCombiner, but with AccumulatorType = np.ndarray. Is this correct?

rialg avatar May 12 '22 13:05 rialg

Yes, correct

dvadym avatar May 12 '22 13:05 dvadym

In order to use add_noise_vector, an object of AdditiveVectorNoiseParams needs to be created. AFAIK, CombinerParams should contain the attributes needed to populate AdditiveVectorNoiseParams. Would it make sense to extend AggregateParams with the missing fields for the vector noise?

For instance:

    max_norm: float
    l0_sensitivity: float
    linf_sensitivity: float
    norm_kind: pipeline_dp.aggregate_params.NormKind

rialg avatar May 12 '22 15:05 rialg

Good question, we need to introduce max_norm and norm_kind in AggregateParams.

    l0_sensitivity = max_partitions_contributed
    linf_sensitivity = max_contributions_per_partition
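As a rough illustration of how this mapping could feed the noise calibration, the sketch below derives per-coordinate L1 and L2 bounds from the two contribution limits and max_norm. The VectorNoiseSettings class and both helper functions are hypothetical, not PipelineDP API. The sketch also assumes L-infinity clipping, so that each coordinate of a single contribution lies in [-max_norm, max_norm]; other norm kinds would need different bounds.

```python
import math
from dataclasses import dataclass


@dataclass
class VectorNoiseSettings:
    """Hypothetical container; field names follow the discussion above."""
    max_partitions_contributed: int
    max_contributions_per_partition: int
    max_norm: float  # assumed per-coordinate (L-infinity) clipping bound

    @property
    def l0_sensitivity(self):
        # Number of partitions one privacy unit can influence.
        return self.max_partitions_contributed

    @property
    def linf_sensitivity(self):
        # Number of contributions one privacy unit makes per partition.
        return self.max_contributions_per_partition


def per_coordinate_l1_bound(s):
    """L1 bound for one coordinate: l0 * linf * max_norm."""
    return s.l0_sensitivity * s.linf_sensitivity * s.max_norm


def per_coordinate_l2_bound(s):
    """L2 bound for one coordinate: sqrt(l0) * linf * max_norm."""
    return math.sqrt(s.l0_sensitivity) * s.linf_sensitivity * s.max_norm


settings = VectorNoiseSettings(max_partitions_contributed=4,
                               max_contributions_per_partition=2,
                               max_norm=1.5)
l1 = per_coordinate_l1_bound(settings)  # 4 * 2 * 1.5 = 12.0
l2 = per_coordinate_l2_bound(settings)  # 2 * 2 * 1.5 = 6.0
```

The L1 bound is what a Laplace mechanism would typically be calibrated to, and the L2 bound is the Gaussian counterpart.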

dvadym avatar May 13 '22 06:05 dvadym

I'm trying to include VectorSumCombiner in DPEngine.aggregate, but I need to understand whether it should be used with the CompoundCombiner class. Should this case be handled as a separate branch in create_compound_combiner, similar to what happens with the metric pipeline_dp.Metrics.PRIVACY_ID_COUNT?

rialg avatar May 13 '22 13:05 rialg