Group by / subsample and aggregate
A partial implementation is up in the groupby-aggregate branch: https://github.com/opendifferentialprivacy/whitenoise-core/tree/groupby-aggregate
What is desired: A unified system to split data, release aggregations of splits, and release aggregations of aggregations of splits.
Components output an instance of a Value:
```rust
pub enum Value {
    // a homogeneously typed n-dimensional array
    ArrayND(ArrayND),
    // values keyed by partition index or category
    Hashmap(Hashmap<Value>),
    // a 2D vector whose rows may have different lengths
    Vector2DJagged(Vector2DJagged),
}
```
A new component, Partition, is added that takes in an ArrayND and emits a Hashmap<ArrayND>. Properties are tracked separately for each partition. When splitting equally, n remains statically known; when splitting by a clamped categorical variable, n is lost and can be reclaimed via Count/Resize on Hashmaps. The keys of the hashmap are either indexes (I64) or categories (Str, Bool, I64).
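For concreteness, here is a minimal sketch of what partitioning by a clamped categorical key could look like at the runtime level. The `ArrayND`/`Category` stand-ins and the `partition_by_category` function are simplifications for illustration, not the branch's actual types or API:

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; the real ArrayND and Hashmap types
// are richer (typed, multi-dimensional, with separate key variants).
type ArrayND = Vec<Vec<f64>>; // rows of f64 values
type Category = i64;          // a clamped categorical key (could also be Str or Bool)

/// Hypothetical partitioner: split the rows of an ArrayND by a clamped categorical
/// column, emitting a hashmap from category to the rows falling in that category.
/// Per-partition row counts are data-dependent here, which is why n must be
/// re-established downstream via Count/Resize.
fn partition_by_category(
    data: ArrayND,
    key_column: usize,
    categories: &[Category],
) -> HashMap<Category, ArrayND> {
    let mut parts: HashMap<Category, ArrayND> =
        categories.iter().map(|c| (*c, Vec::new())).collect();
    for row in data {
        let key = row[key_column] as Category;
        // rows whose key falls outside the clamped category set are dropped in this sketch
        if let Some(part) = parts.get_mut(&key) {
            part.push(row);
        }
    }
    parts
}
```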
Runtime: aggregators are implemented for both the ArrayND and Hashmap variants. The ArrayND implementation emits an ArrayND with one row; the Hashmap implementation maps over the values of the Hashmap of ArrayNDs and emits an ArrayND with k rows, where k is the number of categories/partitions.
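A rough sketch of the two variants for a mean aggregator, again using the simplified stand-in types above rather than the real runtime signatures:

```rust
use std::collections::HashMap;

type ArrayND = Vec<Vec<f64>>; // stand-in, as above
type Category = i64;

/// Hypothetical ArrayND variant of a mean aggregator: reduce all rows of one
/// partition to a single row of column means (an ArrayND with one row).
fn mean_rows(rows: &ArrayND) -> Vec<f64> {
    let d = rows.first().map_or(0, |r| r.len());
    (0..d)
        .map(|j| rows.iter().map(|r| r[j]).sum::<f64>() / rows.len() as f64)
        .collect()
}

/// Hypothetical Hashmap variant: map the ArrayND aggregator over every partition
/// and stack the results into an ArrayND with k rows, one per category/partition.
fn mean_partitions(parts: &HashMap<Category, ArrayND>) -> ArrayND {
    let mut keys: Vec<_> = parts.keys().copied().collect();
    keys.sort(); // fix a deterministic row order for the k output rows
    keys.iter().map(|k| mean_rows(&parts[k])).collect()
}
```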
Validator: sensitivity propagation cases within each aggregator's compute_sensitivity impl, listed as [data input]: [action]:
- Non-aggregated, non-partitioned: compute sensitivity, return 1*d sensitivity matrix
- Non-aggregated, partitioned: compute sensitivity for each partition, return k*d sensitivity matrix
- Aggregated, non-partitioned: combine sensitivities, return 1*d sensitivity matrix
- Aggregated, partitioned: only applicable for multi-tier; not currently representable
Sensitivity functions have been written for the first three cases on each aggregator. Case 1 was already written, case 2 is often a mapping of case 1, and case 3 is sparse.
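To illustrate case 2 being a mapping of case 1, a hypothetical helper (not from the branch) that stacks per-partition case-1 sensitivities into a k*d matrix:

```rust
use std::collections::HashMap;

/// Hypothetical case-2 helper: apply a case-1 sensitivity function
/// (1*d per partition) to every partition and stack the rows into a
/// k*d sensitivity matrix, one row per partition.
fn partitioned_sensitivity(
    partitions: &HashMap<i64, Vec<Vec<f64>>>,
    case1_sensitivity: impl Fn(&Vec<Vec<f64>>) -> Vec<f64>,
) -> Vec<Vec<f64>> {
    let mut keys: Vec<_> = partitions.keys().copied().collect();
    keys.sort(); // rows of the sensitivity matrix follow sorted partition keys
    keys.iter()
        .map(|k| case1_sensitivity(&partitions[k]))
        .collect()
}
```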
Case 3 (combining sensitivities) is currently only implemented for the mean: take the maximum sensitivity over the partitions and multiply it by the sensitivity of the second aggregation. Taking the max here was mentioned by Christian.
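A sketch of that combination rule, under the simplifying assumption that the second aggregation is an unweighted mean over the k partition rows, so its per-column scaling factor is 1/k. Names and types are illustrative only:

```rust
/// Hypothetical case-3 combination for the mean (not the branch's actual code):
/// take the maximum per-column sensitivity across the k partitions, then scale
/// it by the sensitivity of the second aggregation; for an unweighted mean over
/// the k partition rows that factor is assumed to be 1/k.
fn combined_mean_sensitivity(partition_sensitivities: &[Vec<f64>]) -> Vec<f64> {
    let k = partition_sensitivities.len() as f64;
    let d = partition_sensitivities.first().map_or(0, |s| s.len());
    (0..d)
        .map(|j| {
            let max_s = partition_sensitivities
                .iter()
                .map(|s| s[j])
                .fold(f64::NEG_INFINITY, f64::max);
            max_s * (1.0 / k) // sensitivity of the outer mean over k rows
        })
        .collect()
}
```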
These were implemented together because there is significant overlap in functionality. Say a user partitions their data and aggregates: this is the group-by case. Now the user aggregates again: this is the subsample-and-aggregate case, which is needed for unbiased privacy.