Group by / subsample and aggregate
A partial implementation is up in the groupby-aggregate branch: https://github.com/opendifferentialprivacy/whitenoise-core/tree/groupby-aggregate
What is desired: A unified system to split data, release aggregations of splits, and release aggregations of aggregations of splits.
Components output an instance of a Value:
```rust
pub enum Value {
    // a homogeneously typed n-dimensional array
    ArrayND(ArrayND),
    // values keyed by partition index or category
    Hashmap(Hashmap<Value>),
    // a 2D vector whose rows may have different lengths
    Vector2DJagged(Vector2DJagged),
}
```
A new component, Partition, is added that takes in an ArrayND and emits a Hashmap<ArrayND>. Properties are tracked separately for each partition. When splitting equally, n remains statically known; when splitting by a clamped categorical variable, n is lost and can be reclaimed via Count/Resize on Hashmaps. The keys of the hashmap are either indexes (I64) or categories (Str, Bool, I64).
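For concreteness, here is a minimal sketch of what partitioning by a clamped categorical key could look like at the runtime level. The `ArrayND`/`Category` stand-ins and the `partition_by_category` function are simplifications for illustration, not the branch's actual types or API:

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; the real ArrayND and Hashmap types
// are richer (typed, multi-dimensional, with separate key variants).
type ArrayND = Vec<Vec<f64>>; // rows of f64 values
type Category = i64;          // a clamped categorical key (could also be Str or Bool)

/// Hypothetical partitioner: split the rows of an ArrayND by a clamped categorical
/// column, emitting a hashmap from category to the rows falling in that category.
/// Per-partition row counts are data-dependent here, which is why n must be
/// re-established downstream via Count/Resize.
fn partition_by_category(
    data: ArrayND,
    key_column: usize,
    categories: &[Category],
) -> HashMap<Category, ArrayND> {
    let mut parts: HashMap<Category, ArrayND> =
        categories.iter().map(|c| (*c, Vec::new())).collect();
    for row in data {
        let key = row[key_column] as Category;
        // rows whose key falls outside the clamped category set are dropped in this sketch
        if let Some(part) = parts.get_mut(&key) {
            part.push(row);
        }
    }
    parts
}
```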
Runtime: aggregators are implemented for both the ArrayND and Hashmap variants. The ArrayND implementation emits an ArrayND with one row; the Hashmap implementation maps over the values of the Hashmap of ArrayNDs and emits an ArrayND with k rows, where k is the number of categories/partitions.
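A rough sketch of the two variants for a mean aggregator, again using the simplified stand-in types above rather than the real runtime signatures:

```rust
use std::collections::HashMap;

type ArrayND = Vec<Vec<f64>>; // stand-in, as above
type Category = i64;

/// Hypothetical ArrayND variant of a mean aggregator: reduce all rows of one
/// partition to a single row of column means (an ArrayND with one row).
fn mean_rows(rows: &ArrayND) -> Vec<f64> {
    let d = rows.first().map_or(0, |r| r.len());
    (0..d)
        .map(|j| rows.iter().map(|r| r[j]).sum::<f64>() / rows.len() as f64)
        .collect()
}

/// Hypothetical Hashmap variant: map the ArrayND aggregator over every partition
/// and stack the results into an ArrayND with k rows, one per category/partition.
fn mean_partitions(parts: &HashMap<Category, ArrayND>) -> ArrayND {
    let mut keys: Vec<_> = parts.keys().copied().collect();
    keys.sort(); // fix a deterministic row order for the k output rows
    keys.iter().map(|k| mean_rows(&parts[k])).collect()
}
```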
Validator: sensitivity propagation cases within each aggregator's compute_sensitivity impl, listed as [data input]: [action]:
- Non-aggregated, non-partitioned: compute sensitivity, return 1*d sensitivity matrix
- Non-aggregated, partitioned: compute sensitivity for each partition, return k*d sensitivity matrix
- Aggregated, non-partitioned: combine sensitivities, return 1*d sensitivity matrix
- Aggregated, partitioned: only applicable for multi-tier; not currently representable
Sensitivity functions have been written for the first three cases on each aggregator. Case 1 was already written, case 2 is often a mapping of case 1, and case 3 is sparse.
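To illustrate case 2 being a mapping of case 1, a hypothetical helper (not from the branch) that stacks per-partition case-1 sensitivities into a k*d matrix:

```rust
use std::collections::HashMap;

/// Hypothetical case-2 helper: apply a case-1 sensitivity function
/// (1*d per partition) to every partition and stack the rows into a
/// k*d sensitivity matrix, one row per partition.
fn partitioned_sensitivity(
    partitions: &HashMap<i64, Vec<Vec<f64>>>,
    case1_sensitivity: impl Fn(&Vec<Vec<f64>>) -> Vec<f64>,
) -> Vec<Vec<f64>> {
    let mut keys: Vec<_> = partitions.keys().copied().collect();
    keys.sort(); // rows of the sensitivity matrix follow sorted partition keys
    keys.iter()
        .map(|k| case1_sensitivity(&partitions[k]))
        .collect()
}
```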
Case 3 (combining sensitivities) is currently only implemented for the mean: take the maximum sensitivity over the partitions and multiply it by the sensitivity of the second aggregation. Taking the max here was mentioned by Christian.
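A sketch of that combination rule, under the simplifying assumption that the second aggregation is an unweighted mean over the k partition rows, so its per-column scaling factor is 1/k. Names and types are illustrative only:

```rust
/// Hypothetical case-3 combination for the mean (not the branch's actual code):
/// take the maximum per-column sensitivity across the k partitions, then scale
/// it by the sensitivity of the second aggregation; for an unweighted mean over
/// the k partition rows that factor is assumed to be 1/k.
fn combined_mean_sensitivity(partition_sensitivities: &[Vec<f64>]) -> Vec<f64> {
    let k = partition_sensitivities.len() as f64;
    let d = partition_sensitivities.first().map_or(0, |s| s.len());
    (0..d)
        .map(|j| {
            let max_s = partition_sensitivities
                .iter()
                .map(|s| s[j])
                .fold(f64::NEG_INFINITY, f64::max);
            max_s * (1.0 / k) // sensitivity of the outer mean over k rows
        })
        .collect()
}
```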
These were implemented together because there is significant overlap in functionality. Say a user partitions their data and aggregates: this is the group-by case. Now the user aggregates again: this is the subsample-and-aggregate case, which is needed for unbiased privacy.