
Script to convert proto graphs to Beam

Open · Shoeboxam opened this issue 4 years ago · 6 comments

This could just be written in Python, although I would be interested in learning how the Beam communications layer works first.

Shoeboxam avatar May 14 '20 19:05 Shoeboxam

Is it correct to understand that the input to Apache Beam is the group of templates found in whitenoise-core/validator-rust/prototypes/components?

mikephelan avatar May 27 '20 14:05 mikephelan

Yes! Each json file needs to have an equivalent Beam representation.

For the system to be differentially private, there must be no pair of neighboring datasets (any two datasets that differ by one row) where the runtime succeeds on one and fails on the other.

If a runtime does not implement a component, then it fails on every dataset, which is great: it means runtimes (like Beam) do not need to implement every component.

You can also ignore components that have no concrete implementation, like min, max, dp_xxx, and to_xxx.
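
To make the dispatch concrete, here is a minimal sketch, not actual repo code: the registry shape, the clamp implementation, and its option names are illustrative assumptions. The point is that a missing component fails uniformly, before any data is read:

```python
import apache_beam as beam

def clamp(options, arguments, privacy_definition):
    # hypothetical clamp component: restrict each element of the
    # "data" PCollection to the [lower, upper] interval
    lower, upper = options["lower"], options["upper"]
    return arguments["data"] | "clamp" >> beam.Map(
        lambda x: min(max(x, lower), upper))

# one entry per JSON prototype that has a Beam equivalent
IMPLEMENTATIONS = {
    "clamp": clamp,
}

def get_implementation(component_name):
    if component_name not in IMPLEMENTATIONS:
        # raised at translation time, before any data is read, so the
        # failure is identical on every dataset and the
        # neighboring-datasets property above still holds
        raise NotImplementedError(component_name)
    return IMPLEMENTATIONS[component_name]
```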

Shoeboxam avatar May 27 '20 16:05 Shoeboxam

The overall signature is: given a computation graph, a privacy definition, and a release, return a release.

About the overall function (let's call it distribute_release), I can help with the portion that traverses the graph and calls into the lower-level functions (next comment).
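
As a hedged sketch of that traversal, assuming the graph arrives as a dict of node_id -> {"name", "arguments", "options"} (the real proto schema differs) and that topological_order is a hypothetical helper alongside the get_implementation registry sketched above:

```python
def distribute_release(graph, privacy_definition, release):
    # start from values already present in the prior release
    evaluations = dict(release)  # node_id -> PCollection or constant

    for node_id in topological_order(graph):  # hypothetical helper
        if node_id in evaluations:
            continue  # already released; reuse the prior value
        component = graph[node_id]
        implementation = get_implementation(component["name"])
        # parents are guaranteed evaluated by the topological ordering
        arguments = {name: evaluations[parent_id]
                     for name, parent_id in component["arguments"].items()}
        evaluations[node_id] = implementation(
            component["options"], arguments, privacy_definition)

    return evaluations
```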

Shoeboxam avatar May 27 '20 16:05 Shoeboxam

From my initial glance at Beam, I would expect each Beam component you implement to take in a PCollection for each argument and elementary Python types for each option.

Each component implementation will need arguments similar to those in the Rust runtime:

  • an options dict, equivalent to &self in Rust
  • an arguments dict of PCollections
  • a privacy definition

The privacy definition can generally be ignored for now. The only relevant information inside it is the set of user preferences to force constant time, constant memory, and so on, which we haven't matured enough to support yet.
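
As an illustration of that signature, here is a hedged sketch of a single component; the option name "value" and the argument name "data" are assumptions for the example, not the actual prototype fields:

```python
import apache_beam as beam

def impute(options, arguments, privacy_definition):
    # privacy_definition is accepted but unused for now, per the note above
    fill = options["value"]
    return (arguments["data"]
            | "impute" >> beam.Map(lambda x: fill if x is None else x))
```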

Let me know if you see issues with this proposed code structure!

Shoeboxam avatar May 27 '20 17:05 Shoeboxam

I'm going to pose an example based on the analysis notebook. In the sixth code block, the one whose first comment line reads:

attempt 4 - succeeds!

there is an example using dp_mean and dp_variance. In this case, would we be sending Apache Beam a DAG containing two nodes, one for dp_mean and one for dp_variance?

mikephelan avatar May 29 '20 15:05 mikephelan

More likely, you would translate

materialize -> cast -> clamp -> impute -> resize -> (mean, variance)

into a Beam pipeline. I'm happy to add custom data-loading components if Beam doesn't make the same assumptions materialize does.
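
To sketch what that hand-translated pipeline might look like (the file name, column index, clamp bounds, and the simplified handling of impute/resize are all illustrative assumptions):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    values = (
        pipeline
        # materialize: read the raw csv line by line
        | "materialize" >> beam.io.ReadFromText("data.csv", skip_header_lines=1)
        # cast: pull one column out of each row and parse it as a float
        | "cast" >> beam.Map(lambda line: float(line.split(",")[4]))
        # clamp: restrict each value to assumed bounds [0, 100]
        | "clamp" >> beam.Map(lambda x: min(max(x, 0.0), 100.0))
        # impute and resize are elided here; resize in particular needs
        # care in Beam because it changes the number of records
    )
    # plain (not yet private) aggregate; noise addition would follow
    mean = values | "mean" >> beam.combiners.Mean.Globally()
```

Variance has no built-in Beam combiner, so it would likely need a custom CombineFn.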

Shoeboxam avatar May 30 '20 14:05 Shoeboxam