Investigate possibility of PipelineDP API for Beam SQL

Open dvadym opened this issue 2 years ago • 0 comments

Context

PipelineDP supports anonymization with the Beam RDD API (example). It would be interesting to also support the Beam SQL API.

Goal

Investigate and design a Beam SQL API for PipelineDP.

Possible example of PipelineDP BeamSQL API:

private_sql = pipeline_dp.PrivateBeamSql(<DP parameters>)
result = private_sql.query("SELECT * ...")

Note: This task consists of researching possible options (both API and implementation design) and proposing something that is useful for users and reasonably simple to implement.

Additional information

On PipelineDP Architecture

DPEngine (code) is the class that implements the differential privacy (DP) logic independently of the pipeline framework (running on Apache Spark, on Apache Beam, and without any framework is currently supported).

DPEngine.aggregate() is the main method; it can perform any supported DP aggregation. It is basically equivalent to running the SQL query

SELECT dp_aggregate_function_1(value), ..., dp_aggregate_function_n(value)
FROM collection
GROUP BY partition_key
<with additional parameters required for DP>

where the supported dp_aggregate_function values are those in the metric list.
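To make the shape of that query concrete, here is a plain-Python sketch of the same group-by with no privacy at all. It does not call PipelineDP; the row layout, the function name aggregate_without_dp and the choice of COUNT and SUM are assumptions made for illustration. As far as I understand it, DPEngine.aggregate() would additionally bound each privacy id's contributions, add noise and select which partitions to release, all driven by the DP parameters.

```python
from collections import defaultdict

# Toy rows: (privacy_id, partition_key, value). Layout is illustrative only.
rows = [
    ("user1", "mon", 2.0),
    ("user2", "mon", 3.0),
    ("user1", "tue", 5.0),
]


def aggregate_without_dp(rows):
    """Non-private GROUP BY partition_key computing COUNT and SUM.

    This only mirrors the shape of the SQL query; it provides no
    differential privacy guarantee.
    """
    groups = defaultdict(list)
    for _privacy_id, partition_key, value in rows:
        groups[partition_key].append(value)
    return {
        key: {"count": len(values), "sum": sum(values)}
        for key, values in groups.items()
    }


result = aggregate_without_dp(rows)
```

The privacy id is ignored here, which is exactly what a DP implementation cannot do: it needs that column to limit how much any single user influences the output.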

On implementation

The implementation will likely consist of parsing the SQL and calling DPEngine.aggregate().
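A minimal sketch of the parsing half, assuming only queries of the form SELECT ... FROM table GROUP BY key are accepted. Everything here is hypothetical (ParsedQuery, parse_query, the SUPPORTED set); nothing in it is existing PipelineDP or Beam API, and a real implementation would probably rely on a proper SQL parser rather than regular expressions. The resulting object holds the pieces that would then have to be translated into arguments for DPEngine.aggregate().

```python
import re
from dataclasses import dataclass


@dataclass
class ParsedQuery:
    """Hypothetical container for the parts of a restricted DP query."""
    metrics: list        # e.g. ["COUNT", "SUM"]
    value_column: str
    table: str
    partition_key: str


# Only "SELECT <items> FROM <table> GROUP BY <key>" is recognized.
_QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
    r"\s+GROUP\s+BY\s+(?P<key>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_CALL_RE = re.compile(r"^(?P<func>\w+)\s*\(\s*(?P<col>\w+)\s*\)$")

# Assumed subset of aggregations; the real list is the PipelineDP metric list.
SUPPORTED = {"COUNT", "SUM", "MEAN"}


def parse_query(sql):
    """Parse a restricted aggregation query into a ParsedQuery."""
    match = _QUERY_RE.match(sql)
    if match is None:
        raise ValueError("unsupported query shape")
    key = match.group("key")
    metrics, columns = [], set()
    for item in match.group("select").split(","):
        item = item.strip()
        if item.lower() == key.lower():
            continue  # the partition key may appear in the SELECT list
        call = _CALL_RE.match(item)
        if call is None or call.group("func").upper() not in SUPPORTED:
            raise ValueError(f"unsupported select item: {item!r}")
        metrics.append(call.group("func").upper())
        columns.add(call.group("col"))
    if len(columns) != 1:
        raise ValueError("exactly one value column is expected")
    return ParsedQuery(metrics, columns.pop(), match.group("table"), key)


query = parse_query(
    "SELECT day, COUNT(spend), SUM(spend) FROM visits GROUP BY day")
```

Rejecting everything outside the restricted shape (for example SELECT *) keeps the mapping to a single aggregate call unambiguous, at the cost of supporting very little SQL.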

Open questions from Beam

  1. Is it possible to add additional operators or functions to Beam SQL, for example something like
 SELECT ...
 FROM ...
 DP_PARAMETERS = (...)
  2. Given a SQL string "SELECT ...", how can it be transformed into Python code that calls DPEngine.aggregate()?
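One possible fallback for the first question, if Beam SQL itself cannot be extended: strip a trailing DP_PARAMETERS clause from the string before any SQL parsing happens. The sketch below assumes a made-up syntax, DP_PARAMETERS = (name=value, ...), with numeric values only; split_dp_parameters and the parameter names are hypothetical. It does not answer whether Beam SQL supports custom clauses.

```python
import re

# Matches a trailing "DP_PARAMETERS = (...)" clause (assumed syntax).
_DP_CLAUSE_RE = re.compile(
    r"\s+DP_PARAMETERS\s*=\s*\((?P<body>[^()]*)\)\s*;?\s*$",
    re.IGNORECASE,
)


def split_dp_parameters(sql):
    """Return (sql_without_clause, params) for a query string.

    params maps lower-cased parameter names to floats; it is empty
    when the query has no DP_PARAMETERS clause.
    """
    match = _DP_CLAUSE_RE.search(sql)
    if match is None:
        return sql.strip(), {}
    params = {}
    for pair in match.group("body").split(","):
        if not pair.strip():
            continue
        name, _, raw_value = pair.partition("=")
        params[name.strip().lower()] = float(raw_value)
    return sql[:match.start()].strip(), params


plain_sql, dp_params = split_dp_parameters(
    "SELECT day, COUNT(spend) FROM visits GROUP BY day "
    "DP_PARAMETERS = (epsilon=1.0, delta=1e-6)"
)
```

With this split, the remaining string is ordinary SQL, so the custom clause never has to be understood by the SQL engine.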

dvadym, Jun 02 '22 08:06