PipelineDP
Investigate possibility of PipelineDP API for Beam SQL
Context
PipelineDP supports anonymization with the Beam PCollection API (example). It would be useful to support the Beam SQL API as well.
Goal
Investigate and design a Beam SQL API for PipelineDP.
A possible example of the PipelineDP Beam SQL API:
private_sql = pipeline_dp.PrivateBeamSql(<DP parameters>)
result = private_sql.query("SELECT * ...")
Note: This task consists of researching possible options (both API and implementation design) and proposing something that is useful for users and can be implemented reasonably simply.
Additional information
On PipelineDP Architecture
DPEngine
(code) A class that implements differentially private (DP) logic independently of the pipeline framework. It currently runs on Apache Spark, on Apache Beam, and without any framework.
DPEngine.aggregate() is the main method; it can perform any supported DP aggregation. It is roughly equivalent to running the SQL query
SELECT dp_aggregate_function_1(value), ..., dp_aggregate_function_n(value)
FROM collection
GROUP BY partition_key
<plus the additional parameters required for DP>
where the supported dp_aggregate_function values are taken from the metric list.
On implementation
The implementation will likely parse the SQL and call DPEngine.aggregate().
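As a rough illustration of the parsing half, here is a minimal sketch that uses only the Python standard library. It assumes a deliberately restricted query shape (SELECT aggregates FROM table GROUP BY key). The names ParsedQuery, SUPPORTED_FUNCTIONS and parse_query are hypothetical and not part of PipelineDP or Beam. A real implementation would more likely reuse an existing SQL parser and map the result onto the arguments of DPEngine.aggregate().

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedQuery:
    """Hypothetical container for the parts of a query needed for DP aggregation."""
    metrics: List[Tuple[str, str]]  # (function name, column) pairs
    table: str
    partition_key: str


# Assumption: an illustrative set of aggregate names, not PipelineDP's metric list.
SUPPORTED_FUNCTIONS = {"COUNT", "SUM", "MEAN"}

_QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
    r"\s+GROUP\s+BY\s+(?P<key>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_CALL_RE = re.compile(r"^(?P<func>\w+)\s*\(\s*(?P<column>\w+|\*)\s*\)$")


def parse_query(sql: str) -> ParsedQuery:
    """Parses 'SELECT f(col), ... FROM table GROUP BY key' into a ParsedQuery."""
    match = _QUERY_RE.match(sql)
    if match is None:
        raise ValueError(
            "Only SELECT <aggregates> FROM <table> GROUP BY <key> is supported")
    key = match.group("key")
    metrics = []
    for item in match.group("select").split(","):
        item = item.strip()
        if item.lower() == key.lower():
            continue  # The partition key itself may appear in the select list.
        call = _CALL_RE.match(item)
        if call is None or call.group("func").upper() not in SUPPORTED_FUNCTIONS:
            raise ValueError(f"Unsupported select item: {item!r}")
        metrics.append((call.group("func").upper(), call.group("column")))
    if not metrics:
        raise ValueError("At least one aggregate function is required")
    return ParsedQuery(metrics, match.group("table"), key)
```

The resulting ParsedQuery carries the information a caller would need to choose metrics and build partition and value extractors. How that maps onto the real DPEngine.aggregate() arguments is left open here.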
Open questions about Beam
- Is it possible to add additional operators or functions to Beam SQL, for example something like
SELECT ...
FROM ...
DP_PARAMETERS = (...)
- Given a SQL string "SELECT ...", how can it be transformed into Python code that calls DPEngine.aggregate()?
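If Beam SQL itself cannot be extended, one option worth evaluating is to strip a trailing DP_PARAMETERS clause in Python before the remaining query is parsed. The sketch below assumes a name=value syntax inside the parentheses, which the issue does not specify. The function split_dp_parameters and the parameter names epsilon and delta in the usage example are illustrative only.

```python
import re
from typing import Dict, Tuple

# Assumption: the clause is the last thing in the query and contains
# comma-separated name=value pairs with numeric values.
_DP_CLAUSE_RE = re.compile(
    r"\s+DP_PARAMETERS\s*=\s*\((?P<body>[^()]*)\)\s*;?\s*$",
    re.IGNORECASE,
)


def split_dp_parameters(sql: str) -> Tuple[str, Dict[str, float]]:
    """Splits a query into plain SQL and a dict of DP parameters."""
    match = _DP_CLAUSE_RE.search(sql)
    if match is None:
        raise ValueError("Query has no trailing DP_PARAMETERS = (...) clause")
    params = {}
    for pair in match.group("body").split(","):
        name, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"Expected name=value, got {pair!r}")
        params[name.strip().lower()] = float(value)
    # Everything before the clause is ordinary SQL for a regular parser.
    return sql[:match.start()].strip(), params
```

For example, split_dp_parameters("SELECT COUNT(visit_id) FROM visits GROUP BY day DP_PARAMETERS = (epsilon=1.0, delta=1e-6)") returns the query without the clause together with a dict holding the two values. This keeps the custom syntax out of Beam SQL, at the cost of a query string that is no longer valid standard SQL.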