sqlflow icon indicating copy to clipboard operation
sqlflow copied to clipboard

[Proposal] Thoughts about the sqlflow optimizer

Open typhoonzero opened this issue 4 years ago • 0 comments

Generally, as a compiler, SQLFlow is performing the below steps to compile the SQL program to a workflow, the workflow is usually a .yaml file:

parse -> AST -> semantic analyze -> IR -> optimize -> IR -> code generation -> workflow

The SQLFlow "optimizer" will do below works:

  1. verify the data schema with the IR
  2. run feature derivation to auto derive features for training
  3. generate a dependency workflow

This is quite the same as how Hive and other SQL compilers do: https://cwiki.apache.org/confluence/display/Hive/Design. The Hive compiler will do the following steps:

parse -> AST -> semantic analyze -> logical plan -> optimize -> logical plan -> query plan generator (physical plan generator) -> physical plan

Yet these steps in Hive are used to generate "physical plan" for one SQL statement. Take below SQL program as an example:

CREATE TABLE pre_processed AS
SELECT COALESCE(col1, 0), ...., RAND() as rand_col FROM raw_table;

CREATE TABLE train AS
SELECT col1, ...., label FROM pre_processed WHERE rand_col < 0.8;

CREATE TABLE val AS
SELECT col1, ..., label FROM pre_processed WHERE rand_col > 0.8;

SELECT * FROM train
TO TRAIN DNNClassifier
LABEL 'label'
INTO model_1;

SELECT * FROM val
TO EVALUATE model_1
LABEL 'label';

Then the optimizer can generate the IR with dependency like:

image

Then the workflow will be executed like, each block is one executable "physical plan".

image

We can see that we actually don't need to create those temporary tables, and optimize the "logical plan" and generate a workflow to achieve better performance:

image

In this workflow, only two tables (train, val) will be created, which should have better performance than the original workflow.

So we can do more in the optimizer to analyze the "logical plan" generated from ASTs of the original SQL program, do some optimizations like merging operators to generate this workflow.

typhoonzero avatar Jul 13 '20 02:07 typhoonzero