Rethinking SQLFlow IR
The SQLFlow compiler generates an Argo workflow, which is a YAML file.
Each workflow step calls the SQLFlow runtime library to submit an AI job or a SQL job.
However, the SQLFlow IR structure in the current code base only represents the parsed result of a single SQL statement, as follows:
type TrainStmt struct {
    OriginalSQL      string                     // the full TO TRAIN statement
    Select           string                     // SELECT that produces the training data
    ValidationSelect string                     // SELECT that produces the validation data
    ModelImage       string                     // Docker image containing the model definition
    Estimator        string                     // estimator name, e.g. DNNClassifier
    Attributes       map[string]interface{}     // attributes from the WITH clause
    Features         map[string][]FeatureColumn // feature columns from the COLUMN clause
    Label            FeatureColumn              // the label column
    PreTrainedModel  string
    Into             string                     // model name from the INTO clause
    TmpTrainTable    string
    TmpValidateTable string
}
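For illustration, here is a hedged sketch of what the parser might put into this struct for the TO TRAIN statement used later in this document; the field values and the omitted feature/label fields are assumptions, not actual SQLFlow output:

stmt := &TrainStmt{
    OriginalSQL: `SELECT * FROM tbl_train TO TRAIN DNNClassifier WITH n_classes=3, hidden_units=[128, 56] INTO my_model;`,
    Select:      "SELECT * FROM tbl_train",
    Estimator:   "DNNClassifier",
    Attributes: map[string]interface{}{ // parsed WITH clause
        "n_classes":    3,
        "hidden_units": []int{128, 56},
    },
    Into: "my_model",
}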
The following is a rough sketch of redesigning the SQLFlow IR in an SSA-like format that represents an execution plan.
The source code of a SQLFlow program:
CREATE TABLE tbl_train AS SELECT ...;

SELECT * FROM tbl_train;

SELECT * FROM tbl_train
TO TRAIN DNNClassifier
WITH
    n_classes=3,
    hidden_units=[128, 56]
INTO my_model;
The parser parses the SQL program into an AST, which is then converted into the initial SSA-format IR as follows:
step() #0 {
    %tbl = call <exec> "CREATE TABLE ..."
    ret %tbl
}

step(%input table) #1 {
    call <query> "SELECT * FROM %input"
}

step(%input table) #2 {
    # separate a training job into 3 sub-steps:
    # 1. feature derivation to generate feature metas.
    # 2. TensorFlow training.
    # 3. saving the trained model.
    %model = call <train> %input
    ret %model
}

step(%model model) #3 {
    %result = call <prediction> %model
    ret %result
}

main() {
    %tbl = call step_0()           # execute query
    call step_1(%tbl)              # select clause
    %model = call step_2(%tbl)     # train a model
    %result = call step_3(%model)  # prediction
}
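Before going further, here is a minimal Go sketch of how such an SSA-like IR could be represented in the compiler. All type and field names below are hypothetical; nothing like this exists in the code base yet:

package ir

// Value is an SSA value such as %tbl or %model.
type Value struct {
    Name string // e.g. "tbl"
    Type string // e.g. "table" or "model"
}

// Call is one operator invocation, e.g. %tbl = call <exec> "CREATE TABLE ...".
type Call struct {
    Op     string   // operator name, e.g. "exec", "query", "train"
    Args   []string // SQL text and/or %value references
    Result *Value   // nil when the call produces no value
}

// Step is one workflow step; it compiles into one Argo step spec.
type Step struct {
    ID     int
    Params []Value // step inputs, e.g. (%input table)
    Calls  []Call
    Ret    *Value // the returned value, if any
}

// Program is a whole SQLFlow program: the steps plus a main/workflow
// function that wires step outputs to step inputs.
type Program struct {
    Steps []Step
    Main  []Call // ordered calls to step_N
}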
The semantic analyzer pass analyzes each high-level abstract operator and lowers it into low-level operators:
step() #0 {
    %tbl = call <exec> "CREATE TABLE ..."
    ret %tbl
}

step(%input table) #1 {
    call <query> "SELECT * FROM %input"
}

step(%input table) #2 {
    # separate a training job into 3 sub-steps:
    # 1. feature derivation to generate feature metas.
    # 2. TensorFlow training.
    # 3. saving the trained model.
    %features = call <feature_derivation> %input
    %trained = call <tensorflow_train> %features, %input
    %model = call <save_model> %trained
    ret %model
}

step(%saved_model model) #3 {
    %model = call <load_model> %saved_model
    %result = call <prediction> %model
    ret %result
}

main() {
    %tbl = call step_0()           # execute query
    call step_1(%tbl)              # select clause
    %model = call step_2(%tbl)     # train a model
    %result = call step_3(%model)  # prediction
}
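As a sketch of how this lowering could work, assuming the hypothetical Call and Value types above, the semantic analyzer might rewrite a high-level <train> call like this:

// lowerTrain expands a high-level <train> call into the three
// low-level operators of step #2. This is a sketch, not real code.
func lowerTrain(train Call) []Call {
    input := train.Args[0] // the %input table reference
    features := Call{
        Op:     "feature_derivation",
        Args:   []string{input},
        Result: &Value{Name: "features", Type: "table"},
    }
    trained := Call{
        Op:     "tensorflow_train",
        Args:   []string{"%features", input},
        Result: &Value{Name: "trained", Type: "model"},
    }
    saved := Call{
        Op:     "save_model",
        Args:   []string{"%trained"},
        Result: train.Result, // the original %model result
    }
    return []Call{features, trained, saved}
}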
The instruction selection pass generates platform-specific operators:
step() #0 {
    %tbl = call <runtime.maxcompute.exec> "CREATE TABLE ..."
    ret %tbl
}

step(%input table) #1 {
    call <runtime.maxcompute.query> "SELECT * FROM %input"
}

step(%input table) #2 {
    # separate a training job into 3 sub-steps:
    # 1. feature derivation to generate feature metas.
    # 2. TensorFlow training.
    # 3. saving the trained model.
    %features = call <runtime.pai.feature_derivation> %input
    %trained = call <runtime.pai.tensorflow_train> %features, %input
    %model = call <runtime.pai.save_model> %trained
    ret %model
}

step(%saved_model model) #3 {
    %model = call <runtime.pai.load_model> %saved_model
    %result = call <runtime.pai.prediction> %model
    ret %result
}

workflow() {
    %tbl = call step_0()           # execute query
    call step_1(%tbl)              # select clause
    %model = call step_2(%tbl)     # train a model
    %result = call step_3(%model)  # prediction
}
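The pass itself could be little more than a lookup table from a (platform, generic operator) pair to a fully qualified runtime operator. A sketch, with hypothetical names:

// opTable maps a platform and a generic operator to the
// platform-specific runtime operator. The contents are hypothetical.
var opTable = map[string]map[string]string{
    "maxcompute": {
        "exec":  "runtime.maxcompute.exec",
        "query": "runtime.maxcompute.query",
    },
    "pai": {
        "feature_derivation": "runtime.pai.feature_derivation",
        "tensorflow_train":   "runtime.pai.tensorflow_train",
        "save_model":         "runtime.pai.save_model",
        "load_model":         "runtime.pai.load_model",
        "prediction":         "runtime.pai.prediction",
    },
}

// selectOp rewrites one call in place for the target platform.
func selectOp(platform string, c *Call) {
    if op, ok := opTable[platform][c.Op]; ok {
        c.Op = op
    }
}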
Finally, the SQLFlow code generator generates a YAML file. Each step block in the IR fills a step spec in the YAML, and each operator maps to a piece of runtime caller code:
steps:
  - name: step0
    command: ["python", "-c"]
    args: |
      import runtime
      runtime.maxcompute.exec("CREATE TABLE ...")
  - name: step1
    command: ["python", "-c"]
    args: |
      import runtime
      runtime.maxcompute.query("SELECT ...")
  - name: step2
    command: ["python", "-c"]
    args: |
      import runtime
      features = runtime.pai.feature_derivation(...)
      trained = runtime.pai.tensorflow_train(features, ...)
      runtime.pai.save_model(trained)
  - name: step3
    command: ["python", "-c"]
    args: |
      import runtime
      model = runtime.pai.load_model(...)
      runtime.pai.prediction(model)
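A minimal sketch of this last stage, again assuming the hypothetical Step and Call types from above: walk the IR steps and print one Argo step spec per step, with one line of runtime caller code per operator.

// genYAML emits the steps section of the Argo workflow YAML.
// It uses only the standard "fmt" and "strings" packages.
func genYAML(steps []Step) string {
    var b strings.Builder
    b.WriteString("steps:\n")
    for _, s := range steps {
        fmt.Fprintf(&b, "  - name: step%d\n", s.ID)
        b.WriteString("    command: [\"python\", \"-c\"]\n")
        b.WriteString("    args: |\n")
        b.WriteString("      import runtime\n")
        for _, c := range s.Calls {
            // Each IR operator maps to one runtime call.
            fmt.Fprintf(&b, "      %s(%s)\n", c.Op, strings.Join(c.Args, ", "))
        }
    }
    return b.String()
}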
Generally, SSA is generated from an AST, not from an IR, and the AST should have variables and assignment statements. Yet the SQL language is typically used without variables or assignments.
Assignment statements are not necessary for generating LLVM IR; there is already some exploration of generating LLVM IR from SQL: https://github.com/Legacy25/ValkyrieDB and https://medium.com/grandata-engineering/introducing-snel-a-columnar-just-in-time-query-compiler-for-fast-on-line-analytics-9cf561f82526
Of course, we have to define an instruction set carefully.
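For example, a first cut of such an instruction set might be no more than a handful of opcodes; the list below is hypothetical:

// A hypothetical first cut of the instruction set. Each opcode
// names one runtime capability that a workflow step can call.
const (
    OpExec              = "exec"               // run a SQL statement for its side effect
    OpQuery             = "query"              // run a SELECT and show its result
    OpFeatureDerivation = "feature_derivation" // derive feature metas from a table
    OpTensorFlowTrain   = "tensorflow_train"   // train a model on derived features
    OpSaveModel         = "save_model"         // persist the trained model
    OpLoadModel         = "load_model"         // load a saved model
    OpPrediction        = "prediction"         // run batch prediction
)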
I am confused about why we need an SSA IR. It seems that the proposal has little to do with SSA. Or should we find a better name to describe the proposed IR (an instruction-level IR, or something else) instead of SSA IR?