Rethinking SQLFlow IR
The SQLFlow compiler generates an Argo workflow, which is a YAML file.
Each workflow step calls the SQLFlow runtime library to submit an AI job or a SQL job.
However, the SQLFlow IR structure in the current code base only represents the parsed result of a single SQL statement, as follows:
type TrainStmt struct {
    OriginalSQL      string                     // the full TO TRAIN statement
    Select           string                     // SELECT that produces the training data
    ValidationSelect string                     // SELECT that produces the validation data
    ModelImage       string                     // Docker image containing the model definition
    Estimator        string                     // estimator name, e.g. DNNClassifier
    Attributes       map[string]interface{}     // attributes from the WITH clause
    Features         map[string][]FeatureColumn // feature columns from the COLUMN clause
    Label            FeatureColumn              // the label column
    PreTrainedModel  string
    Into             string                     // model name from the INTO clause
    TmpTrainTable    string
    TmpValidateTable string
}
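For illustration, here is a hedged sketch of what the parser might put into this struct for the TO TRAIN statement used later in this document; the field values and the omitted feature/label fields are assumptions, not actual SQLFlow output:

stmt := &TrainStmt{
    OriginalSQL: `SELECT * FROM tbl_train TO TRAIN DNNClassifier WITH n_classes=3, hidden_units=[128, 56] INTO my_model;`,
    Select:      "SELECT * FROM tbl_train",
    Estimator:   "DNNClassifier",
    Attributes: map[string]interface{}{ // parsed WITH clause
        "n_classes":    3,
        "hidden_units": []int{128, 56},
    },
    Into: "my_model",
}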
The following is a rough sketch of redesigning the SQLFlow IR in an SSA-like format that represents an execution plan.
The source code of a SQLFlow program:
CREATE TABLE tbl_train AS SELECT ...;

SELECT * FROM tbl_train;

SELECT * FROM tbl_train
TO TRAIN DNNClassifier
WITH
    n_classes=3,
    hidden_units=[128, 56]
INTO my_model;
The parser parses the SQL program into an AST, which is then converted into the initial SSA-format IR as follows:
step() #0 {
    %tbl = call <exec> "CREATE TABLE ..."
    ret %tbl
}

step(%input table) #1 {
    call <query> "SELECT * FROM %input"
}

step(%input table) #2 {
    # separate a training job into 3 sub-steps:
    # 1. feature derivation to generate feature metas.
    # 2. TensorFlow training.
    # 3. saving the trained model.
    %model = call <train> %input
    ret %model
}

step(%model model) #3 {
    %result = call <prediction> %model
    ret %result
}

main() {
    %tbl = call step_0()           # execute query
    call step_1(%tbl)              # select clause
    %model = call step_2(%tbl)     # train a model
    %result = call step_3(%model)  # prediction
}
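Before going further, here is a minimal Go sketch of how such an SSA-like IR could be represented in the compiler. All type and field names below are hypothetical; nothing like this exists in the code base yet:

package ir

// Value is an SSA value such as %tbl or %model.
type Value struct {
    Name string // e.g. "tbl"
    Type string // e.g. "table" or "model"
}

// Call is one operator invocation, e.g. %tbl = call <exec> "CREATE TABLE ...".
type Call struct {
    Op     string   // operator name, e.g. "exec", "query", "train"
    Args   []string // SQL text and/or %value references
    Result *Value   // nil when the call produces no value
}

// Step is one workflow step; it compiles into one Argo step spec.
type Step struct {
    ID     int
    Params []Value // step inputs, e.g. (%input table)
    Calls  []Call
    Ret    *Value // the returned value, if any
}

// Program is a whole SQLFlow program: the steps plus a main/workflow
// function that wires step outputs to step inputs.
type Program struct {
    Steps []Step
    Main  []Call // ordered calls to step_N
}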
The semantic analyzer pass analyzes each high-level abstract operator and lowers it into low-level operators:
step() #0 {
    %tbl = call <exec> "CREATE TABLE ..."
    ret %tbl
}

step(%input table) #1 {
    call <query> "SELECT * FROM %input"
}

step(%input table) #2 {
    # separate a training job into 3 sub-steps:
    # 1. feature derivation to generate feature metas.
    # 2. TensorFlow training.
    # 3. saving the trained model.
    %features = call <feature_derivation> %input
    %trained = call <tensorflow_train> %features, %input
    %model = call <save_model> %trained
    ret %model
}

step(%saved_model model) #3 {
    %model = call <load_model> %saved_model
    %result = call <prediction> %model
    ret %result
}

main() {
    %tbl = call step_0()           # execute query
    call step_1(%tbl)              # select clause
    %model = call step_2(%tbl)     # train a model
    %result = call step_3(%model)  # prediction
}
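As a sketch of how this lowering could work, assuming the hypothetical Call and Value types above, the semantic analyzer might rewrite a high-level <train> call like this:

// lowerTrain expands a high-level <train> call into the three
// low-level operators of step #2. This is a sketch, not real code.
func lowerTrain(train Call) []Call {
    input := train.Args[0] // the %input table reference
    features := Call{
        Op:     "feature_derivation",
        Args:   []string{input},
        Result: &Value{Name: "features", Type: "table"},
    }
    trained := Call{
        Op:     "tensorflow_train",
        Args:   []string{"%features", input},
        Result: &Value{Name: "trained", Type: "model"},
    }
    saved := Call{
        Op:     "save_model",
        Args:   []string{"%trained"},
        Result: train.Result, // the original %model result
    }
    return []Call{features, trained, saved}
}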
The instruction selection pass generates platform-specific operators:
step() #0 {
    %tbl = call <runtime.maxcompute.exec> "CREATE TABLE ..."
    ret %tbl
}

step(%input table) #1 {
    call <runtime.maxcompute.query> "SELECT * FROM %input"
}

step(%input table) #2 {
    # separate a training job into 3 sub-steps:
    # 1. feature derivation to generate feature metas.
    # 2. TensorFlow training.
    # 3. saving the trained model.
    %features = call <runtime.pai.feature_derivation> %input
    %trained = call <runtime.pai.tensorflow_train> %features, %input
    %model = call <runtime.pai.save_model> %trained
    ret %model
}

step(%saved_model model) #3 {
    %model = call <runtime.pai.load_model> %saved_model
    %result = call <runtime.pai.prediction> %model
    ret %result
}

workflow() {
    %tbl = call step_0()           # execute query
    call step_1(%tbl)              # select clause
    %model = call step_2(%tbl)     # train a model
    %result = call step_3(%model)  # prediction
}
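The pass itself could be little more than a lookup table from a (platform, generic operator) pair to a fully qualified runtime operator. A sketch, with hypothetical names:

// opTable maps a platform and a generic operator to the
// platform-specific runtime operator. The contents are hypothetical.
var opTable = map[string]map[string]string{
    "maxcompute": {
        "exec":  "runtime.maxcompute.exec",
        "query": "runtime.maxcompute.query",
    },
    "pai": {
        "feature_derivation": "runtime.pai.feature_derivation",
        "tensorflow_train":   "runtime.pai.tensorflow_train",
        "save_model":         "runtime.pai.save_model",
        "load_model":         "runtime.pai.load_model",
        "prediction":         "runtime.pai.prediction",
    },
}

// selectOp rewrites one call in place for the target platform.
func selectOp(platform string, c *Call) {
    if op, ok := opTable[platform][c.Op]; ok {
        c.Op = op
    }
}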
Finally, the SQLFlow code generator generates a YAML file. Each step block in the IR fills a step spec in the YAML, and each operator maps to a piece of runtime caller code:
steps:
  - name: step0
    command: ["python", "-c"]
    args: |
      import runtime
      runtime.maxcompute.exec("CREATE TABLE ...")
  - name: step1
    command: ["python", "-c"]
    args: |
      import runtime
      runtime.maxcompute.query("SELECT ...")
  - name: step2
    command: ["python", "-c"]
    args: |
      import runtime
      features = runtime.pai.feature_derivation(...)
      trained = runtime.pai.tensorflow_train(features, ...)
      runtime.pai.save_model(trained)
  - name: step3
    command: ["python", "-c"]
    args: |
      import runtime
      model = runtime.pai.load_model(...)
      runtime.pai.prediction(model)
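A minimal sketch of this last stage, again assuming the hypothetical Step and Call types from above: walk the IR steps and print one Argo step spec per step, with one line of runtime caller code per operator.

// genYAML emits the steps section of the Argo workflow YAML.
// It uses only the standard "fmt" and "strings" packages.
func genYAML(steps []Step) string {
    var b strings.Builder
    b.WriteString("steps:\n")
    for _, s := range steps {
        fmt.Fprintf(&b, "  - name: step%d\n", s.ID)
        b.WriteString("    command: [\"python\", \"-c\"]\n")
        b.WriteString("    args: |\n")
        b.WriteString("      import runtime\n")
        for _, c := range s.Calls {
            // Each IR operator maps to one runtime call.
            fmt.Fprintf(&b, "      %s(%s)\n", c.Op, strings.Join(c.Args, ", "))
        }
    }
    return b.String()
}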
Generally, SSA is generated from an AST, not from an IR, and the AST should have variables and assignment statements. Yet the SQL language is typically used without variables or assignments.
Assignment statements are not necessary for generating LLVM IR; there is already some exploration of generating LLVM IR from SQL: https://github.com/Legacy25/ValkyrieDB and https://medium.com/grandata-engineering/introducing-snel-a-columnar-just-in-time-query-compiler-for-fast-on-line-analytics-9cf561f82526
Of course, we have to define an instruction set carefully.
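For example, a first cut of such an instruction set might be no more than a handful of opcodes; the list below is hypothetical:

// A hypothetical first cut of the instruction set. Each opcode
// names one runtime capability that a workflow step can call.
const (
    OpExec              = "exec"               // run a SQL statement for its side effect
    OpQuery             = "query"              // run a SELECT and show its result
    OpFeatureDerivation = "feature_derivation" // derive feature metas from a table
    OpTensorFlowTrain   = "tensorflow_train"   // train a model on derived features
    OpSaveModel         = "save_model"         // persist the trained model
    OpLoadModel         = "load_model"         // load a saved model
    OpPrediction        = "prediction"         // run batch prediction
)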
I am confused about why we need an SSA IR. It seems that the proposal has little to do with SSA. Or should we find a better name to describe the proposed IR (an instruction-level IR, or something else) instead of SSA IR?