Rethinking the code structure
Per discussions in the video meeting with @typhoonzero and @shendiaomo.
Two Components in SQLFlow
Compiler
- Frontend: the parser package parses SQL statement(s) and generates IR(s).
- Semantic Analysis (runtime): includes feature derivation, the verifier, the attribute filler and checker, and model reloading.
- Optimizer (static): analyzes the dependencies among SQL statements and generates a parallel execution graph.
- Backend: various code generators, which produce a YAML (Argo workflow) file, an AI program (TF/XGBoost), or an optimization (mathematical programming) program (optflow).
Interpreter
The SQLFlow compiler generates a two-layer graph, and two kinds of interpreters execute the layers:
- First graph (Argo workflow): the Argo controller is the interpreter.
- Second graph (AI program): the Python/PAI command-line/EDL command-line is the interpreter.
The Desired Code Structure
/pkg
/interpreter(executor)
/graph(Argo)
/node(python/pai/alisa)
/compiler
/parser
/semantics analysis(runtime)
/feature_derivation
/verifier
/model_reload
/attribute filler && checker
/optimizer(static)
/parallel graph
/backend(codegen)
Incomplete Thoughts About the Final Design
Shortcomings of the Current System
- The workflow graph loses much detailed information. We hope SQLFlow can generate a more detailed graph. For example, if the graph could describe a group of TensorFlow ops running on CPU/GPU, we could optimize the throughput of the AI pipeline.
- The workflow cannot achieve the best throughput. A streaming graph could do better. For example, we could implement a custom TensorFlow op that reads the data produced by the SELECT clause as a stream, instead of creating a temporary table (see the sketch below).
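For illustration, here is a hedged sketch of the streaming idea in Python: wrap a SELECT cursor in a generator and feed it to tf.data instead of materializing a temporary table. The sqlite3 driver, function names, and TF 2.4+ `output_signature` argument are assumptions for this sketch; the real goal would likely be a custom C++ TensorFlow op.

```python
# A minimal sketch, NOT existing SQLFlow code: stream the rows of a
# SELECT clause into training instead of writing a temporary table.
# The database driver (sqlite3 here) is only for illustration.
import sqlite3

import tensorflow as tf


def select_rows(dsn, select_stmt, batch_size=256):
    """Yield rows of `select_stmt` lazily, so no temporary table is needed."""
    conn = sqlite3.connect(dsn)
    try:
        cursor = conn.execute(select_stmt)
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield row
    finally:
        conn.close()


def dataset_from_select(dsn, select_stmt, output_signature):
    """Wrap the streaming reader as a tf.data.Dataset for training."""
    return tf.data.Dataset.from_generator(
        lambda: select_rows(dsn, select_stmt),
        output_signature=output_signature,
    )
```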
Code Structure
pkg/
/interpreter
/argo(graph executor)
/node(subgraph executor)
/semantics analysis(runtime?JIT?)
/feature_derivation
/verifier
/model
/attribute filler && checker
/graph
/compiler
/parser
/optimizer(static)
/parallel graph
/backend(graph)
Things that need to be discussed next:
- What program each graph node executes.
- A flattened graph or a two-layer graph.
Is SQLFlow a compiler or an interpreter? It doesn't make sense for it to be both.
We don't have, and probably will never have, two layers of graphs. We don't have any graph in our system. The top level is a workflow represented by YAML. A workflow is not a graph: an Argo/Tekton workflow can have conditionals, loops, and even function definitions and function calls, whereas a graph cannot have these. The lower level is a Python program, which is not a graph either: a Python program can have all kinds of control flow, but a graph cannot.
It is disappointing to see our team members still sticking to the simplistic idea of a "graph", which early versions of TensorFlow used as a very unprofessional form of IR, especially those who experienced PaddlePaddle, which tried so hard to propose an IR much more powerful than a graph. All along, innovators like Chris Lattner have been introducing professional forms of IR into TensorFlow, but sadly people cannot see those efforts.
The Current Structure
After several structure-adjustment PRs (#2481 #2484 #2491 #2500 #2502 #2505 ), the current package structure has become:
pkg
├── attribute # semantics
├── codegen # codegen
├── database # basic lib
├── executor # step execution
├── ir # intermediate representation
├── log # basic lib
├── model # basic lib
├── modelzooserver # server
├── parser # syntax
├── pipe # basic lib
├── proto # protobuf definitions
├── sql # step execution
├── sqlflowserver # server
├── sqlfs # basic lib
├── step # step execution
├── tar # basic lib
├── test # basic lib
├── verifier # semantics
└── workflow # workflow execution
The Proposed Structure
There're still several problems:
- We can restructure the packages according to their functionalities as standard components of a compiler, for example: put `attribute` and `verifier` in a `semantics` package, put all basic libraries in a `basic` package, and put `sqlflowserver` and `modelzooserver` in a `server` package.
- The `executor` generates code for step execution and executes the code subsequently. We should decouple the code generation phase and the execution phase, and put the decoupled code in `codegen` and `step` respectively. Similarly, because the `sql` package calls `executor` for step execution, the files in `sql` should be put in `step`. After this stage, the package structure should be:
pkg
├── basic
│   ├── database
│   ├── log
│   ├── model
│   ├── pipe
│   ├── sqlfs
│   ├── tar
│   └── test
├── codegen
│   ├── alisa.go
│   ├── pai.go
│   ├── ...
│   └── couler
├── ir
├── parser
├── proto
├── semantics
│   ├── attribute
│   └── verifier
├── server
│   ├── modelzooserver
│   └── sqlflowserver
└── execution
    ├── step
    │   └── executor.go
    └── workflow
- We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the second pass happens during step execution, which uses `step -e` to generate and execute the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the only pass generates the YAML and all the scripts, and the scripts are placed in a directory to be used as Argo input artifacts (a sketch follows below). After this phase, we don't need `pkg/step` and `cmd/step` anymore.
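For concreteness, a minimal sketch of the one-pass idea. The placeholder generators (`gen_step_script`, `gen_workflow_yaml`) are hypothetical stand-ins for SQLFlow's real code generators; the point is only that one pass emits the YAML and every step script into a single directory that could be handed to Argo as input artifacts.

```python
# A sketch of one-pass output: emit workflow.yaml and all step scripts
# into a single directory to be used as Argo input artifacts.
# The codegen bodies below are placeholders, not SQLFlow's generators.
import os


def gen_step_script(sql_stmt):
    # Placeholder: the real generator would emit a full submitter program.
    return "print(%r)  # execute: %s\n" % (sql_stmt, sql_stmt)


def gen_workflow_yaml(scripts):
    # Placeholder: the real generator would emit an Argo workflow spec.
    steps = "\n".join("- script: %s" % os.path.basename(s) for s in scripts)
    return "steps:\n%s\n" % steps


def compile_sql_program(statements, out_dir):
    """One pass: write step_<i>.py for each statement plus workflow.yaml."""
    os.makedirs(out_dir, exist_ok=True)
    scripts = []
    for i, stmt in enumerate(statements):
        path = os.path.join(out_dir, "step_%d.py" % i)
        with open(path, "w") as f:
            f.write(gen_step_script(stmt))
        scripts.append(path)
    with open(os.path.join(out_dir, "workflow.yaml"), "w") as f:
        f.write(gen_workflow_yaml(scripts))
    return out_dir
```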
I agree that there is a two-layer architecture in the current code base, and that makes SQLFlow unclear.
- The 1st layer: SQLFlow translates a SQL program into a workflow, which is a YAML file, and the Argo controller is the executor that runs this workflow. -- SQLFlow is a compiler.
- The 2nd layer: each workflow step executes a SQL statement using the SQLFlow step command-line, which translates the SQL statement into a Python script and executes it. -- the SQLFlow step command-line is much like an interpreter.
To make it clearer, I think we can keep the two-layer architecture while making SQLFlow a pure compiler.
- The 1st layer: SQLFlow generates a workflow, and each workflow step includes an entry-point program, which is a Python program.
- The 2nd layer: each workflow step executes this Python script using the Python interpreter.
After this phase, we don't need the pkg/step and cmd/step anymore.
So we don't need the pkg/execution/step folder?
3. We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the second pass happens during step execution, which uses `step -e` to generate and execute the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the only pass generates the YAML and all the scripts, and the scripts are placed in a directory to be used as Argo input artifacts.
In the current architecture, we always run `step -e {sql_statement}` in each step. This brings the limitation that one SQL statement is mapped to exactly one step. Also, the step binary parses the SQL and builds the IR again inside the step image, which duplicates work.
In the future, one SQL statement could be translated into several steps. So it would be better if, after translating the SQL program into a workflow, we could see explicitly what each step executes, such as data analysis, data exploration, or model training, instead of executing a general command `step -e` (see the sketch below).
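As a rough illustration of mapping one statement to several named steps, the sketch below returns per-step commands; the step names and `runtime.*` module paths are made up for illustration, not an existing SQLFlow layout.

```python
# A hedged sketch of splitting one training statement into several
# named steps instead of one generic `step -e`.
def plan_steps(sql_stmt):
    """Return (step_name, command) pairs a compiler could emit for one statement."""
    return [
        ("data-analysis", ["python", "-m", "runtime.analyze", sql_stmt]),
        ("data-exploration", ["python", "-m", "runtime.explore", sql_stmt]),
        ("model-training", ["python", "-m", "runtime.train", sql_stmt]),
    ]


if __name__ == "__main__":
    for name, cmd in plan_steps("SELECT * FROM iris.train ..."):
        print(name, cmd)
```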
After this phase, we don't need the pkg/step and cmd/step anymore.
So we don't need the pkg/execution/step folder?
No, we don't. We only have to move something like `table_writer` to the `basic` package.
I agree that there is a two-layer architecture in the current code base, and that makes SQLFlow unclear.
- The 1st layer: SQLFlow translates a SQL program into a workflow, which is a YAML file, and the Argo controller is the executor that runs this workflow. -- SQLFlow is a compiler.
- The 2nd layer: each workflow step executes a SQL statement using the SQLFlow step command-line, which translates the SQL statement into a Python script and executes it. -- the SQLFlow step command-line is much like an interpreter.
To make it clearer, I think we can keep the two-layer architecture while making SQLFlow a pure compiler.
- The 1st layer: SQLFlow generates a workflow, and each workflow step includes an entry-point program, which is a Python program.
- The 2nd layer: each workflow step executes this Python script using the Python interpreter.
In a discussion with @Yancey1989, we found that we still have to implement a feature derivation mechanism in Python, like the previous migration prototype, to make the proposed structure work.
The problem is that:
- The feature derivation mechanism must run in step execution.
- The codegen package in the current architecture depends heavily on feature derivation to generate python code.
As a result, we have to first generate a .yaml in sqlflowserver and then generate the .py files in the step binary.
As a result, we have to first generate a .yaml in sqlflowserver and then generate the .py files in the step binary.
Since SQLFlow is a compiler, which doesn't care about execution, it seems that we should have a command-line compiler. What would be a good name for the compiler binary?
As a result, we have to first generate a .yaml in sqlflowserver and then generate the .py files in the step binary.
I think we can generate a .yaml file and, for each step, call the tensorflow/xgboost/pai code generator to generate a submitter entry-point program, which is a .py script. The following .yaml snippet is a very simple example:
steps:
- name: step-1
  args: ["python", "-c"]
  command: |
    from sqlflow.runtime import tensorflow
    tensorflow.train(....)
That `tensorflow.train` calls feature derivation and the verifier, and then trains the TensorFlow model.
We can also separate feature derivation and the verifier into separate steps to decouple the workflow step logic.
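For concreteness, here is a hedged sketch of what such a runtime `train` entry point might do internally: derive features, verify the data, then train. The helper functions, argument names, and the `DNNClassifier` choice are assumptions for illustration, not the existing SQLFlow runtime.

```python
# A sketch of the proposed runtime train entry point, NOT existing
# SQLFlow code: derive features, verify the data, then train a model.
import tensorflow as tf


def derive_features(column_names):
    # Placeholder derivation: a real implementation would inspect sampled
    # rows to infer numeric vs. categorical columns.
    return [tf.feature_column.numeric_column(name) for name in column_names]


def verify_schema(rows, column_names):
    # Placeholder verifier: every row must have one value per column.
    for row in rows:
        assert len(row) == len(column_names), "row/column count mismatch"


def train(rows, column_names, label_name, model_dir="/tmp/sqlflow_model"):
    """Derive features, verify the data, then train a TensorFlow estimator."""
    verify_schema(rows, column_names)
    feature_names = [c for c in column_names if c != label_name]
    feature_columns = derive_features(feature_names)
    estimator = tf.estimator.DNNClassifier(
        hidden_units=[16, 8],
        feature_columns=feature_columns,
        n_classes=3,
        model_dir=model_dir)

    def input_fn():
        label_idx = column_names.index(label_name)
        features = {
            name: [row[i] for row in rows]
            for i, name in enumerate(column_names) if i != label_idx
        }
        labels = [row[label_idx] for row in rows]
        return tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)

    estimator.train(input_fn=input_fn)
```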
Since SQLFlow is a compiler, which doesn't care about execution, it seems that we should have a command-line compiler. What would be a good name for the compiler binary?
@wangkuiyi `sqlflow` is good.
That `tensorflow.train` calls feature derivation and the verifier, and then trains the TensorFlow model.
I am afraid that if we go this way, we will move a lot of Go code into Python.
In this way, sqlflowserver may only do the following things:
- parse the SQL statements
- attribute checking
- monitor the workflow status
- codegen to call `tensorflow.train/predict/evaluate/...`
All other things would be done in Python, for example (a rough sketch of the feature derivation part follows this list):
- database connection (fetch samples to do any verification or derivation).
- feature derivation.
- generate feature column API calls of TensorFlow/XGBoost.
- ...
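A minimal sketch of that "fetch samples, then derive feature columns" flow, assuming a sqlite3 connection purely for illustration; the real runtime would go through SQLFlow's database abstraction and a richer set of type-inference rules.

```python
# A hedged sketch of sample fetching plus feature column derivation in
# Python; sqlite3 and the inference rules below are assumptions.
import sqlite3

import tensorflow as tf


def fetch_samples(dsn, select_stmt, n=1000):
    """Fetch up to n sample rows plus column names for derivation."""
    conn = sqlite3.connect(dsn)
    try:
        cursor = conn.execute(select_stmt)
        names = [d[0] for d in cursor.description]
        return names, cursor.fetchmany(n)
    finally:
        conn.close()


def derive_feature_columns(names, rows):
    """Numeric columns stay numeric; everything else becomes categorical."""
    columns = []
    for i, name in enumerate(names):
        values = [r[i] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            columns.append(tf.feature_column.numeric_column(name))
        else:
            vocab = sorted(set(str(v) for v in values))
            columns.append(tf.feature_column.indicator_column(
                tf.feature_column.categorical_column_with_vocabulary_list(
                    name, vocab)))
    return columns
```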
Python code may be less maintainable than Go code.
The updated code structure below is based on https://github.com/sql-machine-learning/sqlflow/issues/2494#issuecomment-647692915, with the following changes:
- Move semantics from Go to the `runtime` Python package.
- Remove the `basic` top-level folder.
- Move Go code to the `go` folder.
|-go
| |--cmd
| | |--sqlflowserver // SQLFlow gRPC server
| | |--modelzooserver // SQLFlow Model Zoo gRPC server
| | `--sqlflow // SQLFlow command-line tool
| |--pkg
| | |--ir
| | |--parser
| | |--log
| | |--model
| | |--pipe
| | |--sqlfs
| | |--tar
| | |--test
| | |--codegen
| | | |--pai
| | | | |--tensorflow
| | | | |--xgboost
| | | | |--kmeans
| | | |--alisa
| | | |--tensorflow
| | | |--couler
| | | `--xgboost
| | |--server // SQLFlow server interface implementation
| | | |--proto
| | | |--run
| | | `--fetch
| | |--modelzoo
| | |--executor
| | | |--argo // Argo is workflow executor
| | | `--python // Python is workflow step executor
|-python
| |--sqlflow.runtime
| | |--pai
| | | `--tensorflow/xgboost/shap
| | |--alisa
| | |--tensorflow
| | |--xgboost
| | |--feature_derivation
| | |--verifier
| | `--db_writer
| `--couler
`-java
The following tasks should be done:
- Move Go code to the `go` folder.
- Move feature derivation to the `runtime` Python package.
- Move the verifier to the `runtime` Python package.
- Update codegen based on the new feature derivation and verifier code.
- Move `pai/alisaSubmitter` to the `runtime` Python package and keep Python as the only workflow step interpreter.
Python code may be less maintainable than Go code.
@sneaxiy I agree with that. In my opinion, we can keep the Go packages `feature_derivation`/`verifier`/`sqlfs` and export them as a Python API, so that we can call them from the Python `runtime` package (see the sketch below). What do you think?
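As a rough illustration of "exporting the Go packages as a Python API", the sketch below assumes the Go code is built into a C shared library (for example with `go build -buildmode=c-shared`) that exports a `DeriveFeatures` function taking and returning JSON; the library name, symbol, and protocol are all hypothetical, not part of the current code base.

```python
# A hedged sketch of calling Go packages from Python via a C shared
# library and ctypes; the library and exported symbol are assumptions.
import ctypes
import json

# Hypothetical shared library built from the Go feature_derivation and
# verifier packages.
_lib = ctypes.CDLL("./libsqlflow_runtime.so")
_lib.DeriveFeatures.argtypes = [ctypes.c_char_p]
_lib.DeriveFeatures.restype = ctypes.c_char_p


def derive_features(datasource, select_stmt):
    """Call the exported Go function and decode its JSON result."""
    request = json.dumps({"datasource": datasource, "select": select_stmt})
    response = _lib.DeriveFeatures(request.encode("utf-8"))
    return json.loads(response.decode("utf-8"))
```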
@sneaxiy I agree with that. In my opinion, we can keep the Go packages feature_derivation/verifier/sqlfs and export them as a Python API, so that we can call them from the Python runtime package. What do you think?
@Yancey1989 It may be more complex. Let us find some ways to make the Python code more maintainable, such as improving code coverage, etc.
TODOs for SQLFlow compiler refactor:
- Move `feature_derivation` to the `runtime` Python package.
- Separate the `verifier` into two parts:
  - Compile time does the attribute checking.
  - Runtime verifies the data schema.
- Refactor the existing code generators:
  - Update them based on the new feature derivation and verifier code.
  - Add `codegen/pai` to generate the PAI submitter program.
  - Add `codegen/alisa` to generate the Alisa submitter program.
- Move the workflow step `response` Go package to Python.
There are two main problems with the above plan:
- Python code is harder to maintain than Go. We can do the following to mitigate this (a tiny example follows this list):
  - Follow the Google Python style guide.
  - Improve code coverage.
  - Use static type-checking tools, e.g. pytype, to check Python types statically.
- ROI. We would have to move a lot of Go code to Python, which takes about two man-months. Should we do that immediately?
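For example, with type annotations a static checker such as pytype or mypy can catch misuse before a step ever runs; the snippet below is generic Python, nothing SQLFlow-specific.

```python
# A tiny example of the kind of error a static type checker catches.
from typing import Sequence


def rows_to_csv(rows: Sequence[Sequence[object]]) -> str:
    """Serialize query result rows to CSV text."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)


# A static checker flags this call, because an int is not a sequence of rows:
# rows_to_csv(42)
```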
Supply a Python `db-api` to access `alisa`. @Yancey1989
- Add `codegen/pai` to generate the PAI submitter program.
During workflow compilation, for the `TO RUN` statement, we will have the flexibility to generate a different command-line call for each step according to the deployment platform and the execution program. We should upgrade the compiler to generate the appropriate code for the step according to these two (or more) variables; a sketch follows the list.
- Vanilla Kubernetes && Python program: python a_python_program.py param_a param_b
- Vanilla Kubernetes && executable binary: an_execution_binary param_a param_b
- MaxCompute && Python program: alisa.submitter a_python_program.py param_a param_b
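A minimal sketch of how the compiler could select the step command line from these two variables; the platform and program values mirror the examples above, and the function is hypothetical rather than an agreed interface.

```python
# A sketch of selecting the step command line from
# (deployment platform, execution program).
def step_command(platform, program, params):
    """Map (deployment platform, execution program) to a step command line."""
    if platform == "kubernetes" and program.endswith(".py"):
        return ["python", program] + params
    if platform == "kubernetes":
        return [program] + params  # executable binary
    if platform == "maxcompute" and program.endswith(".py"):
        return ["alisa.submitter", program] + params
    raise ValueError("unsupported combination: %s / %s" % (platform, program))


if __name__ == "__main__":
    # TO RUN a Python program on vanilla Kubernetes.
    print(step_command("kubernetes", "a_python_program.py", ["param_a", "param_b"]))
```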