Rethinking feature derivation as JIT compilation
SQLFlow is a compiler that compiles SQL statements into an Argo workflow YAML. However, the feature derivation stage must run at runtime rather than while compiling the Argo YAML. Consider the following example:
```sql
CREATE TABLE a (...);
INSERT INTO a VALUES(...);
SELECT * FROM a TO TRAIN ...;
```
The table `a` is created and populated by the SQL statements preceding the `TO TRAIN` statement, so we can only do feature derivation after those statements have run.
The feature derivation stage may generate:
- the TensorFlow/XGBoost feature column code, or
- the transforming DAG as described in this design doc.
This is similar to the Just-In-Time (JIT) technique in compiler theory: we can think of the feature derivation stage as a JIT compilation stage that compiles the `COLUMN` clause into feature column code or a transforming DAG.
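For illustration, here is a minimal, hypothetical sketch of that JIT view: at runtime, after the preceding statements have produced table `a`, we inspect the derived field metadata and emit TensorFlow feature columns. The function and field names below are made up for this example and are not SQLFlow's actual API.

```python
import tensorflow as tf

def derive_feature_columns(field_metas):
    """field_metas: (name, dtype, vocabulary) tuples inferred at runtime,
    e.g., by inspecting the schema and sample rows of table `a`."""
    columns = []
    for name, dtype, vocab in field_metas:
        if dtype in ("int", "float"):
            columns.append(tf.feature_column.numeric_column(name))
        else:
            cat = tf.feature_column.categorical_column_with_vocabulary_list(name, vocab)
            columns.append(tf.feature_column.indicator_column(cat))
    return columns

# Pretend these metas were derived after the INSERT statement ran.
feature_columns = derive_feature_columns([
    ("age", "int", None),
    ("gender", "string", ["male", "female"]),
])
```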
A real-world JIT compiler for comparison is NVRTC, a JIT compiler for CUDA kernel code. NVRTC is widely used in compiler frameworks such as TVM, which uses NVRTC to compile CUDA kernel code at runtime.
Take jitify (a wrapper of NVRTC) for example. Users write jitify C++ code that compiles CUDA source given as a string, obtains a compiled, callable C++ module, and launches CUDA kernels through that module. All of these steps happen at runtime.
```cpp
#include <jitify.hpp>  // header-only wrapper around NVRTC

const char* program_source =
    "my_program\n"
    "template<int N, typename T>\n"
    "__global__\n"
    "void my_kernel(T* data) {\n"
    "    T data0 = data[0];\n"
    "    for( int i=0; i<N-1; ++i ) {\n"
    "        data[0] *= data0;\n"
    "    }\n"
    "}\n";

static jitify::JitCache kernel_cache;
jitify::Program program = kernel_cache.program(program_source);
// ...set up data etc.
float* data = nullptr;  // e.g., allocated with cudaMalloc
dim3 grid(1);
dim3 block(1);
using jitify::reflection::type_of;
program.kernel("my_kernel")
    .instantiate(3, type_of(*data))  // compile the CUDA kernel code and return a callable module
    .configure(grid, block)          // set the CUDA launch configuration
    .launch(data);                   // run the compiled CUDA kernel
```
The comparison between NVRTC/jitify and the SQLFlow feature derivation stage is as follows:

| | NVRTC/jitify | SQLFlow feature derivation stage |
|---|---|---|
| Things to compile | code in string form | the `COLUMN` clause in the SQL statement |
| Ahead-of-time or just-in-time | just in time | just in time |
| Compilation output | a callable C++ object | callable feature column code or a transforming DAG |
Reasonable. In previous discussions with @Yancey1989 and @shendiaomo, we have talked about this. Yet, we can somehow derive the feature columns at compile time:
```sql
CREATE TABLE a (...);
INSERT INTO a VALUES(...);
SELECT * FROM a TO TRAIN ...;
```
- If the data is inserted into the training table within the SQL program, we can generate feature column definitions by parsing the `INSERT` statements (see the sketch after this list).
- If the data is already stored in the original table, we can execute the data preparation statements against a small dataset and derive the feature columns from the result.
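As a rough sketch of the first option, assuming the simplest `INSERT INTO t VALUES (...)` form, a compile-time pass could guess each column's feature type from the inserted literals without executing any SQL (the function below is a toy, not SQLFlow code):

```python
import re

def infer_types_from_insert(stmt):
    """Toy compile-time inference: string literals become CATEGORY,
    bare numbers become NUMERIC."""
    values = re.search(r"VALUES\s*\((.*)\)", stmt, re.IGNORECASE).group(1)
    types = []
    for literal in (v.strip() for v in values.split(",")):
        if literal.startswith(("'", '"')):
            types.append("CATEGORY")
        else:
            types.append("NUMERIC")
    return types

print(infer_types_from_insert("INSERT INTO a VALUES (42, 3.14, 'male');"))
# ['NUMERIC', 'NUMERIC', 'CATEGORY']
```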
> The feature derivation stage may generate:
> - the TensorFlow/XGBoost feature column code, or
> - the transforming DAG as described in this design doc.
One point to add: the transform DAG can fully represent the data transform logic, and its output takes two forms:
- Generated code built upon preprocessing layers or feature columns.
- A data structure (an IR?) that represents the DAG, plus a library that parses the structure and builds the transform logic in memory (see the sketch after this list). For TensorFlow, the library is built upon preprocessing layers.
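For instance, here is a hypothetical sketch of the second form: a tiny IR for the transform DAG plus a helper that materializes it with Keras preprocessing layers. The node schema and operation names are invented for illustration, assuming TensorFlow 2.x.

```python
import tensorflow as tf

# Invented IR: one dict per transform node, keyed by the input field.
transform_dag = [
    {"op": "normalize", "input": "age", "mean": 38.0, "variance": 120.0},
    {"op": "vocabularize", "input": "gender", "vocab": ["male", "female"]},
]

def build_transform(dag):
    """Parse the IR and build the transform logic in memory."""
    layers = {}
    for node in dag:
        if node["op"] == "normalize":
            layers[node["input"]] = tf.keras.layers.Normalization(
                mean=node["mean"], variance=node["variance"])
        elif node["op"] == "vocabularize":
            layers[node["input"]] = tf.keras.layers.StringLookup(
                vocabulary=node["vocab"], output_mode="one_hot")
    return layers

preprocessing_layers = build_transform(transform_dag)
```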
> Yet, we can somehow derive the feature columns at compile time.
@typhoonzero Yes, but it may require us to write a parser that produces the full AST of the complete SQL statement, instead of only parsing the extended syntax like `TO TRAIN`, `TO PREDICT`, etc.
> Reasonable. In previous discussions with @Yancey1989 and @shendiaomo, we have talked about this. Yet, we can somehow derive the feature columns at compile time:
>
> `CREATE TABLE a (...); INSERT INTO a VALUES(...); SELECT * FROM a TO TRAIN ...;`
>
> - If the data is inserted into the training table within the SQL program, we can generate feature column definitions by parsing the `INSERT` statements.
> - If the data is already stored in the original table, we can execute the data preparation statements against a small dataset and derive the feature columns from the result.
It would be cool if we could do more work at compile time. But I'm afraid some of the information necessary for feature derivation is only available at runtime. For example, in a SQL program where the `TO TRAIN` statement is not the first statement, the source table for `TO TRAIN` is generated by the preceding native SQL statements.
- Field format: to tell whether a field is `kv` or `csv`, we need some sample data to check which format the field uses (a toy example follows this list). The SQL statements that create the table and insert the data must have completed before we can do this, so I'm not sure we can get the format at compile time.
- Analysis result: in the next step, we do data analysis on the full table to fill in the parameters of the transform functions. This analysis may take a long time on a big dataset, so it is better executed at runtime.
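To make the field-format point concrete, a toy runtime check might look like this (not SQLFlow's actual detector); it needs a few sample rows, which exist only after the earlier statements have run:

```python
def detect_field_format(samples):
    """Classify a string field as "kv" (space-separated index:value pairs)
    or plain "csv" by inspecting sample cells fetched at runtime."""
    def looks_like_kv(cell):
        pairs = cell.split()
        return bool(pairs) and all(":" in p for p in pairs)

    return "kv" if all(looks_like_kv(s) for s in samples) else "csv"

print(detect_field_format(["1:0.5 7:1.2", "2:3.0"]))    # kv
print(detect_field_format(["0.5,1.2,0.0", "3.0,1,2"]))  # csv
```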
> Reasonable. In previous discussions with @Yancey1989 and @shendiaomo, we have talked about this. Yet, we can somehow derive the feature columns at compile time:
>
> `CREATE TABLE a (...); INSERT INTO a VALUES(...); SELECT * FROM a TO TRAIN ...;`
>
> - If the data is inserted into the training table within the SQL program, we can generate feature column definitions by parsing the `INSERT` statements.
> - If the data is already stored in the original table, we can execute the data preparation statements against a small dataset and derive the feature columns from the result.
> It would be cool if we could do more work at compile time. But I'm afraid some of the information necessary for feature derivation is only available at runtime. For example, in a SQL program where the `TO TRAIN` statement is not the first statement, the source table for `TO TRAIN` is generated by the preceding native SQL statements.
>
> - Field format: to tell whether a field is `kv` or `csv`, we need some sample data to check which format the field uses. The SQL statements that create the table and insert the data must have completed before we can do this, so I'm not sure we can get the format at compile time.
> - Analysis result: in the next step, we do data analysis on the full table to fill in the parameters of the transform functions. This analysis may take a long time on a big dataset, so it is better executed at runtime.
Some work can only be done at runtime, such as counting the categories of a column in a table output by a former statement; JIT fits such cases well because it is flexible. But, as discussed before, static AOT analysis is also useful. For example, if we know the table schema beforehand, we can do feature derivation even if the table does not exist yet. In that case, static AOT analysis is the key to running feature derivation and data pre-processing concurrently, and it may let us execute more subtasks concurrently. AOT analysis may be complicated, but it is usually low cost. So we can pursue both directions to make the system more efficient and flexible.
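As a sketch of the concurrency argument, assuming the training table's schema is declared ahead of time, schema-only (AOT) feature derivation could overlap with the data preparation statements instead of waiting for them (all names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Declared ahead of time; the table itself may not exist yet.
known_schema = {"age": "INT", "gender": "VARCHAR"}

def run_data_preparation():
    # Placeholder for executing the preceding native SQL statements.
    return "table a is ready"

def derive_features_from_schema(schema):
    # Schema-only derivation: no sample data required.
    return {name: "NUMERIC" if t in ("INT", "FLOAT", "DOUBLE") else "CATEGORY"
            for name, t in schema.items()}

with ThreadPoolExecutor() as pool:
    prep = pool.submit(run_data_preparation)
    feats = pool.submit(derive_features_from_schema, known_schema)
    print(prep.result(), feats.result())
```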
> A real-world JIT compiler for comparison is NVRTC, a JIT compiler for CUDA kernel code. NVRTC is widely used in compiler frameworks such as TVM, which uses NVRTC to compile CUDA kernel code at runtime.
>
> The comparison between NVRTC/jitify and the SQLFlow feature derivation stage is as follows:
>
> | | NVRTC/jitify | SQLFlow feature derivation stage |
> |---|---|---|
> | Things to compile | code in string form | the `COLUMN` clause in the SQL statement |
> | Ahead-of-time or just-in-time | just in time | just in time |
> | Compilation output | a callable C++ object | callable feature column code or a transforming DAG |
Thanks for the proposal. About two weeks ago, in a discussion with @brightcoder01, we thought of a similar idea: compiling the `COLUMN` clause into native code, that is, `COLUMN` -> Preprocessing Layers -> Native Code.
I agree this is a great idea, but there are still some details to be discussed:
- What's the target of the JIT compiler?
  - Shall the JIT compiler generate native code, Python, or a DAG?
  - How about defining an instruction set, as described in https://github.com/sql-machine-learning/sqlflow/issues/2720, for both compilers (a toy illustration appears after this list)?
- How do we define a reasonable borderline between compilation and JIT?
  - Which parts of an SQLFlow program should be compiled by the main compiler, and which by the JIT compiler?
  - To some extent, the original architecture before the refactoring already has a JIT mechanism: the first compiler compiles a program into YAML at compile time, and the second compiler compiles a statement into Python at runtime.
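Purely as a toy illustration of the shared-instruction-set idea (this is not the design in issue #2720), the AOT compiler could emit a small list of column instructions into the workflow, and the runtime/JIT side could lower them to feature columns or a transform DAG:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnInstr:
    """One invented instruction describing how to transform a field."""
    op: str            # e.g. "NUMERIC", "BUCKETIZE", "EMBEDDING"
    field_name: str
    args: dict = field(default_factory=dict)

# A hypothetical "compiled" COLUMN clause, serializable into the workflow YAML.
program = [
    ColumnInstr("NUMERIC", "age"),
    ColumnInstr("BUCKETIZE", "income", {"boundaries": [30000, 60000, 90000]}),
    ColumnInstr("EMBEDDING", "gender", {"dimension": 8}),
]
```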