[Discussion] SQLFlow statement into steps.
A SQLFlow statement can be compiled into one or more Couler steps. Here we list what each step does, which image it relies on, and how it accesses data:
TO TRAIN:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Data exploration (for each categorical column, compute the distinct count) and feature derivation (if the distinct count is small, use VOCABULARIZE to map values to IDs; if it is large, use hashing, etc.) | step image | SQL: DESCRIBE for table schema, aggregation SQL |
| Data Auto Analysis | step image | SQL |
| Generate transform code and submit training | model image from model zoo | PAI-TF(inside MaxCompute), ElasticDL(read into K8S) |
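To make the decomposition concrete, here is a minimal Python sketch of the TO TRAIN step chain. The helper names (`run_sql_step`, `submit_training_step`), the image tags, and the SQL statements are hypothetical placeholders, not the actual SQLFlow or Couler API.

```python
# Hypothetical sketch of the TO TRAIN step chain; names and images are made up.
STEP_IMAGE = "sqlflow/step:latest"            # assumed step image
MODEL_IMAGE = "modelzoo/dnnclassifier:tf2.0"  # assumed model image from the model zoo


def run_sql_step(sql, image=STEP_IMAGE):
    """Placeholder for running a SQL statement inside a step container."""
    print(f"[{image}] {sql}")


def submit_training_step(transform_code, image=MODEL_IMAGE):
    """Placeholder for submitting the training job to PAI-TF or ElasticDL."""
    print(f"[{image}] submit training with transform:\n{transform_code}")


# Step 1: SELECT data from the source table into a temp table.
run_sql_step("CREATE TABLE tmp_train AS SELECT * FROM iris.train")

# Step 2: data exploration and feature derivation, e.g. distinct counts of
# categorical columns to choose between vocabulary lookup and hashing.
run_sql_step("SELECT COUNT(DISTINCT class) FROM tmp_train")

# Step 3: data auto analysis (summary statistics used by later steps).
run_sql_step("SELECT AVG(sepal_length), STDDEV(sepal_length) FROM tmp_train")

# Step 4: generate the transform code and submit the training job.
submit_training_step("tf.feature_column.embedding_column(...)")
```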
TO PREDICT:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Use the transform code from training and submit Prediction Job | model image from model zoo | PAI-TF(inside MaxCompute), ElasticDL(read into K8S) |
TO EVALUATE:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Use the transform code from training and submit Evaluation Job | model image from model zoo | PAI-TF(inside MaxCompute), ElasticDL(read into K8S) |
TO RUN:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Run the Python function directly in Docker (Kubernetes) / Submit a PyODPS task via Alisa (MaxCompute) | function image from model zoo | Kubernetes (needs access), PyODPS task (inside MaxCompute) |
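To illustrate the two execution paths of the TO RUN step, here is a rough sketch that dispatches by platform. The `kubectl run` invocation follows the usual CLI pattern, but the `goalisa` command line and all helper names are assumptions for illustration only.

```python
import subprocess


def run_function_step(platform, image, entrypoint):
    """Illustrative dispatch for the TO RUN step; not the real SQLFlow code."""
    if platform == "kubernetes":
        # Run the Python function directly in a container on Kubernetes.
        subprocess.run(
            ["kubectl", "run", "sqlflow-run", "--image", image,
             "--restart=Never", "--", "python", "-m", entrypoint],
            check=True)
    elif platform == "maxcompute":
        # Submit a PyODPS task via Alisa; this command line is hypothetical.
        subprocess.run(["goalisa", "submit-pyodps", entrypoint], check=True)
    else:
        raise ValueError(f"unsupported platform: {platform}")


# Usage example (hypothetical image and entrypoint):
# run_function_step("kubernetes", "modelzoo/my_func:latest", "my_func.main")
```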
Discussion points:
- Should we differentiate the step image by platform? For example, one step image for MaxCompute and one for Kubernetes. - proposed by @typhoonzero
- Should model / function images from model zoo be platform agnostic?
Note: the model / function image from the model zoo not only contains the model/function code and its dependencies for execution, but also the modules used to submit jobs to different platforms, such as `goalisa` for MaxCompute, and `kubectl` and the `elasticdl` client for Kubernetes.
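As a small illustration of what "bundled submission modules" means in practice, the sketch below checks that the expected clients are present on the image's PATH. The binary names here (especially `goalisa` and `elasticdl` as standalone executables) are assumptions, not a statement about the real images.

```python
import shutil

# Assumed binary names of the job-submission clients baked into the image.
REQUIRED_CLIENTS = {
    "maxcompute": ["goalisa"],               # Alisa client for MaxCompute
    "kubernetes": ["kubectl", "elasticdl"],  # kubectl and the ElasticDL client
}

for platform, clients in REQUIRED_CLIENTS.items():
    missing = [c for c in clients if shutil.which(c) is None]
    if missing:
        print(f"{platform}: missing submit clients {missing}")
```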
Deployment Platform vs Step (Continuous Improvement)
| Step | Google Cloud | Amazon AWS | Alibaba MaxCompute | Travis CI | Vanilla Kubernetes |
|---|---|---|---|---|---|
| Store SELECT FROM into a temp table | step image (BigQuery) | | step image (goalisa) | | |
| Data Exploration + Feature Derivation | step image (BigQuery) | | step image (goalisa) | | |
| Data Analysis | step image (BigQuery) | | step image (goalisa) | | |
| Generate Transform code + Submit Training | model image | | | | |
We should clean up the temp table at the end of each job, so we need at least three steps for each job:
- create a temp table
- submit a job
- clean up the temp table
Another question: how do we ensure the clean-up step still runs if the job fails?
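One way to guarantee the clean-up runs even when the job fails is to treat it like an exit handler (Argo's `onExit`, or a `try/finally` in plain Python). Here is a minimal sketch under that assumption; the helper names are hypothetical.

```python
def create_temp_table():
    """Placeholder: CREATE TABLE tmp_xyz AS SELECT ..."""


def submit_job():
    """Placeholder: submit the training/prediction/evaluation job."""


def drop_temp_table():
    """Placeholder: DROP TABLE IF EXISTS tmp_xyz"""


create_temp_table()
try:
    submit_job()
finally:
    # Runs whether submit_job() succeeds or raises, mirroring an Argo
    # exit-handler (onExit) step in the generated workflow.
    drop_temp_table()
```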
I am not sure if we could formalize the execution of the TO TRAIN statement into a specific number n of steps.
However, I have an idea that might help us figure this out: let us draw a 2-dimensional table -- one dimension includes various deployments (on Google Cloud, Amazon AWS, Alibaba internal, Travis CI, etc), and the other one includes the proposed steps. In each cell, we need to figure out the Docker image to be executed by a step in the given deployment.
We might add more dimensions, such as various DBMSes, AI platforms, model definition APIs (TensorFlow/Keras, PyTorch, XGBoost), and data readers (MaxCompute multi-tunnel reader, local filesystem reader, etc.).
We also need to take the model zoo design into consideration because it seems that at least one of the steps to execute the TO TRAIN statement needs to train a model defined in a zoo.
Users can choose different base images to match their model definitions, e.g. tf-2.0, tf-1.15, pytorch-1.0, etc. SQLFlow provides a list of base images to choose from and guarantees that each base image works on every execution engine, including local runs, ElasticDL, PAI, etc.
For the first step of all the statements, SELECT Data from source table into Temp table, the current behavior differs across DBMSes:
- MySQL/Hive: We don't create the temporary table.
- MaxCompute: We create the temporary table.
Question: can we unify the behavior and write the result of SELECT * FROM into a temporary storage (table) for all DBMSes?
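If we unify the behavior, the first step could always materialize the SELECT result with a CREATE TABLE ... AS SELECT (or an equivalent temporary-table statement) regardless of DBMS. A minimal sketch, with hypothetical table names and SQL:

```python
# Hypothetical unified first step: materialize the SELECT into a temp table.
source_sql = "SELECT * FROM iris.train WHERE class < 2"
temp_table = "sqlflow_tmp_20200101_abcdef"  # generated unique name (assumed format)

create_temp_sql = f"CREATE TABLE {temp_table} AS {source_sql}"
drop_temp_sql = f"DROP TABLE IF EXISTS {temp_table}"

print(create_temp_sql)
print(drop_temp_sql)
```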
> For the first step of all the statements, SELECT Data from source table into Temp table, the current behavior differs across DBMSes:
> - MySQL/Hive: We don't create the temporary table.
> - MaxCompute: We create the temporary table.

Addition: this does not depend on the DBMS, but on the AI engine.
> - Should we differentiate the step image by platform?
> - Should model / function images from the model zoo be platform agnostic?

For these two questions above, the discussion notes:
- Yes. Different step images for different deployment platforms (one image per platform): Vanilla Kubernetes, Alibaba internal, Google Cloud, and so on.
- No. Different model zoo services for different deployment platforms. For the same Keras model definition such as `DNNClassifier`, we will build different model images from different base images for different deployment platforms. For example, the base image for Vanilla Kubernetes contains elasticdl, and the base image for Alibaba internal contains goalisa.
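A small sketch of how the model zoo build service might pick a base image per deployment platform for the same model definition; all image names below are invented for illustration.

```python
# Invented base-image names; the real model zoo build service may differ.
BASE_IMAGES = {
    "vanilla_kubernetes": "modelzoo/base-tf2-elasticdl:latest",  # bundles elasticdl
    "alibaba_internal": "modelzoo/base-tf2-goalisa:latest",      # bundles goalisa
}


def model_image_tag(model_def, platform):
    """Derive the image tag to build for a model definition on a platform."""
    base = BASE_IMAGES[platform]
    return f"modelzoo/{model_def.lower()}-{platform}:latest", base


print(model_image_tag("DNNClassifier", "vanilla_kubernetes"))
```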
> We should clean up the temp table at the end of each job, so we need at least three steps for each job:
> - create a temp table
> - submit a job
> - clean up the temp table
>
> Another question: how do we ensure the clean-up step still runs if the job fails?

We can set a lifecycle on the temp table.
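For example, on MaxCompute a table can be given a lifecycle in days so that it is reclaimed automatically even if the clean-up step never runs. The SQL below uses MaxCompute's LIFECYCLE clause; the table name is hypothetical.

```python
# MaxCompute reclaims the table automatically after the lifecycle (in days)
# expires, so a missed clean-up step does not leak storage forever.
temp_table = "sqlflow_tmp_20200101_abcdef"  # hypothetical generated name

# Set the lifecycle when creating the temp table with CTAS:
create_sql = f"CREATE TABLE {temp_table} LIFECYCLE 1 AS SELECT * FROM iris.train"
# Or set it on an existing table:
alter_sql = f"ALTER TABLE {temp_table} SET LIFECYCLE 1"
```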