
[Discussion] Compiling a SQLFlow statement into steps.


A SQLFlow statement can be compiled into one or more Couler steps. Below we list what each step does and which image it relies on:

TO TRAIN:

| Step | Image | Data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Data exploration (for each categorical column, compute the distinct count) and feature derivation (if the distinct count is small, use VOCABULARIZE to map values to IDs; if it is large, use Hash, and so on) | step image | SQL: DESCRIBE for the table schema, aggregation SQL |
| Data auto analysis | step image | SQL |
| Generate transform code and submit training | model image from the model zoo | PAI-TF (inside MaxCompute), ElasticDL (in Kubernetes) |
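To make the decomposition concrete, here is a minimal sketch of how these four steps could be emitted through Couler's Python API. The image names and the `step` command line are hypothetical placeholders, not SQLFlow's actual implementation:

```python
# Hypothetical compilation of "SELECT ... TO TRAIN ..." into Couler steps.
import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter

SQL_STEP_IMAGE = "sqlflow/step:sql"                    # hypothetical step image
MODEL_IMAGE = "modelzoo.example.com/dnnclassifier:v1"  # hypothetical model-zoo image

# Step 1: materialize the SELECT into a temp table.
couler.run_container(
    image=SQL_STEP_IMAGE,
    command=["step", "--execute",
             "CREATE TABLE tmp_train AS SELECT * FROM source_table"],
)

# Step 2: data exploration and feature derivation (DESCRIBE + aggregation SQL).
couler.run_container(
    image=SQL_STEP_IMAGE,
    command=["step", "--derive-features", "--table", "tmp_train"],
)

# Step 3: data auto analysis.
couler.run_container(
    image=SQL_STEP_IMAGE,
    command=["step", "--analyze", "--table", "tmp_train"],
)

# Step 4: generate transform code and submit training; the model image
# bundles the submitter client (e.g., the ElasticDL client for Kubernetes).
couler.run_container(
    image=MODEL_IMAGE,
    command=["train", "--data", "tmp_train"],
)

couler.run(submitter=ArgoSubmitter())
```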

TO PREDICT:

| Step | Image | Data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Use the transform code from training and submit the prediction job | model image from the model zoo | PAI-TF (inside MaxCompute), ElasticDL (in Kubernetes) |

TO EVALUATE:

| Step | Image | Data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Use the transform code from training and submit the evaluation job | model image from the model zoo | PAI-TF (inside MaxCompute), ElasticDL (in Kubernetes) |

TO RUN:

| Step | Image | Data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Run the Python function in Docker directly (Kubernetes) / submit a PyODPS task via Alisa (MaxCompute) | function image from the model zoo | Kubernetes (needs access), PyODPS task (inside MaxCompute) |
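A sketch of how the second TO RUN step could branch by platform; the image names, the `step` CLI, and the function entrypoint are assumptions for illustration:

```python
# Hypothetical TO RUN step: run the function image directly on Kubernetes,
# or have a step image submit a PyODPS task via Alisa on MaxCompute.
import couler.argo as couler

def run_step(platform):
    if platform == "kubernetes":
        # The function image from the model zoo is executed directly.
        couler.run_container(
            image="modelzoo.example.com/my_func:v1",  # hypothetical
            command=["python", "-m", "my_func", "--input", "tmp_run"],
        )
    else:  # "maxcompute"
        # The step image bundles goalisa and submits a PyODPS task via Alisa.
        couler.run_container(
            image="sqlflow/step:alisa",               # hypothetical
            command=["step", "--submit-pyodps", "my_func.py"],
        )
```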

Discussion points:

  • Should we differentiate the step image by platform? For example, one step image for MaxCompute and one for Kubernetes. - proposed by @typhoonzero
  • Should model / function images from the model zoo be platform agnostic? Note: a model / function image from the model zoo contains not only the model/function code and its execution dependencies, but also the modules that submit jobs to the different platforms - such as goalisa for MaxCompute, and kubectl plus the elasticdl client for Kubernetes.

Deployment Platform vs. Step (Continuous Improvement)

| Step | Google Cloud | Amazon AWS | Alibaba MaxCompute | Travis CI | Vanilla Kubernetes |
| --- | --- | --- | --- | --- | --- |
| Store SELECT ... FROM ... into a temp table | step image (BigQuery) | | step image (goalisa) | | |
| Data exploration + feature derivation | step image (BigQuery) | | step image (goalisa) | | |
| Data analysis | step image (BigQuery) | | step image (goalisa) | | |
| Generate transform code + submit training | model image | | | | |

brightcoder01 avatar May 25 '20 02:05 brightcoder01

We should clean up the temp table at the end, so we need at least three steps for each job:

  1. create the temp table
  2. submit the job
  3. clean up the temp table

Another question: how do we ensure the clean-up step still gets scheduled if the job fails?
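One possible answer, assuming Couler exposes Argo's exit-handler mechanism (`couler.set_exit_handler` with `couler.WFStatus`): register the clean-up step as an exit handler, which Argo schedules regardless of whether the job step succeeded. A sketch with hypothetical image names:

```python
# Clean up the temp table via an exit handler so it runs even on failure.
import couler.argo as couler

def drop_temp_table():
    return couler.run_container(
        image="sqlflow/step:sql",  # hypothetical step image
        command=["step", "--execute", "DROP TABLE IF EXISTS tmp_job"],
    )

# Step 1: create the temp table.
couler.run_container(
    image="sqlflow/step:sql",
    command=["step", "--execute",
             "CREATE TABLE tmp_job AS SELECT * FROM source_table"],
)

# Step 2: submit the job.
couler.run_container(
    image="modelzoo.example.com/dnnclassifier:v1",  # hypothetical
    command=["train", "--data", "tmp_job"],
)

# Step 3: clean-up, scheduled by Argo whether the workflow succeeds or fails.
couler.set_exit_handler(couler.WFStatus.Succeeded, drop_temp_table)
couler.set_exit_handler(couler.WFStatus.Failed, drop_temp_table)
```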

Yancey0623 avatar May 25 '20 05:05 Yancey0623

I am not sure if we could formalize the execution of the TO TRAIN statement into a specific number n of steps.

However, I have an idea that might help us figure this out: let us draw a 2-dimensional table -- one dimension includes various deployments (on Google Cloud, Amazon AWS, Alibaba internal, Travis CI, etc), and the other one includes the proposed steps. In each cell, we need to figure out the Docker image to be executed by a step in the given deployment.

We might add more dimensions, such as various DBMSes, AI platforms, model definition APIs (TensorFlow/Keras, PyTorch, XGBoost), and data readers (the MaxCompute multi-tunnel reader, a local filesystem reader, etc.).

We also need to take the model zoo design into consideration because it seems that at least one of the steps to execute the TO TRAIN statement needs to train a model defined in a zoo.

wangkuiyi avatar May 25 '20 06:05 wangkuiyi

Users can choose among different base images to match their model definitions, e.g. tf-2.0, tf-1.15, pytorch-1.0, etc. SQLFlow provides the list of base images to choose from and guarantees that each base image works on every execution engine, including local runs, ElasticDL, PAI, etc.

typhoonzero avatar May 25 '20 06:05 typhoonzero

For the first step of all the statements, SELECT data from the source table into a temp table, the current behavior differs across DBMSes:

  • MySQL/Hive: we don't create the temporary table.
  • MaxCompute: we create the temporary table.

Question: can we unify the behavior and write the result of the SELECT into temporary storage (a table) for all DBMSes?
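A minimal sketch of what the unified behavior could look like; `materialize_select`, the `conn` interface, and the dialect branching are hypothetical illustrations, not existing SQLFlow code:

```python
# Always write the SELECT result into a fresh temp table, for every DBMS.
import uuid

def materialize_select(conn, select_stmt, dialect):
    """Run select_stmt, store the result in a temp table, return its name."""
    tmp = "sqlflow_tmp_" + uuid.uuid4().hex[:8]
    if dialect == "maxcompute":
        # MaxCompute supports LIFECYCLE, so the table expires on its own.
        sql = "CREATE TABLE %s LIFECYCLE 1 AS %s" % (tmp, select_stmt)
    else:
        # MySQL and Hive accept plain CREATE TABLE ... AS SELECT.
        sql = "CREATE TABLE %s AS %s" % (tmp, select_stmt)
    conn.execute(sql)
    return tmp
```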

brightcoder01 avatar May 25 '20 06:05 brightcoder01

> For the first step of all the statements, SELECT data from the source table into a temp table, the current behavior differs across DBMSes:
>
>   • MySQL/Hive: we don't create the temporary table.
>   • MaxCompute: we create the temporary table.

Addition: this does not depend on the DBMS; it is up to the AI engine.

weiguoz avatar May 25 '20 06:05 weiguoz

>   • Should we differentiate the step image by platform?
>   • Should model / function images from the model zoo be platform agnostic?

For the two questions above, the notes from the discussion:

  1. Yes. Use different step images for the various deployment platforms (one image per platform): Vanilla Kubernetes, Alibaba internal, Google Cloud, and so on.
  2. No. Provide a different model zoo service for each deployment platform. For the same Keras model definition, such as DNNClassifier, we will build different model images from different base images for the various deployment platforms. For example, the base image for Vanilla Kubernetes contains elasticdl, and the base image for Alibaba internal contains goalisa.
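To illustrate note 2, a toy sketch of per-platform image selection; all names here are hypothetical:

```python
# One step image per platform; model images built on per-platform bases.
STEP_IMAGES = {
    "vanilla_kubernetes": "sqlflow/step:kubernetes",
    "alibaba_internal": "sqlflow/step:alisa",
    "google_cloud": "sqlflow/step:bigquery",
}

MODEL_BASE_IMAGES = {
    "vanilla_kubernetes": "sqlflow/model-base:elasticdl",  # bundles elasticdl
    "alibaba_internal": "sqlflow/model-base:goalisa",      # bundles goalisa
}

def model_image(model_name, platform):
    # The model zoo builds, e.g., DNNClassifier once per platform by
    # layering the model definition onto that platform's base image.
    return "modelzoo.example.com/%s-%s:v1" % (model_name, platform)
```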

brightcoder01 avatar May 25 '20 06:05 brightcoder01

> We should clean up the temp table at the end, so we need at least three steps for each job:
>
>   1. create the temp table
>   2. submit the job
>   3. clean up the temp table
>
> Another question: how do we ensure the clean-up step still gets scheduled if the job fails?

We can set a lifecycle on the temp table.
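For example, on MaxCompute a one-day lifecycle can be attached when creating the table, so the platform reclaims it even if the clean-up step never runs. A sketch using PyODPS with placeholder credentials:

```python
# Create the temp table with a lifecycle so MaxCompute drops it automatically.
from odps import ODPS

o = ODPS("<access-id>", "<access-key>", project="my_project",
         endpoint="<odps-endpoint>")

# LIFECYCLE 1: reclaim the table one day after its last modification.
o.execute_sql(
    "CREATE TABLE sqlflow_tmp_job LIFECYCLE 1 AS SELECT * FROM source_table"
)
```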

workingloong avatar May 25 '20 08:05 workingloong