[Discussion] SQLFlow statement into steps.
A SQLFlow statement can be compiled into one or more Couler steps. Here we list what each step does, which image it relies on, and how it accesses data:
TO TRAIN:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Data exploration (for each categorical column, compute the distinct count) and feature derivation (if the distinct count is small, use VOCABULARIZE to map values to IDs; if it is large, use hashing, etc.) | step image | SQL: DESCRIBE for table schema, aggregation SQL |
| Data Auto Analysis | step image | SQL |
| Generate transform code and submit training | model image from model zoo | PAI-TF(inside MaxCompute), ElasticDL(read into K8S) |
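To make the decomposition concrete, here is a minimal Python sketch of the TO TRAIN step chain. The helper names (`run_sql_step`, `submit_training_step`), the image tags, and the SQL statements are hypothetical placeholders, not the actual SQLFlow or Couler API.

```python
# Hypothetical sketch of the TO TRAIN step chain; names and images are made up.
STEP_IMAGE = "sqlflow/step:latest"            # assumed step image
MODEL_IMAGE = "modelzoo/dnnclassifier:tf2.0"  # assumed model image from the model zoo


def run_sql_step(sql, image=STEP_IMAGE):
    """Placeholder for running a SQL statement inside a step container."""
    print(f"[{image}] {sql}")


def submit_training_step(transform_code, image=MODEL_IMAGE):
    """Placeholder for submitting the training job to PAI-TF or ElasticDL."""
    print(f"[{image}] submit training with transform:\n{transform_code}")


# Step 1: SELECT data from the source table into a temp table.
run_sql_step("CREATE TABLE tmp_train AS SELECT * FROM iris.train")

# Step 2: data exploration and feature derivation, e.g. distinct counts of
# categorical columns to choose between vocabulary lookup and hashing.
run_sql_step("SELECT COUNT(DISTINCT class) FROM tmp_train")

# Step 3: data auto analysis (summary statistics used by later steps).
run_sql_step("SELECT AVG(sepal_length), STDDEV(sepal_length) FROM tmp_train")

# Step 4: generate the transform code and submit the training job.
submit_training_step("tf.feature_column.embedding_column(...)")
```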
TO PREDICT:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Use the transform code from training and submit Prediction Job | model image from model zoo | PAI-TF(inside MaxCompute), ElasticDL(read into K8S) |
TO EVALUATE:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Use the transform code from training and submit Evaluation Job | model image from model zoo | PAI-TF(inside MaxCompute), ElasticDL(read into K8S) |
TO RUN:
| step | image | data access |
|---|---|---|
| SELECT Data from source table into Temp table | step image | SQL |
| Run the Python function directly in Docker (Kubernetes) / Submit a PyODPS task via Alisa (MaxCompute) | function image from model zoo | Kubernetes (needs access), PyODPS task (inside MaxCompute) |
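To illustrate the two execution paths of the TO RUN step, here is a rough sketch that dispatches by platform. The `kubectl run` invocation follows the usual CLI pattern, but the `goalisa` command line and all helper names are assumptions for illustration only.

```python
import subprocess


def run_function_step(platform, image, entrypoint):
    """Illustrative dispatch for the TO RUN step; not the real SQLFlow code."""
    if platform == "kubernetes":
        # Run the Python function directly in a container on Kubernetes.
        subprocess.run(
            ["kubectl", "run", "sqlflow-run", "--image", image,
             "--restart=Never", "--", "python", "-m", entrypoint],
            check=True)
    elif platform == "maxcompute":
        # Submit a PyODPS task via Alisa; this command line is hypothetical.
        subprocess.run(["goalisa", "submit-pyodps", entrypoint], check=True)
    else:
        raise ValueError(f"unsupported platform: {platform}")


# Usage example (hypothetical image and entrypoint):
# run_function_step("kubernetes", "modelzoo/my_func:latest", "my_func.main")
```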
Discussion points:
- Should we differentiate the step image by platform? For example, one step image for MaxCompute and one for Kubernetes. - proposed by @typhoonzero
- Should model / function images from model zoo be platform agnostic?
Note: the model / function image from the model zoo not only contains the model/function code and its dependencies for execution, but also the modules used to submit jobs to different platforms, such as `goalisa` for MaxCompute, and `kubectl` and the `elasticdl` client for Kubernetes.
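As a small illustration of what "bundled submission modules" means in practice, the sketch below checks that the expected clients are present on the image's PATH. The binary names here (especially `goalisa` and `elasticdl` as standalone executables) are assumptions, not a statement about the real images.

```python
import shutil

# Assumed binary names of the job-submission clients baked into the image.
REQUIRED_CLIENTS = {
    "maxcompute": ["goalisa"],               # Alisa client for MaxCompute
    "kubernetes": ["kubectl", "elasticdl"],  # kubectl and the ElasticDL client
}

for platform, clients in REQUIRED_CLIENTS.items():
    missing = [c for c in clients if shutil.which(c) is None]
    if missing:
        print(f"{platform}: missing submit clients {missing}")
```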
Deployment Platform vs Step (Continuous Improvement)
| Step | Google Cloud | Amazon AWS | Alibaba MaxCompute | Travis CI | Vanilla Kubernetes |
|---|---|---|---|---|---|
| Store SELECT FROM into a temp table | step image (BigQuery) | | step image (goalisa) | | |
| Data Exploration + Feature Derivation | step image (BigQuery) | | step image (goalisa) | | |
| Data Analysis | step image (BigQuery) | | step image (goalisa) | | |
| Generate Transform code + Submit Training | model image | | | | |
We should clean up the temp table at the end of each job, so we need at least three steps for each job:
- create a temp table
- submit a job
- clean up the temp table
Another question: how do we ensure the clean-up step still runs if the job fails?
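One way to guarantee the clean-up runs even when the job fails is to treat it like an exit handler (Argo's `onExit`, or a `try/finally` in plain Python). Here is a minimal sketch under that assumption; the helper names are hypothetical.

```python
def create_temp_table():
    """Placeholder: CREATE TABLE tmp_xyz AS SELECT ..."""


def submit_job():
    """Placeholder: submit the training/prediction/evaluation job."""


def drop_temp_table():
    """Placeholder: DROP TABLE IF EXISTS tmp_xyz"""


create_temp_table()
try:
    submit_job()
finally:
    # Runs whether submit_job() succeeds or raises, mirroring an Argo
    # exit-handler (onExit) step in the generated workflow.
    drop_temp_table()
```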
I am not sure if we could formalize the execution of the TO TRAIN statement into a specific number n of steps.
However, I have an idea that might help us figure this out: let us draw a 2-dimensional table -- one dimension includes various deployments (on Google Cloud, Amazon AWS, Alibaba internal, Travis CI, etc), and the other one includes the proposed steps. In each cell, we need to figure out the Docker image to be executed by a step in the given deployment.
We might add more dimensions, such as various DBMSes, AI platforms, model definition APIs (TensorFlow/Keras, PyTorch, XGBoost), and data readers (MaxCompute multi-tunnel reader, local filesystem reader, etc.).
We also need to take the model zoo design into consideration because it seems that at least one of the steps to execute the TO TRAIN statement needs to train a model defined in a zoo.
Users can choose different base images to match their model definitions, e.g. tf-2.0, tf-1.15, pytorch-1.0, etc. SQLFlow provides a list of base images to choose from and guarantees that each base image works on every execution engine, including local runs, ElasticDL, PAI, etc.
For the first step of all the statements, SELECT Data from source table into Temp table, the current behavior differs across DBMSes:
- MySQL/Hive: We don't create the temporary table.
- MaxCompute: We create the temporary table.
Question: can we unify the behavior and write the result of SELECT * FROM into a temporary storage (table) for all DBMSes?
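If we unify the behavior, the first step could always materialize the SELECT result with a CREATE TABLE ... AS SELECT (or an equivalent temporary-table statement) regardless of DBMS. A minimal sketch, with hypothetical table names and SQL:

```python
# Hypothetical unified first step: materialize the SELECT into a temp table.
source_sql = "SELECT * FROM iris.train WHERE class < 2"
temp_table = "sqlflow_tmp_20200101_abcdef"  # generated unique name (assumed format)

create_temp_sql = f"CREATE TABLE {temp_table} AS {source_sql}"
drop_temp_sql = f"DROP TABLE IF EXISTS {temp_table}"

print(create_temp_sql)
print(drop_temp_sql)
```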
> For the first step of all the statements, SELECT Data from source table into Temp table, the current behavior differs across DBMSes:
> - MySQL/Hive: We don't create the temporary table.
> - MaxCompute: We create the temporary table.

Addition: this does not depend on the DBMS, but on the AI engine.
> - Should we differentiate the step image by platform?
> - Should model / function images from the model zoo be platform agnostic?

For these two questions above, the discussion notes:
- Yes. Different step images for different deployment platforms (one image per platform): Vanilla Kubernetes, Alibaba internal, Google Cloud, and so on.
- No. Different model zoo services for different deployment platforms. For the same Keras model definition such as `DNNClassifier`, we will build different model images from different base images for different deployment platforms. For example, the base image for Vanilla Kubernetes contains elasticdl, and the base image for Alibaba internal contains goalisa.
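A small sketch of how the model zoo build service might pick a base image per deployment platform for the same model definition; all image names below are invented for illustration.

```python
# Invented base-image names; the real model zoo build service may differ.
BASE_IMAGES = {
    "vanilla_kubernetes": "modelzoo/base-tf2-elasticdl:latest",  # bundles elasticdl
    "alibaba_internal": "modelzoo/base-tf2-goalisa:latest",      # bundles goalisa
}


def model_image_tag(model_def, platform):
    """Derive the image tag to build for a model definition on a platform."""
    base = BASE_IMAGES[platform]
    return f"modelzoo/{model_def.lower()}-{platform}:latest", base


print(model_image_tag("DNNClassifier", "vanilla_kubernetes"))
```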
> We should clean up the temp table at the end of each job, so we need at least three steps for each job:
> - create a temp table
> - submit a job
> - clean up the temp table
>
> Another question: how do we ensure the clean-up step still runs if the job fails?

We can set a lifecycle on the temp table.
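For example, on MaxCompute a table can be given a lifecycle in days so that it is reclaimed automatically even if the clean-up step never runs. The SQL below uses MaxCompute's LIFECYCLE clause; the table name is hypothetical.

```python
# MaxCompute reclaims the table automatically after the lifecycle (in days)
# expires, so a missed clean-up step does not leak storage forever.
temp_table = "sqlflow_tmp_20200101_abcdef"  # hypothetical generated name

# Set the lifecycle when creating the temp table with CTAS:
create_sql = f"CREATE TABLE {temp_table} LIFECYCLE 1 AS SELECT * FROM iris.train"
# Or set it on an existing table:
alter_sql = f"ALTER TABLE {temp_table} SET LIFECYCLE 1"
```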