Some problems in model development
Hi, I ran into some issues while developing SQLFlow models:
- Analysts usually use a Pandas DataFrame to manipulate data and feed it into a Keras model, which is convenient for debugging. However, SQLFlow's tf-codegen uses a TensorFlow dataset, which incurs an additional learning cost.
- Connecting with SQLFlow is troublesome. For models defined under SQLFlow models, if you want to debug locally, you need to implement a `train.py` yourself, including reading data, defining feature columns, etc. However, the `train.py` written locally and the `train.py` generated by SQLFlow do not always behave consistently.
- An analysis task usually consists of feature engineering -> data preprocessing -> model training (prediction). At present, the model zoo only covers the last step, but sharing a model in practice requires sharing the entire data-processing chain. I hope SQLFlow will also support custom data preprocessing and include it in the design of the model zoo.
@Echo9573 Regarding the second point, I have a PR about the Couler function: https://github.com/sql-machine-learning/sqlflow/pull/1208/files , which introduces a way to run a custom model on the host via the SQLFlow submitter Docker image. Please take a look.
Hi @Echo9573, thanks for submitting this issue.
- Converting a Pandas data frame to a TensorFlow dataset can be done in one function, so the conversion should be straightforward (see the sketch after this list). Also, I am wondering why it is hard to debug using the dataset: in both data frame and dataset mode, we can set a breakpoint in the `mymodel.call` function to debug.
- I really appreciate your consideration in integrating newly contributed models into SQLFlow. However, I am hesitant to add SQLFlow-specific logic to this repo, because the model definition (this repo) and the runtime engine (SQLFlow, EDL, etc.) should be decoupled. In terms of testing the models, may I ask what kind of models you are trying to contribute?
  - If the models are like DNNClassifier, which contains only the two functions `__init__` and `call`, I don't think we need to write a standalone `train.py` to test them. You can test them using `tests/base.py`, as `tests/test_dnnclassifier.py` does (a model sketch follows this list).
  - If the models are like DeepEmbeddingClusterModel, which wraps special training logic, then we can work together to figure out a standard training API that both the `models` repo and the `sqlflow` repo should follow. We will materialize the standard into a base testing class like `tests/base.py`. If all model tests derived from that base class pass, SQLFlow should guarantee that its `train.py` will also pass.

  To sum up, models should be tested using `tests/base.py`; there is no need to write a `train.py` to test models. SQLFlow's `train.py` is developed based on `tests/base.py`, and SQLFlow should be responsible for integrating the models.
- I totally agree that sharing a model only works if the feature engineering is shared along with it. Is it possible to represent the feature engineering in several SQL statements and share them along with the `select ... to train some_model_from_model_zoo ...` statement?
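
On the first point, here is a minimal sketch of the one-function conversion, assuming a hypothetical toy data frame (the column names `sepal_length`, `sepal_width`, and `label` are placeholders, not anything from this repo):

```python
import pandas as pd
import tensorflow as tf

# Hypothetical toy data frame standing in for the analyst's usual workflow.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
    "label": [0, 0, 1, 1],
})

labels = df.pop("label")
# The one-function conversion: each row becomes a (features, label) pair,
# where features is a {column_name: tensor} dict.
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))
dataset = dataset.shuffle(buffer_size=len(labels)).batch(2)

# Debugging works the same way in both modes: iterate the dataset eagerly
# and set a breakpoint (or a print) inside mymodel.call.
for features, label in dataset:
    print(features["sepal_length"], label)
```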
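On the second point, a sketch of what a DNNClassifier-style model with only `__init__` and `call` might look like; the class name `MyDNNClassifier`, the layer sizes, and the fit-on-toy-data flow at the bottom are illustrative assumptions, not the actual contents of this repo or of `tests/base.py`:

```python
import tensorflow as tf

class MyDNNClassifier(tf.keras.Model):
    """Hypothetical model in the DNNClassifier style: only __init__ and call."""

    def __init__(self, feature_columns, hidden_units=(16, 8), n_classes=3):
        super().__init__()
        # DenseFeatures turns the {column_name: tensor} dict yielded by the
        # dataset into a single dense tensor.
        self.feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
        self.hidden = [tf.keras.layers.Dense(units, activation="relu")
                       for units in hidden_units]
        self.out = tf.keras.layers.Dense(n_classes, activation="softmax")

    def call(self, inputs):
        x = self.feature_layer(inputs)
        for layer in self.hidden:
            x = layer(x)
        return self.out(x)

# Roughly what a base testing class could do: compile and fit on a toy
# dataset, with no standalone train.py involved.
columns = [tf.feature_column.numeric_column(name)
           for name in ("sepal_length", "sepal_width")]
model = MyDNNClassifier(columns)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(dataset, epochs=1)  # `dataset` from the previous sketch
```

Since all of the training logic lives in Keras itself for such models, a base testing class only needs to build, compile, and fit them, which is roughly why no standalone `train.py` should be needed.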
@Echo9573 Doing data pre-processing using SQL is currently in the design phase. Please take a look at https://github.com/sql-machine-learning/elasticdl/pull/1477; we would appreciate your comments and advice.