Some problems in model development
Hi, I ran into some issues while developing SQLFlow models:
- Analysts usually use a Pandas DataFrame to manipulate data and feed it into a Keras model, which is convenient for debugging. However, SQLFlow's tf-codegen uses a TensorFlow dataset, which incurs an additional learning cost.
- Connecting with SQLFlow is troublesome. For models defined under SQLFlow models, if you want to debug locally, you need to implement a `train.py` yourself, including reading data, defining feature columns, etc. However, the `train.py` written locally and the `train.py` generated by SQLFlow do not always behave consistently.
- An analysis task usually consists of feature engineering -> data preprocessing -> model training (prediction). At present, the model zoo only covers the last step, but sharing a model in practice requires sharing the entire data-processing chain. I hope SQLFlow will also support custom data preprocessing and include it in the design of the model zoo.
@Echo9573 Regarding the second point, I have a PR about the Couler function: https://github.com/sql-machine-learning/sqlflow/pull/1208/files , which introduces a way to run a custom model on the host via the SQLFlow submitter Docker image. Please take a look.
Hi @Echo9573, thanks for submitting this issue.
- Converting a Pandas data frame to a TensorFlow dataset can be done in one function, so the conversion should be straightforward (see the sketch after this list). Also, I am wondering why it is hard to debug using the dataset: in both data frame and dataset mode, we can set a breakpoint in the `mymodel.call` function to debug.
- I really appreciate your consideration in integrating newly contributed models into SQLFlow. However, I am hesitant to add SQLFlow-specific logic to this repo, because the model definition (this repo) and the runtime engine (SQLFlow, EDL, etc.) should be decoupled. In terms of testing the models, may I ask what kind of models you are trying to contribute?
  - If the models are like DNNClassifier, which contains only the two functions `__init__` and `call`, I don't think we need to write a standalone `train.py` to test them. You can test them using `tests/base.py`, as `tests/test_dnnclassifier.py` does (a model sketch follows this list).
  - If the models are like DeepEmbeddingClusterModel, which wraps special training logic, then we can work together to figure out a standard training API that both the `models` repo and the `sqlflow` repo should follow. We will materialize the standard into a base testing class like `tests/base.py`. If all model tests derived from that base class pass, SQLFlow should guarantee that its `train.py` will also pass.

  To sum up, models should be tested using `tests/base.py`; there is no need to write a `train.py` to test models. SQLFlow's `train.py` is developed based on `tests/base.py`, and SQLFlow should be responsible for integrating the models.
- I totally agree that sharing a model only works if the feature engineering is shared along with it. Is it possible to represent the feature engineering in several SQL statements and share them along with the `select ... to train some_model_from_model_zoo ...` statement?
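
On the first point, here is a minimal sketch of the one-function conversion, assuming a hypothetical toy data frame (the column names `sepal_length`, `sepal_width`, and `label` are placeholders, not anything from this repo):

```python
import pandas as pd
import tensorflow as tf

# Hypothetical toy data frame standing in for the analyst's usual workflow.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
    "label": [0, 0, 1, 1],
})

labels = df.pop("label")
# The one-function conversion: each row becomes a (features, label) pair,
# where features is a {column_name: tensor} dict.
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))
dataset = dataset.shuffle(buffer_size=len(labels)).batch(2)

# Debugging works the same way in both modes: iterate the dataset eagerly
# and set a breakpoint (or a print) inside mymodel.call.
for features, label in dataset:
    print(features["sepal_length"], label)
```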
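On the second point, a sketch of what a DNNClassifier-style model with only `__init__` and `call` might look like; the class name `MyDNNClassifier`, the layer sizes, and the fit-on-toy-data flow at the bottom are illustrative assumptions, not the actual contents of this repo or of `tests/base.py`:

```python
import tensorflow as tf

class MyDNNClassifier(tf.keras.Model):
    """Hypothetical model in the DNNClassifier style: only __init__ and call."""

    def __init__(self, feature_columns, hidden_units=(16, 8), n_classes=3):
        super().__init__()
        # DenseFeatures turns the {column_name: tensor} dict yielded by the
        # dataset into a single dense tensor.
        self.feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
        self.hidden = [tf.keras.layers.Dense(units, activation="relu")
                       for units in hidden_units]
        self.out = tf.keras.layers.Dense(n_classes, activation="softmax")

    def call(self, inputs):
        x = self.feature_layer(inputs)
        for layer in self.hidden:
            x = layer(x)
        return self.out(x)

# Roughly what a base testing class could do: compile and fit on a toy
# dataset, with no standalone train.py involved.
columns = [tf.feature_column.numeric_column(name)
           for name in ("sepal_length", "sepal_width")]
model = MyDNNClassifier(columns)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(dataset, epochs=1)  # `dataset` from the previous sketch
```

Since all of the training logic lives in Keras itself for such models, a base testing class only needs to build, compile, and fit them, which is roughly why no standalone `train.py` should be needed.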
@Echo9573 Doing data pre-processing using SQL is currently in the design phase. Please take a look at https://github.com/sql-machine-learning/elasticdl/pull/1477; we would appreciate your comments and advice.