Think about simplifying the user-written COLUMN clause with feature derivation.
Problem
The SQLFlow COLUMN clause describes how data from the source table is transformed into model input. The source table may contain hundreds of columns or more in scenarios such as Search & Recommendation. In that case, the COLUMN clause becomes too complicated for users to write.
Goal
To improve the user experience, we want to let users write a simpler COLUMN clause, or even omit it entirely, while the training process still runs smoothly and produces a model with good performance.
Approach
For a simplified COLUMN clause, the SQLFlow runtime will auto-complete the transform logic and build the full transform process through feature derivation. In the following example, the user provides the first, simplified version; SQLFlow internally builds transform logic equivalent to the second version (it does not auto-complete the SQL statement in the IDE).
```sql
-- Simplified version
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
COLUMN num_1, num_2, Embedding(cat_1, 8), Embedding(cat_2, 8), Embedding(cat_3, 8)
INTO my_model
```

```sql
-- Auto-completed version
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
COLUMN num_1, num_2, Embedding(HASH(cat_1), 8), Embedding(VOCABULARIZE(cat_2), 8), Embedding(IDENTITY(cat_3), 8)
INTO my_model
```
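To make the derivation concrete, here is a hedged sketch of how the auto-completed clause above might map onto TensorFlow's feature_column API for the TensorFlow backend; the bucket sizes, vocabulary, and num_buckets are placeholder values that the data analysis step (discussed below) would supply, not SQLFlow's actual code generation:

```python
import tensorflow as tf

# Hypothetical mapping of the auto-completed COLUMN clause onto
# tf.feature_column. All sizes and the vocabulary are placeholders.
feature_columns = [
    tf.feature_column.numeric_column("num_1"),
    tf.feature_column.numeric_column("num_2"),
    # Embedding(HASH(cat_1), 8): hash high-cardinality strings into buckets.
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            "cat_1", hash_bucket_size=10000),
        dimension=8),
    # Embedding(VOCABULARIZE(cat_2), 8): look up against a collected vocabulary.
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "cat_2", vocabulary_list=["a", "b", "c"]),
        dimension=8),
    # Embedding(IDENTITY(cat_3), 8): integer IDs already in [0, num_buckets).
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_identity(
            "cat_3", num_buckets=100),
        dimension=8),
]
```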
When there is no COLUMN clause, SQLFlow will build the transform logic from scratch according to the metadata of both the source data and the model; we can call this automated feature engineering.
```sql
-- Statement without a COLUMN clause
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
INTO my_model
```

```sql
-- Automated feature engineering
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
COLUMN STANDARDIZE(num_1), STANDARDIZE(num_2), EMBEDDING(CONCAT(HASH(cat_1), VOCABULARIZE(cat_2), IDENTITY(cat_3)), 8)
INTO my_model
```
Get column properties from the source table (table schema and data)
- Column data type in DB storage: we can get it directly from the table schema.
- Column serialization format: CSV, key-value (kv), or others.
- Column data type in memory: a one-to-one mapping from the table schema, except that a string column in the source table may be parsed into numbers according to the serialization format.
- Numerical or categorical: a float column is treated as numerical; a string column as categorical; an integer column is treated as categorical when its distinct count is small, otherwise as numerical (a sketch follows this list).
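As a concrete illustration, the numerical/categorical decision might look like the sketch below. The text leaves the threshold for "few" unspecified, so sqrt(total_rows) is an assumed placeholder, and the type names are illustrative:

```python
import math

def infer_numerical_or_categorical(db_type, distinct_count, total_rows):
    """Classify a column from its schema type and cardinality."""
    if db_type in ("FLOAT", "DOUBLE", "DECIMAL"):
        return "numerical"
    if db_type in ("VARCHAR", "CHAR", "TEXT"):
        return "categorical"
    if db_type in ("TINYINT", "INT", "BIGINT"):
        # Integer columns with few distinct values look like IDs or enums,
        # so treat them as categorical; otherwise as numerical.
        few = math.sqrt(total_rows)  # assumed, tunable threshold
        return "categorical" if distinct_count <= few else "numerical"
    raise ValueError(f"unsupported column type: {db_type}")
```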
Derive the complete data transform method based on the column properties above
Numerical Column
- DNN Model (TensorFlow/PyTorch): STANDARDIZE / NORMALIZE, or BUCKETIZE -> EMBEDDING
- Tree Model (XGBoost): no transform; feed into the model directly
Categorical Column
- DNN Model (TensorFlow/PyTorch): HASH / VOCABULARIZE / IDENTITY -> EMBEDDING. Calculate the distinct count of the column; if count_result > threshold (such as 10000 or sqrt(total_row_num), the parameter is tunable), we choose HASH, otherwise VOCABULARIZE (see the sketch after this list). Q: when do we derive to use IDENTITY?
- Tree Model (XGBoost): HASH / VOCABULARIZE / IDENTITY -> ONEHOT
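A minimal sketch of the numerical and categorical rules above; the function and parameter names are illustrative, not SQLFlow's actual API, and combining the two suggested thresholds with max() is an assumption:

```python
import math

def derive_numerical_transform(column, model_type):
    # Tree models consume raw numerical values directly.
    if model_type == "xgboost":
        return column
    # DNN models: scale the value (BUCKETIZE -> EMBEDDING is an alternative).
    return f"STANDARDIZE({column})"

def derive_categorical_transform(column, distinct_count, total_rows, model_type):
    # Tunable threshold; 10000 or sqrt(total_row_num) are the suggested defaults.
    threshold = max(10000, math.sqrt(total_rows))
    source = "HASH" if distinct_count > threshold else "VOCABULARIZE"
    # Open question from above: when to derive IDENTITY instead.
    sink = "EMBEDDING" if model_type == "dnn" else "ONEHOT"
    return f"{sink}({source}({column}))"
```

For example, `derive_categorical_transform("cat_1", 50000, 1000000, "dnn")` yields `EMBEDDING(HASH(cat_1))`.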
Derive the data transform parameters from Data Analysis
More details can be found in the design doc.
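For instance, STANDARDIZE needs the column's mean and standard deviation, and the HASH / VOCABULARIZE decision needs the distinct count; both can be gathered with plain SQL aggregation before training. A minimal sketch, assuming a DB-API cursor and standard aggregate functions (the function names are illustrative):

```python
def analyze_numerical(cursor, table, column):
    """Collect the statistics that parameterize STANDARDIZE(column)."""
    cursor.execute(f"SELECT AVG({column}), STDDEV({column}) FROM {table}")
    mean, stddev = cursor.fetchone()
    return {"mean": mean, "stddev": stddev}

def analyze_categorical(cursor, table, column):
    """Collect the statistics that drive the HASH / VOCABULARIZE decision."""
    cursor.execute(f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}")
    distinct_count, total_rows = cursor.fetchone()
    return {"distinct_count": distinct_count, "total_rows": total_rows}
```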
Do we have to apply AutoML together with automatic feature derivation? For example, if we add STANDARDIZE to all numerical columns, how does that affect model performance?