
Think about simplifying the user-written COLUMN clause with feature derivation.

brightcoder01 opened this issue • 1 comment

Problem

The SQLFlow COLUMN clause describes the process of transforming data from the source table into the model input. The source table may contain hundreds of columns or even more in scenarios such as search & recommendation. In such cases, the COLUMN clause becomes too complicated for users to write.

Goal

To improve the user experience, we want users to be able to write a simpler COLUMN clause, or even no COLUMN clause at all, while the training process still runs smoothly and produces a model with good performance.

Approach

For the simplified version of the COLUMN clause, the SQLFlow runtime will auto-complete the transform logic and build the full transform process through feature derivation. In the following example, the user writes the first, simplified version, and SQLFlow internally builds transform logic equivalent to the second one (it does not auto-complete the SQL statement in the IDE).

-- Simplified version
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
COLUMN num_1, num_2, Embedding(cat_1, 8), Embedding(cat_2, 8), Embedding(cat_3, 8)
INTO my_model

-- Auto complete version
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
COLUMN num_1, num_2, Embedding(HASH(cat_1), 8), Embedding(VOCABULARIZE(cat_2), 8), Embedding(IDENTITY(cat_3), 8)
INTO my_model
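
For intuition, the auto-completed clause above corresponds roughly to TensorFlow feature columns like the following. This is a minimal sketch, not SQLFlow's actual generated code; the hash bucket size, vocabulary, and id range are hypothetical placeholders that feature derivation would fill in from data analysis.

# Sketch of the auto-completed COLUMN clause as TF feature columns.
import tensorflow as tf

# Embedding(HASH(cat_1), 8): hash string values into a fixed bucket space.
cat_1 = tf.feature_column.categorical_column_with_hash_bucket(
    "cat_1", hash_bucket_size=10000)  # bucket size: placeholder
# Embedding(VOCABULARIZE(cat_2), 8): map values through a derived vocabulary.
cat_2 = tf.feature_column.categorical_column_with_vocabulary_list(
    "cat_2", vocabulary_list=["a", "b", "c"])  # vocabulary: placeholder
# Embedding(IDENTITY(cat_3), 8): the column already stores integer ids.
cat_3 = tf.feature_column.categorical_column_with_identity(
    "cat_3", num_buckets=100)  # id range: placeholder

feature_columns = [
    tf.feature_column.numeric_column("num_1"),
    tf.feature_column.numeric_column("num_2"),
    tf.feature_column.embedding_column(cat_1, dimension=8),
    tf.feature_column.embedding_column(cat_2, dimension=8),
    tf.feature_column.embedding_column(cat_3, dimension=8),
]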

When there is no COLUMN clause, SQLFlow will build the transform logic from scratch according to the metadata of both the source data and the model; we can call this automated feature engineering.

-- Statement without column clause
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
INTO my_model

-- Automated Feature Engineering
SELECT num_1, num_2, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN DNNClassifier
COLUMN STANDARDIZE(num_1), STANDARDIZE(num_2), EMBEDDING(CONCAT(HASH(cat_1), VOCABULARIZE(cat_2), IDENTITY(cat_3)), 8)
INTO my_model

Get column properties from the source table (table schema and data)

  • Column data type in DB storage: read directly from the table schema.
  • Column serialization format: CSV, key-value, or others.
  • Column data type in memory: a one-to-one mapping from the table schema, except that a column of string type in the source table may be parsed into numbers according to the serialization-format information.
  • Column is numerical or categorical: if the column type is float, it is treated as numerical; if string, categorical; if integer, it is treated as categorical when the distinct count is small, and as numerical otherwise (see the sketch below).
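
The last rule can be sketched as a small decision function. The dtype names and the integer threshold below are illustrative, not SQLFlow's actual API:

# Sketch of the numerical-vs-categorical decision rule.
def is_categorical(dtype, distinct_count, max_int_categories=256):
    if dtype == "float":
        return False  # float columns are always numerical
    if dtype == "string":
        return True   # string columns are always categorical
    if dtype == "int":
        # Integer columns with few unique values are treated as categorical.
        return distinct_count <= max_int_categories
    raise ValueError("unsupported column dtype: %s" % dtype)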

Derive the complete data transform method based on the column properties above

Numerical Column
  • DNN Model (TensorFlow/PyTorch): STANDARDIZE / NORMALIZE, or BUCKETIZE -> EMBEDDING
  • Tree Model (XGBoost): no transform; feed into the model directly
Categorical Column
  • DNN Model (TensorFlow/PyTorch): HASH / VOCABULARIZE / IDENTITY -> EMBEDDING. Calculate the distinct count of the column: if it exceeds a threshold (such as 10000 or sqrt(total_row_num); the parameter is tunable), we choose HASH; otherwise, VOCABULARIZE (see the sketch after this list). Q: When do we derive to use IDENTITY?

  • Tree Model (XGBoost): HASH / VOCABULARIZE / IDENTITY -> ONEHOT
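
Putting these rules together, the derivation step might look like the following sketch; ColumnStats, derive_transform, and the default threshold are hypothetical names used only for illustration:

# Sketch of deriving a transform chain for one column.
from dataclasses import dataclass

@dataclass
class ColumnStats:
    name: str
    is_categorical: bool
    distinct_count: int
    total_rows: int

def derive_transform(col, model_type, hash_threshold=10000, dim=8):
    # Numerical column: standardize for DNN models, feed raw values to trees.
    if not col.is_categorical:
        return "STANDARDIZE(%s)" % col.name if model_type == "dnn" else col.name
    # Categorical column: pick the id mapping by cardinality. The threshold is
    # tunable, e.g. 10000 or sqrt(col.total_rows).
    mapping = "HASH" if col.distinct_count > hash_threshold else "VOCABULARIZE"
    inner = "%s(%s)" % (mapping, col.name)
    # DNN models embed the ids; tree models one-hot encode them.
    if model_type == "dnn":
        return "EMBEDDING(%s, %d)" % (inner, dim)
    return "ONEHOT(%s)" % inner

For example, derive_transform(ColumnStats("cat_1", True, 50000, 1000000), "dnn") yields EMBEDDING(HASH(cat_1), 8), matching the auto-complete example above.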

Derive the data transform parameters from Data Analysis

More details can be found in the design doc.
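
As a rough illustration of that analysis step, the transform parameters could be computed with plain SQL aggregations over the source table; analyze and the DB-API connection conn below are hypothetical:

# Sketch of deriving transform parameters via SQL aggregation.
def analyze(conn, table, numeric_cols, categorical_cols):
    stats, cur = {}, conn.cursor()
    for c in numeric_cols:
        # Mean and standard deviation parameterize STANDARDIZE / NORMALIZE.
        cur.execute("SELECT AVG({0}), AVG({0} * {0}) FROM {1}".format(c, table))
        mean, mean_sq = cur.fetchone()
        stats[c] = {"mean": mean, "std": (mean_sq - mean * mean) ** 0.5}
    for c in categorical_cols:
        # The distinct count decides HASH vs. VOCABULARIZE and sizes the
        # hash bucket space or the vocabulary.
        cur.execute("SELECT COUNT(DISTINCT {0}) FROM {1}".format(c, table))
        stats[c] = {"distinct_count": cur.fetchone()[0]}
    return stats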

brightcoder01 • Jul 23 '20

Do we have to apply AutoML together with automatic feature derivation? E.g., if we add STANDARDIZE to all numerical columns, how does that affect model performance?

typhoonzero • Jul 27 '20