Better user experience for XShards of Pandas Dataframe input for Orca Estimators

Open shanyu-sys opened this issue 1 year ago • 2 comments

Problem

For XShards of Pandas Dataframe, it is more common for several columns together to serve as one model input. In the current feature_cols semantics, each column serves as one model input. It can be cumbersome for users to convert their original pandas dataframe feature columns into one column in which each cell contains a list or an ndarray.
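To illustrate the conversion users currently have to do by hand, here is a minimal sketch in plain pandas. The column names (`f1`, `f2`, `f3`, `label`, `features`) and the sample values are made up for illustration, and the exact cell type Orca expects may differ.

```python
import pandas as pd

# Hypothetical example data; column names and values are illustrative only.
df = pd.DataFrame({
    "f1": [1.0, 2.0],
    "f2": [3.0, 4.0],
    "f3": [5.0, 6.0],
    "label": [0, 1],
})
feature_names = ["f1", "f2", "f3"]

# Pack the separate feature columns into a single column whose cells are lists,
# so that the one packed column can act as the single model input.
packed = df[["label"]].copy()
packed["features"] = df[feature_names].to_numpy().tolist()
```

Each partition of the XShards would need this step applied before `feature_cols=["features"]` could be passed, which is the boilerplate this issue proposes to remove.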

Design (open to discussion)

We could use different meanings of feature_cols for Spark Dataframe and XShards of Pandas DataFrame.

  • For Spark Dataframe, each feature column should be one input to the model;
  • For XShards of Pandas Dataframe, each entry in feature_cols could be one feature, and we will internally concatenate the feature columns into one input before feeding it into the model. For example, if feature_cols = ["f1", "f2", "f3", "f4"], the model should expect one input with shape (batch_size, 4); if feature_cols = [["f1", "f2"], ["f3", "f4"]], the model should expect two inputs, each with shape (batch_size, 2).
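The proposed semantics for the XShards case could be sketched roughly as follows. This is not the Orca implementation: `build_model_inputs` is a hypothetical helper name, and it operates on a single pandas DataFrame (one partition) using only pandas and NumPy.

```python
import numpy as np
import pandas as pd


def build_model_inputs(df, feature_cols):
    """Hypothetical helper: turn feature_cols into model input array(s).

    A flat list of column names yields one array of shape (rows, n_cols).
    A nested list yields one array per group, each of shape (rows, group_size).
    """
    nested = any(isinstance(c, (list, tuple)) for c in feature_cols)
    if not nested:
        # All columns are concatenated into a single model input.
        return df[list(feature_cols)].to_numpy()
    inputs = []
    for group in feature_cols:
        # Allow a bare column name inside a nested spec as a group of one.
        cols = [group] if isinstance(group, str) else list(group)
        inputs.append(df[cols].to_numpy())
    return inputs


# Illustrative data with a batch of 3 rows.
df = pd.DataFrame(np.arange(12, dtype="float32").reshape(3, 4),
                  columns=["f1", "f2", "f3", "f4"])

single = build_model_inputs(df, ["f1", "f2", "f3", "f4"])
multi = build_model_inputs(df, [["f1", "f2"], ["f3", "f4"]])
```

Here `single` has shape (3, 4) and `multi` is a list of two arrays of shape (3, 2), matching the two cases described above with batch_size = 3.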

Related issues

#5060 https://github.com/intel-analytics/BigDL/issues/4965

shanyu-sys avatar Jul 13 '22 06:07 shanyu-sys

Is this issue related to https://github.com/intel-analytics/BigDL/issues/4448?

hkvision avatar Jul 13 '22 06:07 hkvision

See https://github.com/intel-analytics/BigDL/issues/4965#issuecomment-1184515330

jason-dai avatar Jul 14 '22 14:07 jason-dai