classification model tutorial

Open yexinyinancy opened this issue 2 years ago • 7 comments

yexinyinancy avatar Jun 30 '22 09:06 yexinyinancy

XShards currently does not support: 1) shuffling a dataframe, 2) astype (changing data types), 3) train_test_split, 4) duplicating the whole dataframe according to one column.

yexinyinancy avatar Jun 30 '22 13:06 yexinyinancy

I think for train_test_split, astype, and duplicate we can use the transform_shard API provided by XShards? Also, I am wondering why we need to shuffle the dataframe; I think our optimizer will shuffle the data during training.

dding3 avatar Jun 30 '22 17:06 dding3

duplicate is not supported now. shuffle is indeed unnecessary. train_test_split and astype can be performed via transform_shard.
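To make the transform_shard idea concrete, here is a minimal sketch of per-shard functions for astype and train_test_split. It is an illustration under assumptions, not the tutorial's code: the shards are simulated with a plain list of pandas DataFrames, `cast_columns` and `split_shard` are hypothetical helper names, and the random split uses `DataFrame.sample` rather than any existing XShards feature. With a real XShards of pandas DataFrames, something like `shards.transform_shard(cast_columns, {"age": "int64"})` would presumably apply the same function to every shard.

```python
import pandas as pd


def cast_columns(df, dtypes):
    """Return a copy of one shard with the given columns cast to new dtypes."""
    return df.astype(dtypes)


def split_shard(df, test_size=0.25, seed=0):
    """Randomly split one shard into (train, test) DataFrames.

    Assumes the shard has a unique index, so dropping the sampled
    index labels removes exactly the test rows.
    """
    test = df.sample(frac=test_size, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Stand-in for an XShards: each list element plays the role of one shard.
shards = [
    pd.DataFrame({"age": [21.0, 35.0, 48.0, 52.0], "label": [0, 1, 0, 1]}),
    pd.DataFrame({"age": [19.0, 63.0, 27.0, 44.0], "label": [1, 1, 0, 0]}),
]

# Apply each function shard by shard, as transform_shard would.
casted = [cast_columns(df, {"age": "int64"}) for df in shards]
splits = [split_shard(df, test_size=0.25) for df in casted]
train_parts = [train for train, _ in splits]
test_parts = [test for _, test in splits]
```

Note that splitting each shard independently gives a per-shard split rather than a single global one, which is usually acceptable when the rows are distributed across shards without any ordering.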

yexinyinancy avatar Jul 01 '22 05:07 yexinyinancy

But is it reasonable for users to have to call astype after using MinMaxScaler or LabelEncoder? It would be better to hide the MLlib types and have the outputs use common basic types?

hkvision avatar Jul 01 '22 05:07 hkvision

We should not expose implementation details (e.g., MLlib types) to the user.

jason-dai avatar Jul 01 '22 22:07 jason-dai

Updated the code to change the MLlib vector type to array. If the goal is to change the array type to a PyTorch tensor type, I think we may need to use the transform_shard API. astype is more for changing one basic type to another basic type (e.g., double to int)?
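The distinction between the two kinds of conversion can be sketched as follows. This is only an illustration with made-up column names (`count`, `features`) and a hypothetical helper `features_to_array`. astype handles the basic-type cast, while an array-valued column needs a custom per-shard function. The sketch stops at a NumPy array; wrapping it with `torch.from_numpy` would be the natural last step if PyTorch is available.

```python
import numpy as np
import pandas as pd

# One shard whose "features" column holds a plain list per row,
# standing in for the array output of a scaler.
shard = pd.DataFrame({
    "count": [1.0, 2.0, 3.0],
    "features": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# Basic type to basic type (double to int): astype is enough.
shard = shard.astype({"count": "int64"})


def features_to_array(df, col="features"):
    """Stack an array-valued column into one 2-D float32 NumPy array."""
    return np.array(df[col].tolist(), dtype="float32")


# Array column to a dense array: needs a custom function, which is the
# kind of function one would pass to transform_shard.
features = features_to_array(shard)
```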

dding3 avatar Jul 02 '22 00:07 dding3

I think the Estimator would help convert the data to torch tensors; can you check it? @yexinyinancy

hkvision avatar Jul 04 '22 01:07 hkvision