SparkXShards to_spark_df only processes the first element in each partition

Open cyita opened this issue 2 years ago • 7 comments

If a partition of an XShards contains more than one pandas DataFrame, the to_spark_df function processes only the first one. Relevant code: https://github.com/intel-analytics/BigDL/blob/main/python/orca/src/bigdl/orca/data/shard.py#L583
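
To make the failure mode concrete, here is a minimal sketch that simulates a partition as a plain Python iterator. It is not the BigDL code at the link above: the two converter functions are hypothetical, and a list of row dicts stands in for a pandas DataFrame so the example runs without Spark or pandas.

```python
from typing import Dict, Iterator, List

# Stand-in for a pandas DataFrame: a list of row dicts (assumption for illustration).
Frame = List[Dict[str, int]]


def rows_from_first_frame(partition: Iterator[Frame]) -> Iterator[Dict[str, int]]:
    """Hypothetical converter that reads only the first frame of a partition."""
    first = next(partition, None)
    if first is not None:
        yield from first


def rows_from_all_frames(partition: Iterator[Frame]) -> Iterator[Dict[str, int]]:
    """Hypothetical converter that reads every frame of a partition."""
    for frame in partition:
        yield from frame


# One partition holding two frames, e.g. after a custom per-partition transformation.
partition = [[{"x": 1}, {"x": 2}], [{"x": 3}]]

rows_first = list(rows_from_first_frame(iter(partition)))
rows_all = list(rows_from_all_frames(iter(partition)))

print(len(rows_first), len(rows_all))  # prints "2 3"
```

The row from the second frame is dropped without any warning, which is the silent data loss described in this issue.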

cyita, Sep 22 '22 07:09

For a SparkXShards of pandas DataFrames, there is by design only one pandas DataFrame per partition.

dding3, Sep 22 '22 16:09

Is this accurate? Is it possible to have union-type operations?

If this is accurate, we still need to check and report an error.

jason-dai, Sep 22 '22 23:09

I think users can currently create a SparkXShards of pandas DataFrames in two ways:

  1. Use the Orca API read_csv, which internally creates an RDD of pandas DataFrames with one pandas DataFrame per partition.
  2. Call RDD mapPartitions operations to create a SparkXShards of pandas DataFrames; this seems uncommon?

dding3, Sep 23 '22 00:09

Users can always call XShards operations (e.g., transform_shard) to create a new XShards.

jason-dai, Sep 23 '22 00:09

Yes, users can do that. I think in that case, if users want to create multiple pandas DataFrames in one partition, these DataFrames probably have different schemas; otherwise, why keep multiple DataFrames with the same schema rather than one?

dding3, Sep 23 '22 02:09

We should either support that or report an error; silent failure is a bad user experience.

jason-dai, Sep 23 '22 02:09

Sure, I will add a check and report an error.

dding3, Sep 23 '22 02:09
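
The check agreed on at the end of the thread could look roughly like the sketch below. This is only an illustration under stated assumptions, not the fix that was merged into BigDL: the function name rows_from_single_frame, the partition_index argument, and the error message are all hypothetical, and a list of row dicts again stands in for a pandas DataFrame.

```python
from typing import Dict, Iterator, List

# Stand-in for a pandas DataFrame: a list of row dicts (assumption for illustration).
Frame = List[Dict[str, int]]

# Sentinel that distinguishes "iterator exhausted" from any real element.
_MISSING = object()


def rows_from_single_frame(
    partition: Iterator[Frame], partition_index: int
) -> Iterator[Dict[str, int]]:
    """Hypothetical converter that fails loudly if a partition has several frames."""
    first = next(partition, _MISSING)
    if first is _MISSING:
        return  # Empty partition: nothing to convert.
    if next(partition, _MISSING) is not _MISSING:
        raise ValueError(
            f"Partition {partition_index} contains more than one DataFrame; "
            "expected exactly one per partition."
        )
    yield from first


# A partition with one frame converts normally.
ok_rows = list(rows_from_single_frame(iter([[{"x": 1}, {"x": 2}]]), 0))

# A partition with two frames raises instead of silently dropping data.
try:
    list(rows_from_single_frame(iter([[{"x": 1}], [{"x": 3}]]), 1))
    raised = False
except ValueError:
    raised = True
```

Pulling at most two elements from the iterator keeps the check cheap, since the partition never has to be fully materialized just to count its frames.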