ipex-llm SparkXShards to_spark_df only process the first element in one partition

SparkXShards to_spark_df only process the first element in one partition

Open cyita opened this issue 2 years ago • 7 comments

If a partition of an XShards has more than one pandas dataframe, the to_spark_df function will only process the first pdf. Error code: https://github.com/intel-analytics/BigDL/blob/main/python/orca/src/bigdl/orca/data/shard.py#L583

Sep 22 '22 07:09 cyita

For SparkXShards of pandas dataframe, it's by design there is only one pandas dataframe for each partition.

Sep 22 '22 16:09 dding3

Is this accurate? Is it possible to have union type operations?

If this is accurate, we still need to check and report error.

Sep 22 '22 23:09 jason-dai

I think currently user can create sparkxshards of pandas dataframe in 2 ways:

use orca api read_csv which will internally create rdd of pandas df and it will create 1 pandas df per partition
user call rdd mappartion operations to create sparkxshards of pandas dataframe, it seems not common?

Sep 23 '22 00:09 dding3

I think currently user can create sparkxshards of pandas dataframe in 2 ways:

use orca api read_csv which will internally create rdd of pandas df and it will create 1 pandas df per partition

user call rdd mappartion operations to create sparkxshards of pandas dataframe, it seems not common?

The user can always call XShards operations (e.g., transform_shard) to create a new XShards

Sep 23 '22 00:09 jason-dai

Yes, user can do that. I think in that case, if users want to create multiple pandas dataframes with one partition, these dataframes may have different schema, otherwise why mutiple dataframes with same schema other than one.

Sep 23 '22 02:09 dding3

Yes, user can do that. I think in that case, if users want to create multiple pandas dataframes with one partition, these dataframes may have different schema, otherwise why mutiple dataframes with same schema other than one.

We should either support that, or report error; silent failure is bad user experience.

Sep 23 '22 02:09 jason-dai

Sure, will add check and report error.

Sep 23 '22 02:09 dding3

ipex-llm ipex-llm copied to clipboard

SparkXShards to_spark_df only process the first element in one partition

ipex-llm
ipex-llm copied to clipboard