ipex-llm
SparkXShards to_spark_df only processes the first element of each partition
If a partition of an XShards holds more than one pandas DataFrame, the to_spark_df
function processes only the first one and silently drops the rest.
Relevant code: https://github.com/intel-analytics/BigDL/blob/main/python/orca/src/bigdl/orca/data/shard.py#L583
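The failure mode described above can be illustrated without Spark or pandas. The following is only a simulation under stated assumptions: each "frame" is a plain list of row dicts standing in for a pandas DataFrame, a partition is a list of frames, and the helper names `first_element_only` and `all_elements` are hypothetical, not taken from the Orca source.

```python
# Stand-in simulation: no Spark or pandas involved.
# A "frame" is a list of row dicts; a partition is a list of frames.

def first_element_only(partition):
    """Mimic the reported behavior: consume only the first element."""
    it = iter(partition)
    first = next(it, None)
    return [] if first is None else list(first)


def all_elements(partition):
    """Consume every element of the partition."""
    return [row for frame in partition for row in frame]


partition = [
    [{"id": 1}, {"id": 2}],  # first frame
    [{"id": 3}],             # second frame in the same partition
]

rows_first_only = first_element_only(partition)
rows_all = all_elements(partition)

# The row from the second frame is lost without any error being raised.
print(len(rows_first_only), len(rows_all))  # prints "2 3"
```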
For a SparkXShards of pandas DataFrames, by design there is only one pandas DataFrame in each partition.
Is this accurate? Is it possible to have union-type operations?
If this is accurate, we still need to check for it and report an error.
I think users can currently create a SparkXShards of pandas DataFrames in two ways:
- Use the Orca API read_csv, which internally creates an RDD of pandas DataFrames with one pandas DataFrame per partition.
- Call RDD mapPartitions operations to create a SparkXShards of pandas DataFrames; this does not seem common?
The user can always call XShards operations (e.g., transform_shard) to create a new XShards.
Yes, users can do that. I think in that case, if users want to create multiple pandas DataFrames in one partition, those DataFrames probably have different schemas; otherwise, why keep multiple DataFrames with the same schema rather than one?
We should either support that or report an error; a silent failure is a bad user experience.
Sure, I will add a check and report an error.
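One possible shape for such a check, sketched in plain Python. This is not the actual fix: `single_element` and its `partition_index` parameter are hypothetical names, and the partition is modeled as any iterable rather than a real Spark partition iterator.

```python
_MISSING = object()  # sentinel, so that None elements are still counted


def single_element(partition, partition_index=0):
    """Return the only element of a partition, or raise if there are several.

    Hypothetical helper for illustration; not part of the Orca API.
    """
    it = iter(partition)
    first = next(it, _MISSING)
    if first is _MISSING:
        return None  # empty partition: nothing to convert
    if next(it, _MISSING) is not _MISSING:
        raise ValueError(
            f"Partition {partition_index} holds more than one element; "
            "expected exactly one pandas DataFrame per partition."
        )
    return first
```

With a check like this, a partition holding two elements raises a ValueError instead of silently losing the second one, while single-element and empty partitions pass through unchanged.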