[SPARK-39822][PYTHON][PS] Provide a good feedback to users
Provide a graceful error message to users when they build an Index with mixed dtypes.
What changes were proposed in this pull request?
Raise a graceful error when users create an Index with mixed dtypes.
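As a rough illustration of the idea (this is a minimal sketch, not the actual PR diff; the helper name and message wording are hypothetical), an up-front check on element types could replace the raw `pyarrow.lib.ArrowInvalid` error shown below:

```python
# Hypothetical sketch of an up-front validation step that pandas-on-Spark
# could run before handing data to PyArrow, so users see a clear TypeError
# instead of a low-level ArrowInvalid from deep inside the conversion path.
def check_uniform_dtype(values):
    """Raise a descriptive TypeError if elements have mixed types."""
    types = {type(v) for v in values if v is not None}
    if len(types) > 1:
        names = ", ".join(sorted(t.__name__ for t in types))
        raise TypeError(
            "Cannot create an Index from mixed types (%s); "
            "pandas-on-Spark requires a single dtype, while pandas "
            "falls back to dtype='object'." % names
        )

check_uniform_dtype([1, 2, 3, 4])        # uniform ints: passes silently
try:
    check_uniform_dtype([1, 2, "3", 4])  # mixed int/str: graceful error
except TypeError as exc:
    print(exc)
```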
Why are the changes needed?
Pandas
>>> import pandas as pd
>>> pd.Index([1,2,'3',4])
Index([1, 2, '3', 4], dtype='object')
>>>
PySpark
Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
Spark context Web UI available at http://172.25.179.45:4042
Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
>>> ps.Index([1,2,'3',4])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
ps.from_pandas(
File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
return DataFrame(pd.DataFrame(index=pobj)).index
File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
internal = InternalFrame.from_pandas(pdf)
File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64
Users might not know how to fix it, since the behavior already differs from pandas (which silently falls back to dtype='object').
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually tested by creating an Index with mixed dtypes.
cc @zhengruifeng @xinrong-meng @itholic FYI
How about we include Series and DataFrame in this PR as well, since they all rely on infer_pd_series_spark_type?
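To illustrate why one fix could cover all three (a hypothetical sketch; the function and constructor names below are stand-ins, not the real pyspark.pandas API): Index, Series, and DataFrame construction each funnel every column through the same type-inference step, so validating there catches every entry point.

```python
# Hypothetical illustration: one shared inference function (a stand-in
# for infer_pd_series_spark_type) is called by every constructor, so a
# single mixed-type check inside it covers Index, Series, and DataFrame.
def infer_single_type(values):
    """Stand-in for infer_pd_series_spark_type: require one element type."""
    types = {type(v) for v in values}
    if len(types) > 1:
        raise TypeError("cannot infer a single Spark type from mixed types")
    return types.pop()

def build_index(values):    # stand-in for ps.Index(...)
    return ("index", infer_single_type(values))

def build_series(values):   # stand-in for ps.Series(...)
    return ("series", infer_single_type(values))

def build_frame(columns):   # stand-in for ps.DataFrame(...)
    return {name: infer_single_type(vals) for name, vals in columns.items()}

print(build_index([1, 2, 3]))          # ('index', <class 'int'>)
print(build_frame({"a": [1.0, 2.0]}))  # {'a': <class 'float'>}
```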
Thanks for working on error improvement of pandas API on Spark!
We have https://issues.apache.org/jira/browse/SPARK-39581 as an umbrella to track all relevant tickets.
Would you like to link your JIRA ticket to that? I can help as well.
Can one of the admins verify this patch?
> How about we include Series and DataFrame in this PR as well, since they all rely on infer_pd_series_spark_type?
Do you mean including the corresponding unit tests? Or making the error message handling more general? ;-)
Thanks for working on error improvement of pandas API on Spark!
We have https://issues.apache.org/jira/browse/SPARK-39581 as an umbrella to track all relevant tickets.
Would you like to link your JIRA ticket to that? I can help as well.
Sure, happy to do that and work with you.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Please correct me if I am wrong, @bzhaoopenstack, but from what I can see all comments on this PR have been addressed. Could we merge this change to master, @itholic @xinrong-meng?
Yeah, I think the change is good. @bzhaoopenstack sorry I think it slipped through my fingers. Mind updating this please?
@bzhaoopenstack, will you reopen this? If not, can I open a new PR with your code and add you as a co-author?