spark [SPARK-39822][PYTHON][PS] Provide a good feedback to users

Provide a graceful error message to users when they build an Index from values with different dtypes.

What changes were proposed in this pull request?

Raise a graceful error when users create an Index from values with different dtypes.
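As a rough illustration of the idea (not the actual patch), the check could look something like the sketch below. The helper name `check_uniform_types` and the wording of the message are hypothetical, and the check is deliberately simplistic: it compares Python type names only, so it would also flag an int/float mix that pandas and pyarrow can handle.

```python
from typing import Any, Iterable, Set


def check_uniform_types(values: Iterable[Any]) -> None:
    """Hypothetical helper: raise a readable TypeError for mixed-type input."""
    # Collect the distinct Python type names, ignoring missing values (None).
    type_names: Set[str] = {type(v).__name__ for v in values if v is not None}
    if len(type_names) > 1:
        raise TypeError(
            "Cannot build an Index from values of mixed types: "
            + ", ".join(sorted(type_names))
            + ". Cast the values to a single type first."
        )


# The input from this PR's example mixes int and str, so the check fails early
# with a message that names the offending types.
try:
    check_uniform_types([1, 2, "3", 4])
except TypeError as exc:
    message = str(exc)
    print(message)
```

The point of the sketch is only that the failure is reported in terms the user controls (the types of the input values) rather than as a low-level conversion error.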

Why are the changes needed?

Pandas

>>> import pandas as pd
>>> pd.Index([1,2,'3',4])
Index([1, 2, '3', 4], dtype='object')
>>>

PySpark

Using Python version 3.8.13 (default, Jun 29 2022 11:50:19)
Spark context Web UI available at http://172.25.179.45:4042
Spark context available as 'sc' (master = local[*], app id = local-1658301116572).
SparkSession available as 'spark'.
>>> from pyspark import pandas as ps
WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
>>> ps.Index([1,2,'3',4])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/indexes/base.py", line 184, in __new__
    ps.from_pandas(
  File "/home/spark/spark/python/pyspark/pandas/namespace.py", line 155, in from_pandas
    return DataFrame(pd.DataFrame(index=pobj)).index
  File "/home/spark/spark/python/pyspark/pandas/frame.py", line 463, in __init__
    internal = InternalFrame.from_pandas(pdf)
  File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1469, in from_pandas
    ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
  File "/home/spark/spark/python/pyspark/pandas/internal.py", line 1570, in prepare_pandas_frame
    spark_type = infer_pd_series_spark_type(reset_index[col], dtype, prefer_timestamp_ntz)
  File "/home/spark/spark/python/pyspark/pandas/typedef/typehints.py", line 360, in infer_pd_series_spark_type
    return from_arrow_type(pa.Array.from_pandas(pser).type, prefer_timestamp_ntz)
  File "pyarrow/array.pxi", line 1033, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert '3' with type str: tried to convert to int64

Users might not know how to fix this, since the behavior already differs from pandas.
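For reference, one way a user can work around the failure is to normalize the values to a single type before building the Index. This is a minimal sketch in plain Python; the variable names are illustrative, and the final `ps.Index(...)` call is left as a comment because it needs a running Spark session.

```python
values = [1, 2, "3", 4]

# Option 1: treat everything as integers (only valid if every value parses as int).
as_int = [int(v) for v in values]

# Option 2: treat everything as strings (always valid, but the index is no longer numeric).
as_str = [str(v) for v in values]

print(as_int)
print(as_str)

# Either list now has a single element type and could be passed on, e.g.:
#   from pyspark import pandas as ps
#   ps.Index(as_int)
```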

Does this PR introduce any user-facing change?

No

How was this patch tested?

Simply create the Index with different dtypes.

Jul 20 '22 07:07 bzhaoopenstack

cc @zhengruifeng @xinrong-meng @itholic FYI

Jul 20 '22 08:07 HyukjinKwon

How about we include Series and DataFrame in this PR as well since they all rely on infer_pd_series_spark_type?

Jul 20 '22 18:07 xinrong-meng

Thanks for working on error improvement of pandas API on Spark!

We have https://issues.apache.org/jira/browse/SPARK-39581 as an umbrella to track all relevant tickets.

Would you like to link your JIRA ticket to that? I can help as well.

Jul 20 '22 19:07 xinrong-meng

Can one of the admins verify this patch?

Jul 20 '22 23:07 AmplabJenkins

> How about we include Series and DataFrame in this PR as well since they all rely on infer_pd_series_spark_type?

Do you mean adding the corresponding UTs for them, or making the error message more general? ;-)

Jul 21 '22 02:07 bzhaoopenstack

> Thanks for working on error improvement of pandas API on Spark!
>
> We have https://issues.apache.org/jira/browse/SPARK-39581 as an umbrella to track all relevant tickets.
>
> Would you like to link your JIRA ticket to that? I can help as well.

Sure, happy to do that and work with you.

Jul 21 '22 02:07 bzhaoopenstack

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Nov 10 '22 00:11 github-actions[bot]

@bzhaoopenstack, please correct me if I am wrong, but from what I can see, all comments on this PR have been addressed. Could we merge this change to master, @itholic @xinrong-meng?

Jun 30 '23 05:06 kumarn

Yeah, I think the change is good. @bzhaoopenstack sorry I think it slipped through my fingers. Mind updating this please?

Jul 03 '23 06:07 HyukjinKwon

@bzhaoopenstack will you reopen this? If not, can I open a new PR with your code and add you as co-author?

Sep 22 '23 20:09 bjornjorgensen
