
[SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func

bzhaoopenstack opened this pull request 2 years ago • 12 comments

The input parameter of nsmallest should be validated as an integer; I think we are missing this validation.

At the moment, PySpark raises a confusing error when we pass unexpected types into the nsmallest function.

What changes were proposed in this pull request?

Validate that the input num is of integer type only.
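
For illustration, a minimal sketch of the kind of check being proposed (the placement and message are hypothetical, not the patch itself); per the testing note below, the patch uses an assertion:

# Hypothetical sketch, not the actual patch: reject non-integer n up
# front. bool must be excluded explicitly because it subclasses int
# in Python, which is exactly why nsmallest(True) slips through.
def nsmallest(self, n: int = 5):
    assert isinstance(n, int) and not isinstance(n, bool), (
        "n must be an integer, got %s" % type(n).__name__
    )
    ...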

Why are the changes needed?

PySpark raises a confusing AnalysisException if we do not restrict the type:

>>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
>>> df.groupby(['A'])['B']
<pyspark.pandas.groupby.SeriesGroupBy object at 0x7fda5a171fa0>
>>> df.groupby(['A'])['B'].nsmallest(True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
    sdf.withColumn(temp_rank_column, F.row_number().over(window))
  File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
'Filter (__rank__#4995 <= true)
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
   +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
      +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
         +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
            +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
               +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
                  +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false

Does this PR introduce any user-facing change?

No

How was this patch tested?

Passing a non-integer type raises an AssertionError when calling the nsmallest function.

bzhaoopenstack avatar Aug 02 '22 02:08 bzhaoopenstack

cc @itholic @xinrong-meng @zhengruifeng FYI

HyukjinKwon avatar Aug 02 '22 03:08 HyukjinKwon

I think this is a more general problem:

def nsmallest(self, n: int = 5)

this method already hints the input type, but given a boolean it fails with an AnalysisException (which should arguably be a TypeError). I suspect it would also accept a floating-point number like 5.5 and run successfully.

I did a quick investigation of typing: there is a decorator, @typing.runtime_checkable, but it only works with typing.Protocol classes.
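
For reference, a small self-contained example of what @typing.runtime_checkable actually provides: isinstance() checks against a Protocol's method set, not validation of call arguments against annotations (HasIndex is a made-up name for the demo):

from typing import Protocol, runtime_checkable

@runtime_checkable
class HasIndex(Protocol):
    def __index__(self) -> int: ...

# Enables isinstance() against the Protocol, nothing more; parameter
# annotations on ordinary functions are still not enforced.
print(isinstance(5, HasIndex))    # True, int defines __index__
print(isinstance(5.5, HasIndex))  # False, float does not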

Is there some way to enforce input type checking at runtime? @zero323

zhengruifeng avatar Aug 02 '22 04:08 zhengruifeng

Is there some way to enforce input type checking at runtime? @zero323

There exist some third-party attempts, but as far as I'm aware none is particularly popular and well maintained (pydantic includes validate_arguments on a provisional basis, but I haven't seen it used in the wild).
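
For context, a sketch of that provisional API as it exists in pydantic v1 (assumed version; note that by default pydantic coerces rather than strictly rejects, which is one reason it may not fit here):

from pydantic import validate_arguments  # pydantic v1.x

@validate_arguments
def nsmallest(n: int = 5) -> int:
    return n

nsmallest(5)      # OK
nsmallest(True)   # accepted: bool subclasses int and is coerced to 1
nsmallest("abc")  # raises pydantic.ValidationError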

Runtime checking is officially out of scope for core typing (@typing.runtime_checkable is a bit of a different beast), so I wouldn't expect any solution to emerge there.

Personally, I'd say that the current behavior is acceptable ‒ there is no significant divergence compared to pandas:

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
>>> df.groupby(['A'])['B'].nsmallest(True)
Traceback (most recent call last):
...
TypeError: cannot do positional indexing on Int64Index with these indexers [True] of type bool

and the error message is meaningful enough.

zero323 avatar Aug 02 '22 06:08 zero323

Thank you @zero323 for the detailed explanation!

zhengruifeng avatar Aug 02 '22 07:08 zhengruifeng

Can one of the admins verify this patch?

AmplabJenkins avatar Aug 02 '22 20:08 AmplabJenkins

Is there some way to enforce input type checking at runtime?

I did some experimentation with strict type checking for PySpark last year (long, long ago): https://github.com/Yikun/annotation-type-checker/pull/4

We could easily add a decorator to enable type checking without introducing any third-party library.

But if we enable this, the type check becomes the first validation to run, which might ultimately diverge from pandas behavior.
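
A minimal standard-library sketch of that idea (the check_types decorator here is hypothetical; the linked PR is the real attempt): read the annotations once, then isinstance-check each bound argument at call time.

import functools
import inspect
import typing

def check_types(func):
    # Hypothetical decorator, standard library only.
    hints = typing.get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are handled; typing constructs such
            # as Optional[int] would need extra work.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    "%s must be %s, got %s"
                    % (name, expected.__name__, type(value).__name__)
                )
        return func(*args, **kwargs)
    return wrapper

@check_types
def nsmallest(n: int = 5) -> None:
    ...

nsmallest(5)      # OK
nsmallest("5.5")  # TypeError: n must be int, got str
nsmallest(True)   # still passes: bool subclasses int in Python

Note the last call: a plain isinstance check against int still lets bool through, so nsmallest(True) needs an explicit exclusion either way.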

Yikun avatar Aug 03 '22 12:08 Yikun

If it's a common mistake, we might want to add this fix, but for this patch I personally think the example is a little too extreme. The user can also spot the problem from the error message due to data type mismatch: differing types. @bzhaoopenstack Or do you have any other points?

Yikun avatar Aug 03 '22 12:08 Yikun

maybe we still need type checking in this PR:

In [1]: import pandas as pd

In [2]: import pyspark.pandas as ps

In [3]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])

In [4]: df.groupby(['A'])['B'].nsmallest(5.5)
...
TypeError: cannot do positional indexing on Int64Index with these indexers [5.5] of type float

In [5]: df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])

In [6]: df.groupby(['A'])['B'].nsmallest(5.5)
Out[6]:
A
1  0    3
2  1    4
3  2    5
4  3    6
Name: B, dtype: int64

In [7]: df.groupby(['A'])['B'].nsmallest("5.5")
Out[7]:
A
1  0    3
2  1    4
3  2    5
4  3    6
Name: B, dtype: int64

ps runs successfully with 5.5 and "5.5", while pd throws a TypeError.

zhengruifeng avatar Aug 04 '22 01:08 zhengruifeng

I'm not sure about @Yikun's type_checker annotation, but it seems simpler than manually adding isinstance checks for each parameter.

zhengruifeng avatar Aug 04 '22 02:08 zhengruifeng

I'm not sure about @Yikun's type_checker annotation, but it seems simpler than manually adding isinstance checks for each parameter.

The code appears to read the annotations from the function definition, wrap the function in a decorator, and use typing to check the arguments at runtime.

bzhaoopenstack avatar Aug 04 '22 03:08 bzhaoopenstack

If it's a common mistake, we might want to add this fix, but for this patch I personally think the example is a little too extreme. The user can also spot the problem from the error message due to data type mismatch: differing types. @bzhaoopenstack Or do you have any other points?

That's just a demo. We cannot predict what developers will type into a Python session, and we shouldn't answer with a confusing error message. Imagine you are a developer using the function for the first time: you want to probe its boundaries. I compared the pandas code with PySpark's and tried my best to close the validation gap between pandas and PySpark.

From this discussion and my experience with this kind of issue, early input validation is necessary. The key point now, I think, is finding a good way to support validation at runtime. It would be great if the community could settle on an acceptable way to make this happen, since the PySpark pandas code tree already contains many scattered validations. So I think this would be a good improvement for PySpark pandas. ;-)

bzhaoopenstack avatar Aug 04 '22 03:08 bzhaoopenstack

already contains many scattered validations

OK, if you want to add checking here, please follow https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L735-L741 and raise a TypeError
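
Adapting that existing validation style to nsmallest might look roughly like this (illustrative sketch, not the merged code):

# Illustrative sketch, following the TypeError style linked above.
def nsmallest(self, n: int = 5):
    if not isinstance(n, int) or isinstance(n, bool):
        raise TypeError(
            "n must be an integer; however, got [%s]" % type(n).__name__
        )
    ...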

zhengruifeng avatar Aug 05 '22 03:08 zhengruifeng

Kindly asking: is there any update on this PR?

IMHO, the most important goal of the pandas API on Spark is to match pandas behavior as closely as possible, including error types and messages, to avoid confusing existing pandas users.

So I think we should at least raise a TypeError, just the same as pandas, as suggested in https://github.com/apache/spark/pull/37369#issuecomment-1205991647.

itholic avatar Oct 31 '22 15:10 itholic

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Feb 09 '23 00:02 github-actions[bot]