[SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func
The input parameter of nsmallest should be validated as an integer, and I think we are missing this validation. As a result, PySpark raises a confusing error when we pass unexpected types into the nsmallest func.
What changes were proposed in this pull request?
Validate that the input num is of integer type only.
Why are the changes needed?
PySpark will raise an AnalysisException if we do not restrict the type.
>>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
>>> df.groupby(['A'])['B']
<pyspark.pandas.groupby.SeriesGroupBy object at 0x7fda5a171fa0>
>>> df.groupby(['A'])['B'].nsmallest(True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
sdf.withColumn(temp_rank_column, F.row_number().over(window))
File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
jdf = self._jdf.filter(condition._jc)
File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
'Filter (__rank__#4995 <= true)
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
+- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
+- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
+- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
+- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
+- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false
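A minimal sketch of the kind of guard being described, using a hypothetical helper (this is not the PR's actual diff; the description above mentions an AssertionError, while reviewers below suggest a TypeError):

```python
def _validate_n(n: int) -> int:
    # Hypothetical helper, sketch only: fail fast on a non-integer n so the
    # error surfaces here instead of as an AnalysisException from Spark.
    # bool is a subclass of int in Python, so it is rejected explicitly.
    assert isinstance(n, int) and not isinstance(n, bool), (
        f"n must be an integer, got {type(n).__name__}"
    )
    return n

print(_validate_n(5))  # 5
try:
    _validate_n(True)
except AssertionError as exc:
    print(exc)  # n must be an integer, got bool
```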
Does this PR introduce any user-facing change?
No
How was this patch tested?
Passing a non-integer type will raise an AssertionError when calling the nsmallest func.
cc @itholic @xinrong-meng @zhengruifeng FYI
I think this is a more general problem:
def nsmallest(self, n: int = 5)
this method already hints at the input type, but given a boolean it fails with an AnalysisException (which should be a TypeError).
I guess it could also accept a floating-point number like 5.5 and run successfully.
I did a quick investigation of typing; there is a decorator, @typing.runtime_checkable, but it only works for typing.Protocol classes.
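For illustration (not part of the PR), a small example of that limitation: @typing.runtime_checkable only enables isinstance() checks against typing.Protocol classes and does nothing to enforce ordinary parameter annotations:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsClose(Protocol):
    # runtime_checkable allows isinstance() checks against this Protocol,
    # but it only verifies that the member exists, not its signature.
    def close(self) -> None: ...

class File:
    def close(self) -> None:
        pass

print(isinstance(File(), SupportsClose))  # True

def nsmallest(n: int = 5) -> None:
    # A plain annotation like `n: int` is not enforced at runtime;
    # nothing stops a caller from passing a bool or a float here.
    pass

nsmallest(True)  # runs without any complaint
```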
Is there some way to enforce the input type checking? @zero323
Is there some way to enforce the input type checking? @zero323
There exist some 3rd-party attempts, but as far as I'm aware none is particularly popular and well maintained (pydantic included validate_arguments on a provisional basis, but I haven't seen it used in the wild).
Runtime checking is officially out of scope for core typing (@typing.runtime_checkable is a bit of a different beast), so I wouldn't expect any solution to emerge there.
Personally, I'd say that the current behavior is acceptable ‒ there is no significant divergence compared to Pandas:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
>>> df.groupby(['A'])['B'].nsmallest(5.5)
Traceback (most recent call last):
...
TypeError: cannot do positional indexing on Int64Index with these indexers [5.5] of type float
and the error message is meaningful enough.
Thank you @zero323 for the detailed explanation!
Can one of the admins verify this patch?
Is there some way to enforce the input type checking?
I did some experiments with strict type checking for pyspark last year (long, long ago): https://github.com/Yikun/annotation-type-checker/pull/4
We could easily add a decorator to enable type checking without introducing any 3rd-party lib.
But if we enable this, the type check becomes the first validation performed, which might end up behaving differently from pandas.
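A rough sketch of what such a standard-library-only decorator could look like (this is not the code from the linked PR, and it only handles plain classes, not generic aliases like List[int]):

```python
import functools
import inspect
import typing

def enforce_annotations(func):
    # Sketch only: check the call arguments against the function's type
    # hints at call time, using nothing beyond the standard library.
    hints = typing.get_type_hints(func)
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only simple classes are checked; generic aliases would need
            # extra handling and are out of scope for this sketch.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper

@enforce_annotations
def nsmallest(n: int = 5) -> None:
    pass

nsmallest(5)      # passes
nsmallest(True)   # also passes, because bool is a subclass of int
# nsmallest(5.5)  # would raise TypeError: n must be int, got float
```

Note that because bool subclasses int, a generic hint-based check like this would still let nsmallest(True) through; the failure case that started this thread needs an explicit bool exclusion on top.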
If it's a common mistake, we might want to add this fix, but for this patch, I personally think this example is a little too extreme. The user could also find the error by reading due to data type mismatch: differing types. @bzhaoopenstack Or do you have anything else to add?
Maybe we still need type checking in this PR:
In [1]: import pandas as pd
In [2]: import pyspark.pandas as ps
In [3]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
In [4]: df.groupby(['A'])['B'].nsmallest(5.5)
...
TypeError: cannot do positional indexing on Int64Index with these indexers [5.5] of type float
In [5]: df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
In [6]: df.groupby(['A'])['B'].nsmallest(5.5)
A
1 0 3
2 1 4
3 2 5
4 3 6
Name: B, dtype: int64
In [7]: df.groupby(['A'])['B'].nsmallest("5.5")
Out[7]:
A
1 0 3
2 1 4
3 2 5
4 3 6
Name: B, dtype: int64
ps runs successfully with 5.5 and "5.5", while pd throws a TypeError.
I'm not sure about @Yikun 's type_checker annotation, but it seems simpler than manually adding isinstance checking for each parameter.
I'm not sure about @Yikun 's type_checker annotation, but it seems simpler than manually adding isinstance checking for each parameter.
The code checks the annotation definitions; it is written as a decorator and uses typing to do the checking at runtime.
If it's a common mistake, we might want to add this fix, but for this patch, I personally think this example is a little too extreme. The user could also find the error by reading due to data type mismatch: differing types. @bzhaoopenstack Or do you have anything else to add?
That's just a demo. We cannot predict how developers will type things into the Python session, and the current error message is confusing. Imagine you are a developer trying the function for the first time: you want to probe the boundaries of the func. I compared the pandas code with pyspark and tried my best to close the validation gap between pandas and PySpark.
From the discussion and my experience with this kind of issue, validation, and early validation in particular, is necessary. The key point now, I think, is finding a good way to support validation at runtime. It would be good if the community could settle on an acceptable way to make this happen, since the codetree of PySpark Pandas already contains so many separated validations. So I think that would be a good improvement for PySpark Pandas. ;-)
already contains so many separated validations
OK, if you want to add checking here, please follow https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L735-L741 to raise a TypeError.
Kindly asking, any update on this PR?
IMHO, the most important goal of the pandas API on Spark is to match pandas behavior as closely as possible, including error types and messages, to avoid confusing existing pandas users.
So I think at the very least we should raise a TypeError just as pandas does, as suggested in https://github.com/apache/spark/pull/37369#issuecomment-1205991647.
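As a sketch of that suggestion (the exact wording used at the linked frame.py lines is not reproduced here; the helper name is hypothetical), the check could mirror the pandas message quoted earlier in the thread:

```python
def _check_n(n):
    # Hypothetical helper, sketch only: reject non-integers (including bool)
    # with a TypeError whose wording follows the pandas error quoted above.
    if not isinstance(n, int) or isinstance(n, bool):
        raise TypeError(
            "cannot do positional indexing with these indexers "
            f"[{n}] of type {type(n).__name__}"
        )
    return n

_check_n(5)     # returns 5
# _check_n(5.5) would raise:
# TypeError: cannot do positional indexing with these indexers [5.5] of type float
```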
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!