PySpark - Incompatible parameter type & Unsupported operand
Pyre Bug
Bug description
PySpark DataFrame/Column expressions trigger Unsupported operand [58] and Incompatible parameter type [6] errors, even though they are valid and are even suggested in the Spark documentation.
Reproduction steps
Python snippet sample.py:
from pyspark.sql import SparkSession, functions as f
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 1 as num")
(
df
.withColumn("num", f.col("num") + 2)
.withColumn("num", f.col("num") - 2)
.withColumn("num", f.col("num") * 2)
.withColumn("num", f.col("num") / 2)
.filter(f.col("num") > 1)
.filter(f.col("num") >= 1)
.filter(f.col("num") < 1)
.filter(f.col("num") <= 1)
).show()
Expected behavior
Running `pyre check` should not report any errors, as the code is valid.
See the docs:
Logs
$ pyre check
ƛ Found 16 type errors!
sample.py:9:23 Unsupported operand [58]: `+` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:9:23 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.withColumn`, for 2nd positional argument, expected `Column` but got `int`.
sample.py:10:23 Unsupported operand [58]: `-` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:10:23 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.withColumn`, for 2nd positional argument, expected `Column` but got `int`.
sample.py:11:23 Unsupported operand [58]: `*` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:11:23 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.withColumn`, for 2nd positional argument, expected `Column` but got `int`.
sample.py:12:23 Unsupported operand [58]: `/` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:12:23 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.withColumn`, for 2nd positional argument, expected `Column` but got `float`.
sample.py:13:12 Unsupported operand [58]: `>` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:13:12 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.filter`, for 1st positional argument, expected `Union[Column, str]` but got `bool`.
sample.py:14:12 Unsupported operand [58]: `>=` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:14:12 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.filter`, for 1st positional argument, expected `Union[Column, str]` but got `bool`.
sample.py:15:12 Unsupported operand [58]: `<` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:15:12 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.filter`, for 1st positional argument, expected `Union[Column, str]` but got `bool`.
sample.py:16:12 Unsupported operand [58]: `<=` is not supported for operand types `pyspark.sql.column.Column` and `int`.
sample.py:16:12 Incompatible parameter type [6]: In call `pyspark.sql.dataframe.DataFrame.filter`, for 1st positional argument, expected `Union[Column, str]` but got `bool`.
From my inspection, part (or even all) of this issue is caused by Pyre only treating a function declared with `def __add__(self, ...): ...` as a magic method of a class. `Column`, however, defines its `__add__` and many other magic methods via assignment:
__add__ = cast(
    Callable[["Column", Union["Column", "LiteralType", "DecimalLiteral"]], "Column"],
    _bin_op("plus"),
)
(see https://github.com/apache/spark/blob/187e9a851758c0e9cec11edab2bc07d6f4404001/python/pyspark/sql/column.py#L235-L274)
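A self-contained mimic of that pattern, runnable without pyspark (the class and helper names here are illustrative, not Spark's actual code), reproduces the same false positive:

```python
from typing import Callable, cast


def _make_add() -> Callable[["FakeColumn", int], "FakeColumn"]:
    # Simplified stand-in for pyspark's _bin_op helper: returns a function
    # that will be installed on the class as an operator method.
    def op(self: "FakeColumn", other: int) -> "FakeColumn":
        return FakeColumn(self.value + other)
    return op


class FakeColumn:
    def __init__(self, value: int = 0) -> None:
        self.value = value

    # The pattern Pyre trips over: __add__ is assigned (through cast),
    # not declared with `def`.
    __add__ = cast(Callable[["FakeColumn", int], "FakeColumn"], _make_add())


c = FakeColumn(1) + 2  # works fine at runtime
print(c.value)  # 3
```

At runtime the assigned `__add__` behaves exactly like a `def`-declared one, but Pyre reports `Unsupported operand [58]` on the `+` expression.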
Here is a smaller demo that makes the problem easier to see:
class MyInt:
    int_val: int = 0

    def real_add_func(self, other_int: int) -> int:
        return self.int_val + other_int

    __add__ = real_add_func

a_int: int = MyInt() + 0
Pyre Playground reproduction: here
Pyre gives a false positive: 10:13: Unsupported operand [58]: `+` is not supported for operand types `MyInt` and `int`.
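For comparison, a hypothetical variant of the demo where the only change is that `__add__` is declared with `def` rather than assigned. Runtime behavior is identical, and the expectation is that Pyre then resolves the operator normally and the error disappears, which supports the diagnosis above:

```python
class MyIntDef:
    int_val: int = 0

    # Same logic as real_add_func above, but declared directly as the
    # magic method instead of being assigned to __add__.
    def __add__(self, other_int: int) -> int:
        return self.int_val + other_int


a_int: int = MyIntDef() + 1
print(a_int)  # 1 (0 + 1)
```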