
SchemaModel.validate on PySpark with pandas_api coerces string into NaN

Open Blogem opened this issue 2 years ago • 3 comments

Describe the bug

When I call SchemaModel.validate on a PySpark pandas_api dataframe, string values are coerced into NaN. This happens when the column in the schema is defined as Column(int) while the data contains string values.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
from pyspark.sql import SparkSession

df_pd = pd.DataFrame.from_dict(
    {
        'columnA': {0: '1', 1: '2'},
        'columnB': {0: 'B', 1: 'Y'}
    }
)


df_schema = pa.DataFrameSchema(
    {
        "columnA": pa.Column(int),
        "columnB": pa.Column(int)
    },
    coerce=True
)

spark = SparkSession \
    .builder \
    .appName("This app") \
    .getOrCreate()

df_spark = spark.createDataFrame(df_pd).pandas_api()

try:
    df_spark_parsed = df_schema.validate(df_spark, lazy=True)
except pa.errors.SchemaErrors as err:
    # errors: values of columnB get coerced into NaN, resulting in a not_nullable error (unexpected)
    f = err.failure_cases
    print(f)

# for comparison the same on the Pandas dataframe:
try:
    df_pd_parsed = df_schema.validate(df_pd, lazy=True)
except pa.errors.SchemaErrors as err:
    # errors: values of columnB can't be coerced and columnB has a mismatching dtype (not int) (expected)
    f = err.failure_cases
    print(f)

Results of code sample

PySpark pandas_api dataframe:

  schema_context   column         check check_number  failure_case  index
0         Column  columnB  not_nullable         None           NaN      0
1         Column  columnB  not_nullable         None           NaN      1

Regular Pandas dataframe:

  schema_context   column                  check check_number failure_case  index
0         Column  columnB  coerce_dtype('int64')         None            B      0
1         Column  columnB  coerce_dtype('int64')         None            Y      1
2         Column  columnB         dtype('int64')         None       object   None

Expected behavior

I would expect the same behaviour as with the Pandas dataframe, i.e. the values cannot be coerced and an error is raised stating that the column contains the wrong datatype.

Versions used

  • pandera 0.11.0
  • PySpark 3.3.0 (also tested 3.2.1)

Blogem avatar Jul 08 '22 15:07 Blogem

This is apparently Spark default behaviour: if it cannot cast a value, it returns null.
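For illustration only (a minimal sketch, not part of the original report): with Spark's default (non-ANSI) settings, CAST does not raise on uncoercible values, it silently yields null instead.

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cast demo").getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame({"columnB": ["B", "Y", "3"]}))

# casting a non-numeric string to long does not fail; it produces null
sdf.withColumn("columnB_as_long", F.col("columnB").cast("long")).show()
# +-------+---------------+
# |columnB|columnB_as_long|
# +-------+---------------+
# |      B|           null|
# |      Y|           null|
# |      3|              3|
# +-------+---------------+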

Blogem avatar Jul 08 '22 16:07 Blogem

Hi @Blogem, yes this is the behavior I encountered when testing out the pyspark integration... unfortunately I couldn't think of a nice way of handling this without keeping track of actual nulls vs uncoercible nulls... do you have any ideas?

cosmicBboy avatar Jul 20 '22 14:07 cosmicBboy

@cosmicBboy I've tried exactly that: split the dataframe column by column into null and notnull, cast all values in the column, and then test whether there are nulls in the notnull dataframe (instead of splitting the dataframe, you could also keep track of counts of nulls and check whether they changed). However, this is far from optimal and will not perform well with even a handful of columns. The only way I could find to make it optimal is by using the map() method of the RDD API. This works fine, but is obviously completely different from the dataframe approach.
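A rough sketch of the null-tracking idea described above, in plain PySpark (the coerce_and_flag helper is hypothetical, not pandera API):

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("coerce check").getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame({"columnA": ["1", "2"], "columnB": ["B", "Y"]}))

def coerce_and_flag(sdf, column, dtype="long"):
    # cast the column while keeping the original values around for comparison
    casted = sdf.withColumn("_casted", F.col(column).cast(dtype))
    # rows that were non-null before the cast but null after it are uncoercible
    uncoercible = casted.filter(F.col(column).isNotNull() & F.col("_casted").isNull())
    coerced = casted.drop(column).withColumnRenamed("_casted", column)
    return coerced, uncoercible

coerced, failures = coerce_and_flag(sdf, "columnB")
failures.show()  # the rows whose columnB could not be coerced to long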

Maybe there's another way to do it while sticking with dataframes, so that minimal changes are needed in Pandera, but I haven't found it.

Blogem avatar Jul 20 '22 15:07 Blogem