
SNOW-1526571: Mock_iff not working if there are null values in the columns

Open frederiksteiner opened this issue 1 year ago • 0 comments

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    3.11.8

  2. What operating system and processor architecture are you using?

    Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

  3. What are the component versions in the environment (pip freeze)?

    snowflake-connector-python==3.11.0
    snowflake-snowpark-python==1.19.0

  4. What did you do? Start a local testing session and create a DataFrame:
import pandas as pd
from snowflake.snowpark import Session
import snowflake.snowpark.functions as spf
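# self.schema, gc.timezone_local, and kwargs come from the reporter's own test setup (not shown here).
conn_params = {
    "schema": self.schema,
    "local_testing": True,
    "timezone": gc.timezone_local,
    **kwargs,
}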

session = Session.builder.configs(conn_params).create()
from snowflake.snowpark.window import Window
data = [
    (1, 1, 1, None),
    (1, 1, 2, None),
    (1, 1, 3, 1),
    (1, 1, 4, None),
    (1, 1, 5, None),
    (1, 1, 6, 2),
    (1, 1, 7, None),
    (1, 1, 8, None),
]
schema = ["COL1", "COL2", "COL3", "COLUMN_TO_FILL"]

df = session.create_dataframe(
    data=data,
    schema=schema,
)
window = Window.partition_by(["COL1", "COL2"]).order_by("COL3")
lead = spf.lead(df.col("COLUMN_TO_FILL"), ignore_nulls=True).over(window)
lag = spf.lag(df.col("COLUMN_TO_FILL"), ignore_nulls=True).over(window)

max_lead_lag = spf.iff(lead > lag, lead, lag)
df = df.with_column("MAX_LEAD_LAG", max_lead_lag)
result = df.to_pandas()
expected = pd.DataFrame(
    [
        (1, 1, 1, None, None),
        (1, 1, 2, None, None),
        (1, 1, 3, 1, None),
        (1, 1, 4, None, 2.0),
        (1, 1, 5, None, 2.0),
        (1, 1, 6, 2, 1.0),
        (1, 1, 7, None, 2.0),
        (1, 1, 8, None, 2.0),
    ],
    columns=schema + ["MAX_LEAD_LAG"],
)
pd.testing.assert_frame_equal(
    result,
    expected,
    check_dtype=False,
)
  5. What did you expect to see?

    The assertion should pass rather than fail.

Instead, the "MAX_LEAD_LAG" column contains only None values. The reason is that the mocked lead and lag columns come back as NullType columns, so max_lead_lag ends up full of Nones as well. This is therefore not really a problem with mock_iff, but with the local testing implementations of lead and lag returning NullType columns.
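The expected values in the reproduction can be cross-checked without Snowpark. The following is a minimal sketch in plain pandas, assuming a single partition that is already ordered by COL3 (as in the data above). The `ffill`/`bfill` plus `shift` combination is only a stand-in for `LAG`/`LEAD` with `ignore_nulls=True`, and `Series.where` is a stand-in for `IFF`; none of this is how the local testing framework itself is implemented.

```python
import pandas as pd

# COLUMN_TO_FILL for the single (COL1, COL2) partition, ordered by COL3.
values = pd.Series([None, None, 1, None, None, 2, None, None], dtype="float64")

# LAG ... IGNORE NULLS: nearest non-null value strictly before the current row.
lag = values.ffill().shift(1)

# LEAD ... IGNORE NULLS: nearest non-null value strictly after the current row.
lead = values.bfill().shift(-1)

# IFF(lead > lag, lead, lag): a NULL condition selects the second branch.
# In pandas, any comparison involving NaN is False, which has the same effect.
max_lead_lag = lead.where(lead > lag, other=lag)

print(pd.DataFrame({"lead": lead, "lag": lag, "max_lead_lag": max_lead_lag}))
```

Under these assumptions the computed column matches the MAX_LEAD_LAG values in the expected DataFrame, which suggests the expectation in the report is consistent with the intended semantics.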

@patch("iff")
def mock_iff(condition: ColumnEmulator, expr1: ColumnEmulator, expr2: ColumnEmulator):
    assert isinstance(condition.sf_type.datatype, BooleanType)

    coerce_result = get_coerce_result_type(expr1.sf_type, expr2.sf_type)  # This is None, since both lead and lag are NullType columns
    if all(condition) or all(~condition) or coerce_result is not None:
        res = ColumnEmulator(data=[None] * len(condition), dtype=object)
        expr1 = cast_column_to(expr1, coerce_result)  # This returns empty
        expr2 = cast_column_to(expr2, coerce_result)  # This returns empty
        res.where(condition, other=expr2, inplace=True)
        res.where([not x for x in condition], other=expr1, inplace=True)
        res.sf_type = coerce_result
        return res  # Now this is full of Nones too
    else:
        raise SnowparkLocalTestingException(
            f"[Local Testing] expr1 and expr2 have conflicting datatypes that cannot be coerced: {expr1.sf_type} <-> {expr2.sf_type}"
        )
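To make the two `where` calls in the excerpt easier to follow, here is a small trace using plain pandas Series. This is only a sketch: `iff_like` is a hypothetical helper name, the Series are stand-ins for `ColumnEmulator` objects (assumed here to behave like pandas Series for `where`), and the non-inplace form of `where` is used for clarity. It shows that the selection logic itself picks `expr1` where the condition is true and `expr2` elsewhere, and that all-None inputs necessarily produce an all-None result.

```python
import pandas as pd


def iff_like(condition: pd.Series, expr1: pd.Series, expr2: pd.Series) -> pd.Series:
    """Hypothetical stand-in mirroring the two where() calls in the excerpt."""
    res = pd.Series([None] * len(condition), dtype=object)
    # Keep res where the condition is True, take expr2 everywhere else.
    res = res.where(condition, other=expr2)
    # Keep res where the condition is False, take expr1 everywhere else.
    res = res.where(~condition, other=expr1)
    return res


condition = pd.Series([True, False, True, False])

# With real values, the result is expr1 where True and expr2 where False.
picked = iff_like(
    condition,
    pd.Series([10, 20, 30, 40], dtype=object),
    pd.Series([-1, -2, -3, -4], dtype=object),
)

# With inputs that are already all None, nothing can be recovered.
empty = pd.Series([None] * 4, dtype=object)
all_none = iff_like(condition, empty, empty)

print(picked.tolist())
print(all_none.isna().all())
```

If the inputs arriving from lead and lag are already empty, the selection step has nothing to choose from, which is consistent with the diagnosis that the problem lies upstream of mock_iff.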

frederiksteiner • Jul 09 '24 11:07