snowpark-python icon indicating copy to clipboard operation
snowpark-python copied to clipboard

SNOW-1367643: When/Otherwise throws a KeyError depending on the order of the select vs where clause

Open samuelsongsr opened this issue 1 year ago • 0 comments

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.10.4 (main, May 26 2022, 13:33:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

  2. What are the Snowpark Python and pandas versions in the environment?

    pandas==2.2.1 snowflake-snowpark-python==1.15.0

  3. What did you do? I'm trying to test a when().otherwise() operation and it throws a KeyError exception if the where() clause is placed before the select() clause.

import snowflake.snowpark as sp
import snowflake.snowpark.functions as sf
import snowflake.snowpark.types as st


mock_session = sp.Session.builder.config("local_testing", True).create()

inputs = mock_session.create_dataframe([
    sp.Row(ID=1, FIRST_NAME="Homer", LAST_NAME="Simpson"),
    sp.Row(ID=2, FIRST_NAME="Marge", LAST_NAME="Simpson"),
    sp.Row(ID=3, FIRST_NAME="Clark", LAST_NAME="Kent"),
    sp.Row(ID=4, FIRST_NAME="Bruce", LAST_NAME="Wayne")
])

# Note: if you place the select(...) before the where(...) call, this works.

df = inputs.where(
    (sf.col("ID") < 10) &
    (sf.col("FIRST_NAME").in_("Marge", "Clark"))
).select(
    sf.col("ID"),
    sf.col("FIRST_NAME"),
    sf.when(
        sf.col("LAST_NAME").in_("Kent", "Bruce"),
        sf.lit("---REDACTED---").astype(st.StringType())
    ).otherwise(
        sf.col("LAST_NAME")
    ).alias("LAST_NAME")
)

results = df.collect()

KeyError: [2] not in index

As noted in the comment, changing the order of operations by calling select(...) before where(...) works. However, this won't work for all use cases since there are cases where a select doesn't include a column that's filtered on. For example, given the inputs above:

# works in Snowpark connected to our Snowflake instance, does not work under test.
df = inputs.select(
    sf.col("FIRST_NAME"),
    sf.col("LAST_NAME")
).where(
    (sf.col("ID") == 1)
)

results = df.collect()

KeyError: '"ID"'
  1. What did you expect to see?

I expected two results:

[Row(ID=2, FIRST_NAME='Marge', LAST_NAME='Simpson'),
 Row(ID=3, FIRST_NAME='Clark', LAST_NAME='---REDACTED---')]

samuelsongsr avatar May 06 '24 13:05 samuelsongsr