
SNOW-1360263: row_number window function raises a ValueError (index is not in list)

Open samuelsongsr opened this issue 1 year ago • 2 comments

  1. What version of Python are you using?

    Python 3.10.4 (main, May 26 2022, 13:33:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

  2. What are the Snowpark Python and pandas versions in the environment?

    pandas==2.2.1 snowflake-snowpark-python==1.15.0

  3. What did you do?

    I'm trying to test a row_number() window function, and it raises a ValueError on this line: _plan.py

from datetime import date

from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col


# create some test data
data = [
    Row(id=1, row_date=date(2024, 1, 1), value=1),
    Row(id=2, row_date=date(2024, 1, 1), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=100),
    Row(id=2, row_date=date(2024, 1, 2), value=1)
]

# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)

# partition over id and row_date and get the records with the largest values
window = Window.partition_by("id", "row_date").order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)

results = df.collect()

  4. What did you expect to see?

    The results should include the following 4 records:

Row(id=1, row_date=date(2024, 1, 1), value=1),
Row(id=2, row_date=date(2024, 1, 1), value=1),
Row(id=1, row_date=date(2024, 1, 2), value=100),
Row(id=2, row_date=date(2024, 1, 2), value=1)

Instead of those results, a ValueError is raised: "1 is not in list".
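As a sanity check on the expected output, the same keep-the-top-row-per-partition logic can be reproduced with plain pandas (pandas==2.2.1 is already in the environment). This is an illustrative cross-check independent of Snowpark, not the Snowpark API itself:

```python
from datetime import date

import pandas as pd

# same test data as the Snowpark repro above
df = pd.DataFrame(
    [
        (1, date(2024, 1, 1), 1),
        (2, date(2024, 1, 1), 1),
        (1, date(2024, 1, 2), 1),
        (1, date(2024, 1, 2), 100),
        (2, date(2024, 1, 2), 1),
    ],
    columns=["id", "row_date", "value"],
)

# rank rows within each (id, row_date) partition, largest value first;
# this mirrors ROW_NUMBER() OVER (PARTITION BY id, row_date ORDER BY value DESC)
df["row_num"] = (
    df.groupby(["id", "row_date"])["value"]
    .rank(method="first", ascending=False)
    .astype(int)
)

# keep only the top-ranked row per partition
top = df[df["row_num"] == 1].drop(columns="row_num")
print(top)  # 4 rows, one per (id, row_date) pair
```

Running this confirms the 4 expected records, including value=100 for (id=1, row_date=2024-01-02).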

samuelsongsr avatar May 02 '24 14:05 samuelsongsr

Any update on this one? I am still encountering this issue on version 1.22.1

tvdboom avatar Sep 13 '24 09:09 tvdboom

I encountered the same problem and experimented with a similar, minimal example.

This one does not work:

from datetime import date

from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col


# create some test data
data = [
    Row(id=1, row_date=date(2024, 1, 1), value=1),
    Row(id=2, row_date=date(2024, 1, 1), value=1),
    Row(id=3, row_date=date(2024, 1, 2), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=100),
]

# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)

# partition over row_date and id and get the records with the largest values
window = Window.partition_by(["row_date", "id"]).order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)

results = df.collect()

But after swapping "row_date" and "id" in partition_by, it works and raises no such error:

from datetime import date

from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col


# create some test data
data = [
    Row(id=1, row_date=date(2024, 1, 1), value=1),
    Row(id=2, row_date=date(2024, 1, 1), value=1),
    Row(id=3, row_date=date(2024, 1, 2), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=100),
]

# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)

# partition over id and row_date and get the records with the largest values
window = Window.partition_by(["id", "row_date"]).order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)

results = df.collect()

frederiksteiner avatar Sep 24 '24 06:09 frederiksteiner

This should be fixed in v1.24.0
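For anyone checking whether their installed release already includes the fix, here is a small sketch using only the standard library; the `has_fix` helper and the 1.24.0 threshold come from the comment above, not from the Snowpark library:

```python
from importlib.metadata import version  # stdlib; reads the installed package version


def has_fix(ver: str, minimum: tuple = (1, 24, 0)) -> bool:
    """Compare the leading numeric components of a dotted version string
    against v1.24.0, the release said above to contain the fix."""
    parts = tuple(int(p) for p in ver.split(".")[:3])
    return parts >= minimum


# usage against the installed package, e.g.:
#   has_fix(version("snowflake-snowpark-python"))
print(has_fix("1.22.1"))  # the version on which tvdboom still saw the bug
print(has_fix("1.24.0"))
```

Note that the naive `int(...)` split assumes plain `X.Y.Z` version strings; pre-release suffixes would need a proper parser such as `packaging.version`.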

sfc-gh-jrose avatar Nov 05 '24 17:11 sfc-gh-jrose