SNOW-1360263: row_number window function raises a ValueError (index is not in list)
-
What version of Python are you using?
Python 3.10.4 (main, May 26 2022, 13:33:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
-
What are the Snowpark Python and pandas versions in the environment?
pandas==2.2.1 snowflake-snowpark-python==1.15.0
-
What did you do? I'm trying to test a row_number() window function, and it raises a ValueError from this line in _plan.py:
from datetime import date
from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col
# create some test data
data = [
Row(id=1, row_date=date(2024, 1, 1), value=1),
Row(id=2, row_date=date(2024, 1, 1), value=1),
Row(id=1, row_date=date(2024, 1, 2), value=1),
Row(id=1, row_date=date(2024, 1, 2), value=100),
Row(id=2, row_date=date(2024, 1, 2), value=1)
]
# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)
# partition over id and row_date and get the records with the largest values
window = Window.partition_by("id", "row_date").order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)
results = df.collect()
-
What did you expect to see?
The results should include the following 4 records:
Row(id=1, row_date=date(2024, 1, 1), value=1),
Row(id=2, row_date=date(2024, 1, 1), value=1),
Row(id=1, row_date=date(2024, 1, 2), value=100),
Row(id=2, row_date=date(2024, 1, 2), value=1)
Instead of the above results, a ValueError is raised stating "1 is not in list".
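For reference, the expected result can be reproduced in plain Python without a Snowpark session. This is a minimal sketch of the same ROW_NUMBER() OVER (PARTITION BY id, row_date ORDER BY value DESC) logic, using stdlib grouping; the `row_number_over` helper is made up for illustration and is not part of the Snowpark API:

```python
from datetime import date
from itertools import groupby

# Same test data as the repro above, as plain dicts
rows = [
    {"id": 1, "row_date": date(2024, 1, 1), "value": 1},
    {"id": 2, "row_date": date(2024, 1, 1), "value": 1},
    {"id": 1, "row_date": date(2024, 1, 2), "value": 1},
    {"id": 1, "row_date": date(2024, 1, 2), "value": 100},
    {"id": 2, "row_date": date(2024, 1, 2), "value": 1},
]

def row_number_over(rows, partition_keys, order_key):
    """Emulate ROW_NUMBER() OVER (PARTITION BY ... ORDER BY <order_key> DESC)."""
    # Sort by partition key, then by value descending, so each partition is contiguous
    keyed = sorted(rows, key=lambda r: ([r[k] for k in partition_keys], -r[order_key]))
    out = []
    for _, group in groupby(keyed, key=lambda r: [r[k] for k in partition_keys]):
        for n, r in enumerate(group, start=1):
            out.append({**r, "row_num": n})
    return out

# Keep only the top row per (id, row_date) partition
top = [r for r in row_number_over(rows, ["id", "row_date"], "value") if r["row_num"] == 1]
```

This yields exactly the 4 records listed above, including value=100 for the (id=1, 2024-01-02) partition, which is what local testing should return instead of raising.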
Any update on this one? I am still encountering this issue on version 1.22.1
I encountered the same problem and experimented with a similar, smaller example.
This one does not work:
from datetime import date
from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col
# create some test data
data = [
Row(id=1, row_date=date(2024, 1, 1), value=1),
Row(id=2, row_date=date(2024, 1, 1), value=1),
Row(id=3, row_date=date(2024, 1, 2), value=1),
Row(id=1, row_date=date(2024, 1, 2), value=100),
]
# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)
# partition over row_date and id and get the records with the largest values
window = Window.partition_by(["row_date", "id"]).order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)
results = df.collect()
But when the order of "row_date" and "id" in partition_by is swapped, it works and raises no such error:
from datetime import date
from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col
# create some test data
data = [
Row(id=1, row_date=date(2024, 1, 1), value=1),
Row(id=2, row_date=date(2024, 1, 1), value=1),
Row(id=3, row_date=date(2024, 1, 2), value=1),
Row(id=1, row_date=date(2024, 1, 2), value=100),
]
# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)
# partition over id and row_date and get the records with the largest values
window = Window.partition_by(["id", "row_date"]).order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)
results = df.collect()
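To be clear that swapping the keys is only a workaround and not a semantic difference: the order of the partition-by columns only permutes the partition key tuples, so the same rows should win either way. A small stdlib sketch (the `top_rows` helper is hypothetical, not Snowpark API) demonstrates this:

```python
from datetime import date

# Same test data as the failing example above
rows = [
    {"id": 1, "row_date": date(2024, 1, 1), "value": 1},
    {"id": 2, "row_date": date(2024, 1, 1), "value": 1},
    {"id": 3, "row_date": date(2024, 1, 2), "value": 1},
    {"id": 1, "row_date": date(2024, 1, 2), "value": 100},
]

def top_rows(partition_keys):
    # Keep the max-value row per partition, i.e. row_number() == 1
    # after ORDER BY value DESC
    best = {}
    for r in rows:
        k = tuple(r[c] for c in partition_keys)
        if k not in best or r["value"] > best[k]["value"]:
            best[k] = r
    return sorted(best.values(), key=lambda r: (r["id"], r["row_date"]))

print(top_rows(["row_date", "id"]) == top_rows(["id", "row_date"]))  # True
```

So both partition_by orderings should produce identical results; only the local testing implementation treats them differently.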
This should be fixed in v1.24.0