SNOW-664934: Dataframe schemas don't persist between Pandas and Snowpark
- What version of Python are you using?
Python 3.8
- What operating system and processor architecture are you using?
Snowflake Python UDF
- What are the component versions in the environment (pip freeze)?
Only snowflake-snowpark-python
- What did you do?
Snowpark has a bug where it does not preserve column types when converting between Pandas and Snowpark DataFrames, nor does it allow manually setting the schema.
For instance,
df1 = session.sql(sql).to_pandas()
df2 = session.create_dataframe(df1)
The timestamp field, which has TimestampType in the original query result, becomes LongType in df2.
I've also tried storing the schema and passing it explicitly, but got the same result.
df1 = session.sql(sql)
df1_schema = df1.schema
df1 = df1.to_pandas()
df2 = session.create_dataframe(df1, df1_schema)
This prevents me from writing the DataFrame back to the table, since the column needs to be TimestampType rather than LongType.
- What did you expect to see?
The schemas persist across the conversions.
Do you have a minimal repro for this? We'll look into it next week.

@sfc-gh-jdu Just take any timestamp column, convert it to Pandas, and convert it back; it changes to LongType.
Look at the DETECTED_AT field in the image.
By the way, other types change as well, e.g. the FAILURES column went from LongType to DoubleType.
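For reference, here's a minimal sketch of the round trip (the column name is illustrative, and the exact inferred type may vary by version):

import datetime

# Start from a Snowpark DataFrame with a timestamp column.
df1 = session.create_dataframe(
    [(datetime.datetime(2022, 10, 1, 12, 0, 0),)], schema=["DETECTED_AT"]
)
print(df1.schema)  # DETECTED_AT: TimestampType

# Round-trip through Pandas.
pdf = df1.to_pandas()
df2 = session.create_dataframe(pdf)
print(df2.schema)  # DETECTED_AT: LongType on affected versions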
@sfc-gh-jdu Hi! Any updates? I'd love to supply additional information if needed. Appreciate your help.
@elongl : We are going to look into this in the next two weeks. Stay tuned!
@elongl : The timestamp column type change is due to the way INFER_SCHEMA works in Snowflake. There is an ongoing effort to roll out a behavior change to address this, but if you want, we could enable it for you. Feel free to send the Snowflake account information to my email shixuan(dot)fan(at)snowflake(dot)com and I could do that for you.
As for the long -> double type change, I'm not able to reproduce it locally, so I'm not sure whether the INFER_SCHEMA change would affect it. I suspect the behavior might depend on the actual data. If you could share a minimal sample of data that triggers this behavior, that would be great. Alternatively, we could see if enabling the timestamp fix helps this scenario as well.
Fixing it for my account specifically won't help, since I'm developing a package and I want it to work for other users as well. Isn't there a way to force the types to stick?
It is a bit more complicated. When converting a pandas DataFrame to a Snowpark DataFrame, we need to use a Parquet file as an intermediate step and create a table from it. The schema is determined by the INFER_SCHEMA result on the Parquet file.
But there is a potential solution: accept a user-provided schema that overrides inference, so we skip INFER_SCHEMA altogether. @sfc-gh-jdu , @sfc-gh-yixie : What do you think?
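In the meantime, a possible client-side workaround is to cast the columns of the round-tripped DataFrame back to the saved schema. A rough sketch (untested; timestamp columns that come back as epoch integers may need to_timestamp with the right scale instead of a plain cast):

from snowflake.snowpark.functions import col

# Re-apply a saved Snowpark schema by casting each column of the
# round-tripped DataFrame back to its original type.
def apply_schema(df, schema):
    for field in schema.fields:
        df = df.with_column(field.name, col(field.name).cast(field.datatype))
    return df

df2 = apply_schema(session.create_dataframe(pandas_df), df1_schema)

Here pandas_df and df1_schema are the pandas DataFrame and saved schema from the original report.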
@elongl - While we are working on a schema override when creating a Snowpark DataFrame from a pandas DataFrame, please note that this does not fully work with timestamps until the behavior change I mentioned is rolled out (see https://github.com/snowflakedb/snowpark-python/pull/557#discussion_r993923546).
Regardless of the progress of the server-side behavior change for the new Parquet logical timestamp type, we think being able to specify a schema is still useful.
Are there any updates on this, by chance? This is a major blocker for our team's Snowpark implementation.
Hi @jacob-martin-slalom , there are issues related to rolling out the server-side change to use the new Parquet logical type, so I don't have a clear ETA (cc @sfc-gh-yuliu ).
As for schema override, that should be part of the first release in 2023: https://github.com/snowflakedb/snowpark-python/pull/557#issuecomment-1279218902.
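Once that ships, usage might look like the sketch below (assuming the override extends the existing schema parameter to pandas input; the final API may differ):

# Hypothetical usage once schema override is released (API not final):
df1 = session.sql(sql)
df1_schema = df1.schema
df2 = session.create_dataframe(df1.to_pandas(), schema=df1_schema)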