
SNOW-886649: write_pandas inserts datetime64[ns] to Snowflake as an Invalid Date

Open dwelden opened this issue 2 years ago • 8 comments

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.10.5

  2. What operating system and processor architecture are you using?

    Windows-10-10.0.19044-SP0

  3. What are the component versions in the environment (pip freeze)? pipfreeze.txt

  4. What did you do?

import pandas as pd
from snowflake.snowpark import Session

# session = Session.builder.configs(connection_parameters).create()
df = pd.read_sas(source_file, format='sas7bdat', encoding='cp1252')
session.write_pandas(df, target_table, overwrite=overwrite)

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152130 entries, 0 to 152129
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   ProductKey          152130 non-null  object
 1   DistributionCenter  152130 non-null  object
 2   DATE                152130 non-null  datetime64[ns]
 3   Discount            152130 non-null  float64
 4   Revenue             151830 non-null  float64
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 5.8+ MB

>>> df
                                           ProductKey DistributionCenter       DATE  Discount  Revenue
0       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-10       0.0      0.0
1       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-17       0.0      0.0
2       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-24       0.0      0.0
3       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-03       0.0      0.0
4       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-10       0.0      0.0
...                                               ...                ...        ...       ...      ...
152125            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-15       0.0  27027.0
152126            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-22       0.0  28124.0
152127            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-01       0.0  25574.0
152128            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-08       0.0  25812.0
152129            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-15       0.0  25532.0

[152130 rows x 5 columns]
  5. What did you expect to see?

Expected the pandas datetime column to be written to Snowflake as TIMESTAMP_NTZ. Instead, the dates are not loading properly:

PRODUCTKEY  DISTRIBUTIONCENTER  DATE          DISCOUNT  REVENUE
Cleanser    Cary                Invalid Date  0         0
Shampoo     Cary                Invalid Date  0         0
...         ...                 ...           ...       ...
  6. Can you set logging to DEBUG and collect the logs?

sas2snow2.log

dwelden avatar Aug 04 '23 14:08 dwelden

The data set used is DCSKINPRODUCT.sas7bdat from SAS-Visual-Forecasting---sample-data-sets

DCSKINPRODUCT.zip

-- Query Snowflake to examine the invalid dates
select to_varchar("DATE") from dcskinproduct limit 20;

43113053-09-24 17:00:00.000
43132219-01-30 16:00:00.000
43151384-06-05 17:00:00.000
43170549-10-11 17:00:00.000
43189715-02-16 16:00:00.000
43208880-06-22 17:00:00.000
43228045-10-28 17:00:00.000
43247211-03-05 16:00:00.000
43266376-07-10 17:00:00.000
43285541-11-15 16:00:00.000
43304707-03-23 17:00:00.000
43323872-07-27 17:00:00.000
43343037-12-02 16:00:00.000
43362203-04-09 17:00:00.000
43381368-08-13 17:00:00.000
43400533-12-19 16:00:00.000
43419699-04-25 17:00:00.000
43438864-08-30 17:00:00.000
43458030-01-05 16:00:00.000
43477195-05-13 17:00:00.000

The data type datetime64[ns] is stored internally as an int64 count of nanoseconds since 1970-01-01. Given the year values Snowflake shows here, my suspicion is that write_pandas is interpreting the value as if it were milliseconds.

>>> df['DATE'][0], df['DATE'].astype('int64')[0]
(Timestamp('2013-02-10 00:00:00'), 1360454400000000000)
select to_varchar(timestampadd('milliseconds', 1360454400000000000, '1970-01-01'::timestamp));
43113053-09-25 00:00:00.000

select to_varchar(timestampadd('nanoseconds', 1360454400000000000, '1970-01-01'::timestamp));
2013-02-10 00:00:00.000
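The same mismatch can be reproduced locally in plain pandas, with no Snowflake connection (a sketch; the integer below matches the first DATE row above):

```python
import pandas as pd

# datetime64[ns] stores an int64 count of nanoseconds since the Unix epoch
ns = pd.Timestamp("2013-02-10").value
print(ns)  # 1360454400000000000

# Correct interpretation: nanoseconds since 1970-01-01
correct = pd.Timestamp(0) + pd.Timedelta(ns, unit="ns")
print(correct)  # 2013-02-10 00:00:00

# Buggy interpretation: the same integer read as milliseconds,
# i.e. one million times too large; roughly 43.1 million years out,
# consistent with the year 43113053 seen in the Snowflake output above
years_off = ns / 86_400_000 / 365.25
print(years_off)
```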

dwelden avatar Aug 04 '23 14:08 dwelden

For now, I have the following workaround in place, but this should not be required.

    # Map datetime columns to string (needed because of a bug in write_pandas)
    from pandas.api.types import is_datetime64_dtype

    for column in df.columns.to_list():
        if is_datetime64_dtype(df[column]):
            df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')

dwelden avatar Aug 04 '23 14:08 dwelden

Linking snowflake-connector-python#600 and comment. A new release of snowflake-connector-python adds a use_logical_type param.

ericpettengill avatar Nov 09 '23 17:11 ericpettengill

Closing this issue as use_logical_type is working as expected. Please open a new issue if you run into further problems.

sfc-gh-aalam avatar Nov 09 '23 19:11 sfc-gh-aalam

When will this parameter be available in Snowpark? The new parameter is not available in Snowpark for Python version 1.9.0 or 1.10.0 (snowflake.snowpark.Session.write_pandas).

>>> session.write_pandas(df, target_table, use_logical_type=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Session.write_pandas() got an unexpected keyword argument 'use_logical_type'
>>> from snowflake.snowpark.version import VERSION as snowpark_version
>>> print(snowpark_version)
(1, 10, 0)

dwelden avatar Nov 14 '23 14:11 dwelden

It is coming in the next release.

sfc-gh-aalam avatar Nov 14 '23 18:11 sfc-gh-aalam

My bad, I thought this was a snowflake-connector-python issue.

sfc-gh-aalam avatar Nov 14 '23 18:11 sfc-gh-aalam

Bump: I'm currently running into related issues. Happy to open a new issue, but I assume this one is still open for a good reason. Thanks in advance for any eyes here.

Edit: Python deps

  • Python 3.10.13
  • pandas 2.2.2
  • snowflake 0.6.0
  • snowflake-connector-python 3.7.1
  • snowflake-snowpark-python 1.13.0

I'm successfully passing use_logical_type to write_pandas(), but at the Snowflake level this seems to allow the COPY from staged Parquet to Snowflake to write junk into tables. Most notably, I have the following case:

  • I'm taking a datetime.datetime that gets written to Parquet with use_logical_type. I believe this enables Parquet-specific encoding (under the hood, storing the timestamp as an integer offset from the Unix epoch).
  • The COPY command from Parquet to Snowflake results in Invalid Date being written to a column of type TIMESTAMP_NTZ(9).

As a user, this is hard for me to triage because the staged Parquet files are difficult to peek into. In my opinion, the write_pandas command should not be able to write something like Invalid Date to Snowflake in a TIMESTAMP column. It'd be nice if Snowflake couldn't store that to begin with, but I assume there's a good reason for that and it's a layer deeper than the issue at hand.

Is there a good workaround for this? My current fix is just to avoid using use_logical_type altogether.

ameryisafreeelf avatar Jul 31 '24 21:07 ameryisafreeelf