SNOW-886649: write_pandas inserts datetime64[ns] to Snowflake as an Invalid Date
Please answer these questions before submitting your issue. Thanks!

- What version of Python are you using?

  Python 3.10.5

- What operating system and processor architecture are you using?

  Windows-10-10.0.19044-SP0

- What are the component versions in the environment (`pip freeze`)?

  pipfreeze.txt

- What did you do?
```python
import pandas as pd
from snowflake.snowpark import Session

# session created earlier, e.g.:
# session = Session.builder.configs(connection_parameters).create()
df = pd.read_sas(source_file, format='sas7bdat', encoding='cp1252')
session.write_pandas(df, target_table, overwrite=overwrite)
```
```
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152130 entries, 0 to 152129
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   ProductKey          152130 non-null  object
 1   DistributionCenter  152130 non-null  object
 2   DATE                152130 non-null  datetime64[ns]
 3   Discount            152130 non-null  float64
 4   Revenue             151830 non-null  float64
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 5.8+ MB
>>> df
                                           ProductKey DistributionCenter       DATE  Discount  Revenue
0       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-10       0.0      0.0
1       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-17       0.0      0.0
2       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-24       0.0      0.0
3       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-03       0.0      0.0
4       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-10       0.0      0.0
...                                               ...                ...        ...       ...      ...
152125            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-15       0.0  27027.0
152126            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-22       0.0  28124.0
152127            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-01       0.0  25574.0
152128            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-08       0.0  25812.0
152129            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-15       0.0  25532.0
[152130 rows x 5 columns]
```
- What did you expect to see?

Expected the Pandas datetime column to be written to Snowflake as a TIMESTAMP_NTZ. Instead, the dates are not loading properly:
| PRODUCTKEY | DISTRIBUTIONCENTER | DATE | DISCOUNT | REVENUE |
|---|---|---|---|---|
| Cleanser | Cary | Invalid Date | 0 | 0 |
| Shampoo | Cary | Invalid Date | 0 | 0 |
| ... | ... | ... | ... | ... |
- Can you set logging to DEBUG and collect the logs?

  The data set used is DCSKINPRODUCT.sas7bdat from SAS-Visual-Forecasting---sample-data-sets.
```sql
-- Query Snowflake to examine the invalid dates
select to_varchar("DATE") from dcskinproduct limit 20;
```

```
43113053-09-24 17:00:00.000
43132219-01-30 16:00:00.000
43151384-06-05 17:00:00.000
43170549-10-11 17:00:00.000
43189715-02-16 16:00:00.000
43208880-06-22 17:00:00.000
43228045-10-28 17:00:00.000
43247211-03-05 16:00:00.000
43266376-07-10 17:00:00.000
43285541-11-15 16:00:00.000
43304707-03-23 17:00:00.000
43323872-07-27 17:00:00.000
43343037-12-02 16:00:00.000
43362203-04-09 17:00:00.000
43381368-08-13 17:00:00.000
43400533-12-19 16:00:00.000
43419699-04-25 17:00:00.000
43438864-08-30 17:00:00.000
43458030-01-05 16:00:00.000
43477195-05-13 17:00:00.000
```
The datetime64[ns] dtype is stored internally as an int64 count of nanoseconds since 1970-01-01. Given the year values Snowflake shows here, my suspicion is that write_pandas is interpreting that offset as milliseconds rather than nanoseconds.
```python
>>> df['DATE'][0], df['DATE'].astype('int64')[0]
(Timestamp('2013-02-10 00:00:00'), 1360454400000000000)
```

```sql
select to_varchar(timestampadd('milliseconds', 1360454400000000000, '1970-01-01'::timestamp));
-- 43113053-09-25 00:00:00.000

select to_varchar(timestampadd('nanoseconds', 1360454400000000000, '1970-01-01'::timestamp));
-- 2013-02-10 00:00:00.000
```
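The unit mismatch can also be sanity-checked locally with plain arithmetic (a minimal sketch; the result matches the corrupted years above only approximately, since this uses a flat 365.2425-day year rather than Snowflake's calendar arithmetic):

```python
import pandas as pd

# int64 nanoseconds since the Unix epoch, as pandas stores datetime64[ns]
ns = pd.Timestamp("2013-02-10").value  # 1360454400000000000

# Correct reading: nanoseconds -> seconds -> days -> years since 1970
years_as_ns = ns / 1e9 / 86_400 / 365.2425
print(1970 + years_as_ns)  # ~2013

# Misreading the same integer as milliseconds inflates the offset by 1e6
years_as_ms = ns / 1e3 / 86_400 / 365.2425
print(1970 + years_as_ms)  # ~43 million, in line with the junk years above
```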
For now, I have the following workaround in place, but it should not be required.

```python
from pandas.api.types import is_datetime64_dtype

# Map datetime columns to strings (needed because of the bug in write_pandas)
for column in df.columns.to_list():
    if is_datetime64_dtype(df[column]):
        df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
```
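Applied to a toy frame, the workaround behaves like this (a self-contained sketch with made-up sample data):

```python
import pandas as pd
from pandas.api.types import is_datetime64_dtype

df = pd.DataFrame({
    "DATE": pd.to_datetime(["2013-02-10", "2013-02-17"]),
    "Revenue": [0.0, 27027.0],
})

# Convert each datetime64 column to a formatted string before write_pandas
for column in df.columns.to_list():
    if is_datetime64_dtype(df[column]):
        df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')

print(df["DATE"].tolist())
# ['2013-02-10 00:00:00.000000', '2013-02-17 00:00:00.000000']
```

Snowflake then parses the strings back into TIMESTAMP_NTZ on load, at the cost of a larger staged file.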
Linking snowflake-connector-python#600 and comment. The new release of snowflake-connector-python adds a `use_logical_type` parameter.

Closing this issue, as `use_logical_type` is working as expected. Please open a new issue if you run into further problems.
When will this parameter be available in Snowpark? The new parameter is not available in `snowflake.snowpark.Session.write_pandas` in Snowpark for Python version 1.9.0 or 1.10.0.
```python
>>> session.write_pandas(df, target_table, use_logical_type=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Session.write_pandas() got an unexpected keyword argument 'use_logical_type'
>>> from snowflake.snowpark.version import VERSION as snowpark_version
>>> print(snowpark_version)
(1, 10, 0)
```
It is coming in the next release.
My bad, I thought this was a snowflake-connector-python issue.
BUMP: I'm currently running into related issues. Happy to open a new issue, but I assume this one is still open for a good reason. Thanks in advance for any eyes here.
Edit: Python deps
- Python 3.10.13
- pandas 2.2.2
- snowflake 0.6.0
- snowflake-connector-python 3.7.1
- snowflake-snowpark-python 1.13.0
I'm successfully passing `use_logical_type` to `write_pandas()`, but at the Snowflake level this seems to allow the COPY from staged Parquet into Snowflake to write junk into tables. Most notably, I have the following case:
- I'm taking a `datetime.datetime` that gets written to Parquet with `use_logical_type`. I believe this allows for Parquet-specific encoding (under the hood, storing the timestamp as an integer offset from the Unix epoch).
- The COPY command from Parquet to Snowflake results in `Invalid Date` being written to a column of type `TIMESTAMP_NTZ(9)`.
As a user, this is hard to triage because the staged Parquet files are difficult to peek into. In my opinion, write_pandas should not be able to write something like `Invalid Date` into a TIMESTAMP column. It would be nice if Snowflake couldn't store that value to begin with, but I assume there's a good reason for that, and it's a layer deeper than the issue at hand.
Is there a good workaround for this? My current fix is just to avoid using use_logical_type altogether.