snowflake-sqlalchemy
SNOW-219884: Pandas datetime with timezone converts to `timestamp_ntz` in snowflake
Please answer these questions before submitting your issue. Thanks!

- What version of Python are you using (`python --version`)? Python 3.8.2
- What operating system and processor architecture are you using (`python -c 'import platform; print(platform.platform())'`)? macOS-10.15.7-x86_64-i386-64bit
- What are the component versions in the environment (`pip list`)?

```
Package                     Version
asn1crypto                  1.4.0
awswrangler                 1.10.0
azure-common                1.1.25
azure-core                  1.8.2
azure-storage-blob          12.5.0
boto3                       1.15.18
botocore                    1.18.18
certifi                     2020.6.20
cffi                        1.14.3
chardet                     3.0.4
cryptography                3.2.1
idna                        2.10
isodate                     0.6.0
jmespath                    0.10.0
msrest                      0.6.19
numpy                       1.19.4
oauthlib                    3.1.0
oscrypto                    1.2.1
packaging                   20.4
pandas                      1.1.4
pip                         20.2.1
psycopg2-binary             2.8.6
pyarrow                     2.0.0
pycparser                   2.20
pycryptodomex               3.9.9
PyJWT                       1.7.1
PyMySQL                     0.10.1
pyOpenSSL                   19.1.0
pyparsing                   2.4.7
python-dateutil             2.8.1
pytz                        2020.4
requests                    2.23.0
requests-oauthlib           1.3.0
s3transfer                  0.3.3
setuptools                  50.2.0
six                         1.15.0
snowflake-connector-python  2.3.5
snowflake-sqlalchemy        1.2.4
SQLAlchemy                  1.3.20
sqlalchemy-redshift         0.8.1
urllib3                     1.25.11
wheel                       0.35.1
```
- What did you do? If possible, provide a recipe for reproducing the error. A complete runnable program is good.

I extracted data from MySQL and constructed a pandas DataFrame. For example:
```
       transaction_date                  created_on                        modified_on
0      2017-10-04 08:00:00+00:00         2018-10-05 18:56:39+08:00         2018-10-05 18:56:39+08:00
1      2017-10-04 08:00:00+00:00         2018-10-05 18:56:39+08:00         2018-10-05 18:56:39+08:00
2      2017-10-04 08:00:00+00:00         2018-10-05 18:56:39+08:00         2018-10-05 18:56:39+08:00
3      2017-10-04 08:00:00+00:00         2018-10-05 18:56:39+08:00         2018-10-05 18:56:39+08:00
4      2017-10-04 08:00:00+00:00         2018-10-05 18:56:39+08:00         2018-10-05 18:56:39+08:00
...    ...                               ...                               ...
33934  2020-11-05 00:00:00+00:00         2020-11-06 18:45:03.681785+08:00  2020-11-06 18:45:03.711735+08:00
33935  2020-11-05 00:00:00+00:00         2020-11-06 18:45:03.685278+08:00  2020-11-06 18:45:18.441977+08:00
33936  2020-11-05 00:00:00+00:00         2020-11-06 18:45:03.688080+08:00  2020-11-06 18:45:21.579158+08:00
33937  2020-11-09 10:00:13.190445+00:00  2020-11-09 10:00:13.195200+08:00  2020-11-09 10:00:13.195237+08:00
33938  2020-11-09 10:28:57+00:00         2020-11-09 10:30:02.515862+08:00  2020-11-09 10:30:02.515884+08:00
```
The column types are:

```
transaction_date    datetime64[ns, UTC]
created_on          datetime64[ns, Asia/Hong_Kong]
modified_on         datetime64[ns, Asia/Hong_Kong]
```
I used the following code to load the data into Snowflake:

```python
data.to_sql(target_table, conn, if_exists='replace', index=False, method=pd_writer)
```
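A minimal self-contained sketch of this setup (the `engine` and target table name are placeholders, so the actual load call is shown commented out since it needs a live Snowflake connection):

```python
import pandas as pd

# Build a small frame mirroring the one above: a UTC column and an
# Asia/Hong_Kong column, both timezone-aware.
df = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2017-10-04 08:00:00"]).tz_localize("UTC"),
    "created_on": pd.to_datetime(["2018-10-05 18:56:39"]).tz_localize("Asia/Hong_Kong"),
})
print(df.dtypes)  # both columns are timezone-aware datetime64 dtypes

# With a real Snowflake engine the load would then be:
# from snowflake.connector.pandas_tools import pd_writer
# df.to_sql("target_table", engine, if_exists="replace", index=False, method=pd_writer)
```

Even though both dtypes carry a timezone, the table created this way ends up with `TIMESTAMP_NTZ` columns.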
- What did you expect to see? `TIMESTAMP_TZ` columns.
- What did you see instead? `TIMESTAMP_NTZ(9)`.
- Can you set logging to DEBUG and collect the logs?

```python
import logging
import os

for logger_name in ['snowflake.sqlalchemy', 'snowflake.connector', 'botocore']:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)
```
Any idea why I get this behavior?

I inspected the SQLAlchemy log file. The generated SQL statement is:
```sql
CREATE TABLE x_ap_account_transactions (
    "ID" BIGINT,
    "OBJECT_PK" BIGINT,
    "OBJECT_REF" TEXT,
    "TRANSACTION_DESC" TEXT,
    "TRANSACTION_TYPE" TEXT,
    "TRANSACTION_DATE" TIMESTAMP,
    "ORIGINAL_FEE" FLOAT,
    "ORIGINAL_CURRENCY" TEXT,
    "EFFECTIVE_FEE" FLOAT,
    "EFFECTIVE_CURRENCY" TEXT,
    "IS_SUSPENSION_TRANSACTION" BIGINT,
    "CREATED_ON" TIMESTAMP,
    "MODIFIED_ON" TIMESTAMP,
    "STATUS" TEXT,
    "ACCOUNT_ID" BIGINT
)
```
It seems that the statement only creates plain `TIMESTAMP` columns, without timezone information. Or should I set the timestamp type mapping described here: https://docs.snowflake.com/en/sql-reference/parameters.html#timestamp-type-mapping?
I think I solved it by setting the timestamp type mapping and the timezone:

```python
conn.execute("ALTER SESSION SET TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_TZ';")
conn.execute("ALTER SESSION SET TIMEZONE = 'UTC';")
```
Saw the same problem using pandas `to_sql`. Looks like a bug, as the doc states that "Timezone aware datetime columns will be written as Timestamp with timezone type with SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware timestamps local to the original timezone." https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
@lightupyiqian But it seems Snowflake supports timestamps with timezone? I expect `pd_writer` can handle that?
Yeah, it should, but it has other problems, like not quoting identifiers properly.
This bug was introduced in v1.1.18 and also affects `sqlalchemy.DateTime(timezone=True)` columns. In that case, the column is passed to Snowflake as `DATETIME`, and `DATETIME` is an alias for `TIMESTAMP_NTZ` in Snowflake. The previous v1.1.17 logic created a `TIMESTAMP_TZ` column.
https://github.com/snowflakedb/snowflake-sqlalchemy/compare/v1.1.17...v1.1.18
Hi guys, this issue still persists with snowflake-connector-python==2.7.11. My dataframe column datatype is datetime64[ns, UTC], but when I use df.to_sql to write to Snowflake, result datatype is TIMESTAMP_NTZ. Any ideas how to fix this?
Having the same issue: a pandas `datetime64[ns, America/New_York]` column becomes Snowflake `TIMESTAMP_NTZ(9)` when using Snowpark `session.write_pandas`.
I am also encountering the same problem.

- I have a timezone-aware Python `datetime` object: `start_date = datetime.now(tz=timezone.utc)`.
- I have a Snowflake procedure which expects a `TIMESTAMP_TZ` argument.
- I am calling that Snowflake procedure using Snowpark `session.call`, passing the above `start_date` variable as the argument.
- Contrary to the documentation, the Snowflake data type becomes `TIMESTAMP_NTZ`, and when handled by the procedure, it "converts" the value (back) to `TIMESTAMP_TZ` to match the expected type, which incorrectly alters the timestamp value.
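The value corruption described in the last point can be sketched with plain `datetime` arithmetic (the fixed +8:00 offset below is a stand-in for any non-UTC session timezone):

```python
from datetime import datetime, timedelta, timezone

hk = timezone(timedelta(hours=8))  # stand-in for a non-UTC session timezone

# The original timezone-aware value the client sends.
aware = datetime(2020, 11, 9, 10, 0, tzinfo=timezone.utc)

# TIMESTAMP_NTZ keeps only the wall-clock digits, dropping the offset...
naive = aware.replace(tzinfo=None)

# ...and coercing back to TIMESTAMP_TZ re-attaches the session timezone,
# yielding a different absolute instant.
reinterpreted = naive.replace(tzinfo=hk)

print(aware - reinterpreted)  # 8:00:00 -- the instant silently shifted by 8 hours
```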
Has anyone found a true fix for this? It still persists using sqlalchemy and `pd_writer`.
I'm having the same issue using the `sqlalchemy` engine and `df.to_sql()`. The source data is a pandas timestamp with local time zone information. After writing to Snowflake it becomes `TIMESTAMP_NTZ` type. Anyone working on this issue?
Meanwhile, I was able to get the desired data types by explicitly creating the table with column definitions before ingesting data.
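As an illustration of that workaround, a pre-created table for the DataFrame in the original report might look like the sketch below (column names are taken from the report; non-timestamp columns are omitted and should be filled in for a real schema). The subsequent `to_sql` call should then use `if_exists='append'` so pandas does not drop and recreate the table with `TIMESTAMP_NTZ` columns.

```sql
CREATE TABLE x_ap_account_transactions (
    "ID" BIGINT,
    "TRANSACTION_DATE" TIMESTAMP_TZ,
    "CREATED_ON" TIMESTAMP_TZ,
    "MODIFIED_ON" TIMESTAMP_TZ
);
```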
Also having this issue -- a pandas dataframe with correct time zone offsets is converted to NTZ when sent to Snowflake. I'm using the `sqlalchemy` engine and `df.to_sql()`.
bump
hi and apologies for taking so long to look into this, we're going to change this going forward. checking this issue and thank you Nazarii for your reproduction on the linked issue!
unfortunately it looks like this long regression is still there :( and switching to `sqlalchemy.TIMESTAMP` does not help either; the query is still generated as `updated_at TIMESTAMP`, which is the same as for the original `updated_at` datetime; both are mapped to the `TIMESTAMP_NTZ` Snowflake datatype by default.
we're going to take a look and fix this.
until then, the workarounds already suggested in this issue are available to use:

- setting TIMESTAMP_TYPE_MAPPING to e.g. `TIMESTAMP_TZ` prior to creating the table with the timestamp column
- creating the table with an explicit `TIMESTAMP_TZ` column before ingesting data into it

thank you for bearing with us!
@sfc-gh-dszmolka Thanks, btw I have a PR that fixes this issue; it has been open for a long time, and now there are merge conflicts that need to be fixed.
thanks @Nazarii for the contribution! added some reviewers from the connector team in the hope that the review process can be sped up.
however, as you mentioned, the conflicts still need to be fixed eventually; do you think that would be something possible to do? or even submit a new PR based on the current main? there have probably been tons of changes since the PR was originally submitted