Added timezone type to written DataFrames when the corresponding pandas DataFrame also has timezones
Please answer these questions before submitting your pull requests. Thanks!
1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes #1952
2. Fill out the following pre-review checklist:
- [ ] I am adding a new automated test(s) to verify correctness of my new code
- [ ] I am adding new logging messages
- [ ] I am adding a new telemetry message
- [ ] I am modifying authorization mechanisms
- [ ] I am adding new credentials
- [ ] I am modifying OCSP code
- [ ] I am adding a new dependency
3. Please describe how your code solves the related issue.
Checks whether the DataFrame has any tz_columns (timezone-aware columns) and changes the column type mapping so that the resulting type is a timezone type; a rough sketch of the idea is shown below.
Didn't manage to install tox on my device, so some checks might be failing :/
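To illustrate the approach, here is a minimal sketch. Note that `adjust_mapping_for_tz` and `column_type_mapping` are hypothetical names for illustration only, not the connector's actual internals:

```python
from __future__ import annotations

import pandas as pd


# Hypothetical helper illustrating the idea; not the connector's real code.
def adjust_mapping_for_tz(
    df: pd.DataFrame, column_type_mapping: dict[str, str]
) -> dict[str, str]:
    # Columns whose dtype carries timezone information should map to
    # TIMESTAMP_TZ instead of a timezone-naive timestamp type.
    tz_columns = [
        col for col in df.columns
        if isinstance(df[col].dtype, pd.DatetimeTZDtype)
    ]
    for col in tz_columns:
        column_type_mapping[col] = "TIMESTAMP_TZ"
    return column_type_mapping
```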
Isn't that to be expected? Types are per column, so a single column cannot represent two different timezones exactly anyway. But I think it gets handled correctly when saving to Parquet:
```python
import pytz
import pandas as pd
import pyarrow.parquet as pq
from datetime import datetime

# One column containing two datetimes with different timezones.
df = pd.DataFrame({"DT": [
    datetime.now(tz=pytz.timezone("Europe/Amsterdam")),
    datetime.now(tz=pytz.timezone("UTC")),
]})
df.to_parquet("test.parquet")
pq.read_schema("test.parquet")
```
This gives the following output:

```
DT: timestamp[us, tz=Europe/Amsterdam]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 366
```

And the resulting dataframe looks as follows when re-reading it:

```
                                 DT
0 2024-07-01 08:54:18.603081+02:00
1 2024-07-01 08:54:18.603099+02:00
```

with dtypes `DT datetime64[us, Europe/Amsterdam]`.
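For completeness, the read-back can be reproduced by continuing the snippet above. Note that both values end up in the file's single column timezone, so the UTC value is converted to the Amsterdam offset rather than preserved:

```python
# Continue the snippet above: read the file back and inspect the dtype.
df_read = pd.read_parquet("test.parquet")
print(df_read)         # both rows rendered with the +02:00 offset
print(df_read.dtypes)  # DT    datetime64[us, Europe/Amsterdam]
```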
Here is the adapted unit test (changed lines marked; `conn_cnx` and `random_string` come from the connector's test suite):

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Callable, Generator

import pandas
import pytest

from snowflake.connector import SnowflakeConnection
from snowflake.connector.pandas_tools import write_pandas


@pytest.mark.parametrize("use_logical_type", [None, True, False])
def test_write_pandas_use_logical_type(
    conn_cnx: Callable[..., Generator[SnowflakeConnection, None, None]],
    use_logical_type: bool | None,
):
    table_name = random_string(5, "USE_LOCAL_TYPE_").upper()
    col_name = "DT"
    create_sql = f"CREATE OR REPLACE TABLE {table_name} ({col_name} TIMESTAMP_TZ)"
    select_sql = f"SELECT * FROM {table_name}"
    drop_sql = f"DROP TABLE IF EXISTS {table_name}"
    timestamp = datetime(
        year=2020,
        month=1,
        day=2,
        hour=3,
        minute=4,
        second=5,
        microsecond=6,
        tzinfo=timezone(timedelta(hours=2)),
    )
    # ------------------ changed/new lines start ------------------
    # A second timestamp with a different UTC offset, so the column
    # mixes timezones and pandas stores it with object dtype.
    timestamp_2 = datetime(
        year=2020,
        month=1,
        day=2,
        hour=3,
        minute=4,
        second=5,
        microsecond=6,
        tzinfo=timezone(timedelta(hours=4)),
    )
    df_write = pandas.DataFrame({col_name: [timestamp, timestamp_2]})
    # ------------------- changed/new lines end -------------------
    with conn_cnx() as cnx:  # type: SnowflakeConnection
        cnx.cursor().execute(create_sql).fetchall()
        write_pandas_kwargs = dict(
            conn=cnx,
            df=df_write,
            use_logical_type=use_logical_type,
            auto_create_table=False,
            table_name=table_name,
        )
        try:
            # When use_logical_type = True, timezone-aware datetimes should be
            # correctly written to Snowflake.
            if use_logical_type:
                write_pandas(**write_pandas_kwargs)
                df_read = cnx.cursor().execute(select_sql).fetch_pandas_all()
                assert all(df_write == df_read)
                assert pandas.api.types.is_datetime64tz_dtype(df_read[col_name])
            # For other use_logical_type values, a UserWarning should be displayed.
            else:
                with pytest.warns(UserWarning, match="Dataframe contains a datetime.*"):
                    write_pandas(**write_pandas_kwargs)
        finally:
            cnx.execute_string(drop_sql)
```
When adding an example similar to yours to the unit test, everything works when setting use_logical_type=True, but not when it is None or False, since then the check here fails.
Hence this check probably needs adaptation; one possible direction is sketched below.
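A sketch of such an adaptation, assuming the current check only inspects tz-aware dtypes and therefore misses mixed-timezone columns that pandas stores with object dtype (the helper name is hypothetical, and the connector's actual check may differ):

```python
from datetime import datetime

import pandas as pd


# Hypothetical detection helper; not the connector's real code.
def contains_tz_aware_datetimes(df: pd.DataFrame) -> bool:
    for col in df.columns:
        # Homogeneous tz-aware columns have a DatetimeTZDtype.
        if isinstance(df[col].dtype, pd.DatetimeTZDtype):
            return True
        # Mixed-timezone columns fall back to object dtype, so the
        # individual values have to be inspected as well.
        if df[col].dtype == object and any(
            isinstance(v, datetime) and v.tzinfo is not None for v in df[col]
        ):
            return True
    return False
```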