Time zone lost on csv write and can't be read without pyarrow
Polars version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of polars.
Issue Description
Upon writing a data frame with a UTC date time column, the timezone is not preserved with the typical ISO 8601 format. Other data frame libraries like pandas preserve it and also parse it correctly upon reading a CSV with that format.
Reproducible Example
import pandas as pd
import datetime
sample = datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)
pd.Series(sample).to_csv('pd-tz.csv')
pd.read_csv('pd-tz.csv')
# med: 0 0
# 0 0 2022-01-01 23:23:00+00:00
import polars as pl
df = pl.DataFrame(pl.Series([sample]).dt.with_time_zone('UTC'))
# shape: (1, 1)
# ┌─────────────────────────┐
# │ column_0 │
# │ --- │
# │ datetime[μs, UTC] │
# ╞═════════════════════════╡
# │ 2022-01-01 23:23:00 │
# └─────────────────────────┘
df.write_csv('pl-tz.csv')
pl.read_csv('pl-tz.csv', parse_dates=True)
# shape: (1, 1)
# ┌─────────────────────────┐
# │ column_0 │
# │ --- │
# │ datetime[μs] │
# ╞═════════════════════════╡
# │ 2022-01-01 23:23:00 │
# └─────────────────────────┘
pl.read_csv('pd-tz.csv', parse_dates=True)
# shape: (1, 2)
# ┌─────┬───────────────────────────┐
# │ ┆ 0 │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═══════════════════════════╡
# │ 0 ┆ 2022-01-01 23:23:00+00:00 │
# └─────┴───────────────────────────┘
Expected Behavior
- The timezone is persisted on write by abiding to the ISO 8601 format as it happens with pandas:
23:23:00+00:00instead of23:23:00. - The timezone is recognised upon reading the csv file when
parse_dates=True.
Installed Versions
The current csv reader and writer do not support timezones. We should either document that, or add support.
Do note, that for reading, you can set use_pyarrow=True to use the pyarrow reader, and for me that returns (your last example):
>>> pl.read_csv('pd-tz.csv', parse_dates=True, use_pyarrow=True)
shape: (1, 2)
┌─────┬─────────────────────────┐
│ ┆ 0 │
│ --- ┆ --- │
│ i64 ┆ datetime[ms, UTC] │
╞═════╪═════════════════════════╡
│ 0 ┆ 2022-01-01 23:23:00 UTC │
└─────┴─────────────────────────┘
Ok, right. And then there's still the writing. And adding pyarrow dependency just for that seems not optimal, in particular since the availability of binaries of that packages for pypy is not something to take for granted. Also, I think timezone support is a popular request.
You may want to try to experiment a bit, the csv writer does support the timezone formatting string as long as you specify it as an UTC offset:
import datetime
sample = datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)
df = pl.DataFrame(pl.Series([sample]).dt.with_time_zone('UTC'))
df.write_csv()
'column_0\n2022-01-01T23:23:00.000000\n'
df.write_csv(datetime_format="%Y%m%d %H%M%S %z")
'column_0\n20220101 232300 +0000\n'
Suppose we take the 5h offset:
>>> df = pl.DataFrame(pl.Series([sample]).dt.with_time_zone('-05:00'))
You can't call df for display, as that expects not an offset but a string label as timezone, but you can do:
>>> df.write_csv()
'column_0\n2022-01-01T10:23:00.000000\n'
>>> df.write_csv(datetime_format="%Y%m%d %H%M%S %z")
'column_0\n20220101 102300 -0500\n'
Having said that, it is fair to say this is not really supported yet.
This should now be supported:
In [26]: import pandas as pd
...: import datetime
...:
...: sample = datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)
...: pd.Series(sample).to_csv('pd-tz.csv')
...: pd.read_csv('pd-tz.csv')
Out[26]:
Unnamed: 0 0
0 0 2022-01-01 23:23:00+00:00
In [27]: import polars as pl
...: df = pl.DataFrame(pl.Series([sample]).dt.replace_time_zone('UTC'))
In [28]: df
Out[28]:
shape: (1, 1)
┌─────────────────────────┐
│ column_0 │
│ --- │
│ datetime[μs, UTC] │
╞═════════════════════════╡
│ 2022-01-01 23:23:00 UTC │
└─────────────────────────┘
In [29]: df.write_csv('pl-tz.csv')
In [30]: pl.read_csv('pl-tz.csv', try_parse_dates=True)
Out[30]:
shape: (1, 1)
┌─────────────────────────────────┐
│ column_0 │
│ --- │
│ str │
╞═════════════════════════════════╡
│ 2022-01-01T23:23:00.000000+0000 │
└─────────────────────────────────┘
so I think the issue can be closed?
Other data frame libraries like pandas preserve it and also parse it correctly upon reading a CSV with that format.
for completeness - pandas wouldn't have parsed it in the example you provided, it would've stayed as dtype object
In [32]: pd.read_csv('pd-tz.csv').dtypes
Out[32]:
Unnamed: 0 int64
0 object
dtype: object
you'd have had to do
In [38]: pd.read_csv('pd-tz.csv', parse_dates=[1]).dtypes
Out[38]:
Unnamed: 0 int64
0 datetime64[ns, UTC]
dtype: object
to get it to parse
Given no follow-up from the OP, and that this functionality has been added, closing.