polars icon indicating copy to clipboard operation
polars copied to clipboard

Time zone lost on csv write and can't be read without pyarrow

Open lorenzwalthert opened this issue 3 years ago • 3 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of polars.

Issue Description

Upon writing a data frame with a UTC date time column, the timezone is not preserved with the typical ISO 8601 format. Other data frame libraries like pandas preserve it and also parse it correctly upon reading a CSV with that format.

Reproducible Example

import pandas as pd 
import datetime 

sample = datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)
pd.Series(sample).to_csv('pd-tz.csv')
pd.read_csv('pd-tz.csv')
# med: 0                          0
# 0           0  2022-01-01 23:23:00+00:00


import polars as pl
df = pl.DataFrame(pl.Series([sample]).dt.with_time_zone('UTC'))
# shape: (1, 1)
# ┌─────────────────────────┐
# │ column_0                │
# │ ---                     │
# │ datetime[μs, UTC]       │
# ╞═════════════════════════╡
# │ 2022-01-01 23:23:00     │
# └─────────────────────────┘

df.write_csv('pl-tz.csv')
pl.read_csv('pl-tz.csv', parse_dates=True)
# shape: (1, 1)
# ┌─────────────────────────┐
# │ column_0                │
# │ ---                     │
# │ datetime[μs]            │
# ╞═════════════════════════╡
# │ 2022-01-01 23:23:00     │
# └─────────────────────────┘


pl.read_csv('pd-tz.csv', parse_dates=True)
# shape: (1, 2)
# ┌─────┬───────────────────────────┐
# │     ┆ 0                         │
# │ --- ┆ ---                       │
# │ i64 ┆ str                       │
# ╞═════╪═══════════════════════════╡
# │ 0   ┆ 2022-01-01 23:23:00+00:00 │
# └─────┴───────────────────────────┘

Expected Behavior

  1. The timezone is persisted on write by abiding to the ISO 8601 format as it happens with pandas: 23:23:00+00:00 instead of 23:23:00.
  2. The timezone is recognised upon reading the csv file when parse_dates=True.

Installed Versions

---Version info--- Polars: 0.14.9 Index type: UInt32 Platform: macOS-12.5.1-arm64-arm-64bit Python: 3.9.12 (v3.9.12:b28265d7e6, Mar 23 2022, 18:22:40) [Clang 13.0.0 (clang-1300.0.29.30)] ---Optional dependencies--- pyarrow: pandas: 1.4.3 numpy: 1.23.2 fsspec: connectorx: xlsx2csv: pytz: 2022.2.1

lorenzwalthert avatar Sep 08 '22 08:09 lorenzwalthert

The current csv reader and writer do not support timezones. We should either document that, or add support.

Do note, that for reading, you can set use_pyarrow=True to use the pyarrow reader, and for me that returns (your last example):

>>> pl.read_csv('pd-tz.csv', parse_dates=True, use_pyarrow=True)
shape: (1, 2)
┌─────┬─────────────────────────┐
│     ┆ 0                       │
│ --- ┆ ---                     │
│ i64 ┆ datetime[ms, UTC]       │
╞═════╪═════════════════════════╡
│ 0   ┆ 2022-01-01 23:23:00 UTC │
└─────┴─────────────────────────┘

zundertj avatar Sep 10 '22 06:09 zundertj

Ok, right. And then there's still the writing. And adding pyarrow dependency just for that seems not optimal, in particular since the availability of binaries of that packages for pypy is not something to take for granted. Also, I think timezone support is a popular request.

lorenzwalthert avatar Sep 10 '22 11:09 lorenzwalthert

You may want to try to experiment a bit, the csv writer does support the timezone formatting string as long as you specify it as an UTC offset:

import datetime 
sample = datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)
df = pl.DataFrame(pl.Series([sample]).dt.with_time_zone('UTC'))
df.write_csv()
'column_0\n2022-01-01T23:23:00.000000\n'
df.write_csv(datetime_format="%Y%m%d %H%M%S %z")
'column_0\n20220101 232300 +0000\n'

Suppose we take the 5h offset:

>>> df = pl.DataFrame(pl.Series([sample]).dt.with_time_zone('-05:00'))

You can't call df for display, as that expects not an offset but a string label as timezone, but you can do:

>>> df.write_csv()
'column_0\n2022-01-01T10:23:00.000000\n'
>>> df.write_csv(datetime_format="%Y%m%d %H%M%S %z")
'column_0\n20220101 102300 -0500\n'

Having said that, it is fair to say this is not really supported yet.

zundertj avatar Sep 11 '22 06:09 zundertj

This should now be supported:

In [26]: import pandas as pd
    ...: import datetime
    ...:
    ...: sample = datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)
    ...: pd.Series(sample).to_csv('pd-tz.csv')
    ...: pd.read_csv('pd-tz.csv')
Out[26]:
   Unnamed: 0                          0
0           0  2022-01-01 23:23:00+00:00

In [27]: import polars as pl
    ...: df = pl.DataFrame(pl.Series([sample]).dt.replace_time_zone('UTC'))

In [28]: df
Out[28]:
shape: (1, 1)
┌─────────────────────────┐
│ column_0                │
│ ---                     │
│ datetime[μs, UTC]       │
╞═════════════════════════╡
│ 2022-01-01 23:23:00 UTC │
└─────────────────────────┘

In [29]: df.write_csv('pl-tz.csv')

In [30]: pl.read_csv('pl-tz.csv', try_parse_dates=True)
Out[30]:
shape: (1, 1)
┌─────────────────────────────────┐
│ column_0                        │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ 2022-01-01T23:23:00.000000+0000 │
└─────────────────────────────────┘

so I think the issue can be closed?

MarcoGorelli avatar Feb 20 '23 21:02 MarcoGorelli

Other data frame libraries like pandas preserve it and also parse it correctly upon reading a CSV with that format.

for completeness - pandas wouldn't have parsed it in the example you provided, it would've stayed as dtype object

In [32]: pd.read_csv('pd-tz.csv').dtypes
Out[32]:
Unnamed: 0     int64
0             object
dtype: object

you'd have had to do

In [38]: pd.read_csv('pd-tz.csv', parse_dates=[1]).dtypes
Out[38]:
Unnamed: 0                  int64
0             datetime64[ns, UTC]
dtype: object

to get it to parse

MarcoGorelli avatar Feb 20 '23 21:02 MarcoGorelli

Given no follow-up from the OP, and that this functionality has been added, closing.

zundertj avatar Mar 07 '23 20:03 zundertj