
Constructing a DateTime Series does not include timezone of values

Open stinodego opened this issue 3 years ago • 14 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of polars.

Issue Description

The Series constructor does not pick up timezone information from the provided values.

Reproducible Example

import polars as pl
from datetime import datetime
import pytz

s1 = pl.Series("dt", [datetime(2001, 1, 1)]).dt.with_time_zone(tz="UTC")  # Includes time zone info
s2 = pl.Series("dt", [datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))])  # Does not include time zone info

assert s1.series_equal(s2)  # Fails

Expected Behavior

I would expect the Series constructor to detect that the provided values are time zone specific, and construct the Series appropriately.

Installed Versions

---Version info---
Polars: 0.14.8
Index type: UInt32
Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python: 3.10.6 (main, Aug 15 2022, 22:17:55) [GCC 11.2.0]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.4.4
numpy: 1.23.2
fsspec:
connectorx:
xlsx2csv:
pytz: 2022.2.1

stinodego avatar Sep 02 '22 20:09 stinodego

Similarly, timezone info is lost like this:

import datetime 
import polars as pl 

sample = datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)
sample
#> datetime.datetime(2022, 1, 1, 23, 23, tzinfo=datetime.timezone.utc)

pl.Series([sample])[0]
#> datetime.datetime(2022, 1, 1, 23, 23)
pl.__version__
#> '0.14.9'

lorenzwalthert avatar Sep 08 '22 07:09 lorenzwalthert

This is currently by design, as I am not really sure how to deal with that efficiently. For now it is up to the caller to set the timezone once a Series is constructed.

ritchie46 avatar Sep 08 '22 07:09 ritchie46

Thanks for the quick answer. You mean efficiency from a compute perspective? It's just a little bit unexpected if you have a timezone set on the input and that information is lost. Surely the person who sets it had an intention of preserving it? Alternatively, maybe issue a warning when a time zone is set on input?
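
To make the warning idea concrete, here is a minimal standard-library sketch of what such a check could look like. The helper name warn_if_tz_aware is hypothetical and is not part of polars; it only illustrates detecting time zone aware values before they are handed to a constructor.

```python
import warnings
from datetime import datetime, timezone


def warn_if_tz_aware(values):
    """Hypothetical helper (not a polars API): warn if any value is tz-aware."""
    # A datetime is "aware" when utcoffset() returns something other than None.
    aware = [
        v for v in values
        if isinstance(v, datetime) and v.utcoffset() is not None
    ]
    if aware:
        warnings.warn(
            f"{len(aware)} time zone aware value(s) found; "
            "the time zone may be dropped on construction",
            UserWarning,
            stacklevel=2,
        )
    return bool(aware)


sample = datetime(2022, 1, 1, 23, 23, tzinfo=timezone.utc)
found = warn_if_tz_aware([sample])  # emits a UserWarning and returns True
```

Whether a check like this is cheap enough to run on every input is exactly the efficiency question raised above; the sketch scans every value, which may not be acceptable for large inputs.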

lorenzwalthert avatar Sep 08 '22 08:09 lorenzwalthert

assert s1.series_equal(s2) # Fails

I'm confused by the report - are you saying that it fails, or that you expect it to fail? I'm not seeing any failure:

In [10]: import polars as pl
    ...: import pytz
    ...: 
    ...: s1 = pl.Series("dt", [datetime(2001, 1, 1)]).dt.with_time_zone(tz="UTC")  # Includes time zone info
    ...: s2 = pl.Series("dt", [datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))])  # Does not include time zone info
    ...: 

In [11]: s1
Out[11]: 
shape: (1,)
Series: 'dt' [datetime[μs, UTC]]
[
        2001-01-01 00:00:00 UTC
]

In [12]: s2
Out[12]: 
shape: (1,)
Series: 'dt' [datetime[μs, UTC]]
[
        2001-01-01 00:00:00 UTC
]

In [13]: assert s1.series_equal(s2)

MarcoGorelli avatar Jan 19 '23 22:01 MarcoGorelli

It fails. Just verified, still raises an AssertionError.

Version info:

---Version info---
Polars: 0.15.14
Index type: UInt32
Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python: 3.11.0 (main, Nov  1 2022, 09:16:00) [GCC 11.2.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.24.0
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: 0.8.1
matplotlib: <not installed>

stinodego avatar Jan 19 '23 22:01 stinodego

🤔 how odd, it doesn't raise anything for me, and we have practically the same setup

(.311venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ cat t.py
import polars as pl
from datetime import datetime
import pytz

s1 = pl.Series("dt", [datetime(2001, 1, 1)]).dt.with_time_zone(tz="UTC")  # Includes time zone info
s2 = pl.Series("dt", [datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))])  # Does not include time zone info

assert s1.series_equal(s2)  # Fails

(.311venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
(.311venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python -c 'import polars; print(polars.show_versions())'
---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python: 3.11.1 (main, Dec  7 2022, 01:11:34) [GCC 11.3.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: <not installed>
numpy: 1.24.1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>
None

MarcoGorelli avatar Jan 20 '23 08:01 MarcoGorelli

I'm not getting any assertion error running it in a Kaggle notebook either: https://www.kaggle.com/code/marcogorelli/polars-issue-4700/notebook

Have I misunderstood something about how to run the snippet?

MarcoGorelli avatar Jan 20 '23 10:01 MarcoGorelli

I think the polars part works correctly.

The "weird" part happens in Python's astimezone call: it converts your datetime to the timezone you pass in, but it assumes a naive datetime is in your local timezone. (@MarcoGorelli I assume your default timezone is UTC+0, which is why it works for you.)

datetime.datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))
In [55]: d = datetime.datetime(2001, 1, 1)

In [56]: ?d.astimezone
Docstring: tz -> convert to local time in new timezone tz

In [49]: datetime.datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))
Out[49]: datetime.datetime(2000, 12, 31, 23, 0, tzinfo=<UTC>)

In [50]: datetime.datetime(2001, 1, 1).astimezone(pytz.timezone("Europe/Brussels"))
Out[50]: datetime.datetime(2001, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Europe/Brussels' CET+1:00:00 STD>)

In [51]: pl.Series("dt", [datetime.datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))])
Out[51]: 
shape: (1,)
Series: 'dt' [datetime[μs, UTC]]
[
	2000-12-31 23:00:00 UTC
]

In [52]: pl.Series("dt", [datetime.datetime(2001, 1, 1).astimezone(pytz.timezone("Europe/Brussels"))])
Out[52]: 
shape: (1,)
Series: 'dt' [datetime[μs, Europe/Brussels]]
[
	2001-01-01 00:00:00 CET
]
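
The same behaviour can be seen without polars or pytz. The following is a small sketch using only the standard library; the variable names are illustrative. It contrasts replace(tzinfo=...), which only labels the value, with astimezone(...), which assumes a naive value is in the machine's local time zone and then converts it.

```python
from datetime import datetime, timezone

naive = datetime(2001, 1, 1)

# replace() attaches the time zone without changing the wall-clock fields.
labelled = naive.replace(tzinfo=timezone.utc)

# astimezone() on a naive datetime assumes it is in the system's local
# time zone, then converts that instant to the target zone.
converted = naive.astimezone(timezone.utc)

# The local UTC offset that the interpreter assumed for `naive`.
local_offset = naive.astimezone().utcoffset()

# The two results differ by exactly the local UTC offset.
assert labelled - converted == local_offset

print(labelled)
print(converted)
print(local_offset)
```

On a machine whose local zone is UTC the two printed datetimes are identical; anywhere else they differ by the printed offset.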

ghuls avatar Jan 20 '23 10:01 ghuls

I assume your default timezone is UTC+0, so that is why it works for you

True (I'm in the UK), but still works for me even if I set a different timezone (which I'm most definitely not in):

In [64]: import polars as pl
    ...: import pytz
    ...:
    ...: tz = 'US/Pacific'
    ...: s1 = pl.Series("dt", [datetime(2001, 1, 1)]).dt.with_time_zone(tz=tz)
    ...: s2 = pl.Series("dt", [datetime(2001, 1, 1).astimezone(pytz.timezone(tz))])
    ...:
    ...: assert s1.series_equal(s2)

In [65]: s1
Out[65]:
shape: (1,)
Series: 'dt' [datetime[μs, US/Pacific]]
[
        2000-12-31 16:00:00 PST
]

In [66]: s2
Out[66]:
shape: (1,)
Series: 'dt' [datetime[μs, US/Pacific]]
[
        2000-12-31 16:00:00 PST
]

Also, your code shows that pl.Series("dt", [datetime.datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))]) does indeed include timezone info (Series: 'dt' [datetime[μs, UTC]]), whereas the original report has a comment on that line saying # Does not include time zone info

I'm generally interested in time-series, but if I can't reproduce the issue then I don't know where to start - so any help with reproducing this would be appreciated

MarcoGorelli avatar Jan 20 '23 13:01 MarcoGorelli

It might be that I fixed this already 🙈 Hmm.. @stinodego could you try on master?

ritchie46 avatar Jan 20 '23 14:01 ritchie46

OK, I think it actually changed from when I initially reported it, but this is what happens now (from the master branch):

import polars as pl
from datetime import datetime
import pytz

# Correct output
s1 = pl.Series("dt", [datetime(2001, 1, 1)]).dt.with_time_zone(tz="UTC")  # Includes time zone info
print(s1)

# shape: (1,)
# Series: 'dt' [datetime[μs, UTC]]
# [
#         2001-01-01 00:00:00 UTC
# ]

# Incorrect output
s2 = pl.Series("dt", [datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))])  # Time shifts by an hour??
print(s2)

# shape: (1,)
# Series: 'dt' [datetime[μs, UTC]]
# [
#         2000-12-31 23:00:00 UTC
# ]

assert s1.series_equal(s2)  # Fails

As you can see, the resulting Series now actually contains timezone information, but the underlying datetime is incorrect.

stinodego avatar Jan 20 '23 16:01 stinodego

Ah got it, here's how to reproduce if you live in a UTC place:

import polars as pl
from datetime import datetime
import pytz
import os
import time

os.environ['TZ'] = 'Europe/Brussels'
time.tzset()

s1 = pl.Series("dt", [datetime(2001, 1, 1)]).dt.with_time_zone(tz="UTC")
s2 = pl.Series("dt", [datetime(2001, 1, 1).astimezone(pytz.timezone("UTC"))])

print(s1)
print(s2)

assert s1.series_equal(s2)  # Fails

MarcoGorelli avatar Jan 20 '23 17:01 MarcoGorelli

Right, so as far as I can tell:

  • in polars, .dt.with_time_zone(tz) on a naive time series will convert from UTC to tz
  • in Python datetime, .astimezone(tz) on a naive datetime converts from your local timezone to tz (indeed, as @ghuls had said, I'd just misunderstood, sorry)
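
The two rules above can be put side by side in plain Python. This is only a sketch: the "polars-like" line models the behaviour as described in this thread (naive values read as UTC) rather than calling polars, and the fixed UTC-8 offset is an assumed stand-in for US/Pacific in winter.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed offset standing in for US/Pacific (PST) in January.
target = timezone(timedelta(hours=-8), "PST")
naive = datetime(2001, 1, 1)

# Rule 1 (polars, as described in this thread): read the naive value as UTC,
# then convert to the target zone.
polars_like = naive.replace(tzinfo=timezone.utc).astimezone(target)

# Rule 2 (Python datetime): read the naive value as local time,
# then convert to the target zone.
python_like = naive.astimezone(target)

# Offset of the machine's local zone at that moment.
local_offset = naive.astimezone().utcoffset()

# The two rules agree only when the local zone has a zero UTC offset.
assert (polars_like == python_like) == (local_offset == timedelta(0))

print(polars_like)  # 2000-12-31 16:00:00-08:00
print(python_like)
```

This is consistent with the assertion passing on a machine set to UK time in January and failing on one set to Europe/Brussels.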

So, all looks correct, I'd suggest just adding an example and test and closing - I'll make a quick PR

MarcoGorelli avatar Jan 20 '23 17:01 MarcoGorelli

Wow, you're right. The conversion already happens in Python datetime.

Feels very unintuitive to me, but timezones do that sometimes. At least Polars seems to handle things correctly.

A PR is welcome; then we can close this.

stinodego avatar Jan 20 '23 17:01 stinodego