polars
polars copied to clipboard
`pl.Datetime` `time_zone` parameter has no type or value check
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue Description
The time zone parameter in pl.Datetime
can be an invalid string, and can even be any type of Python object. We should probably ensure that the time zone is a valid time zone when constructing a pl.Datetime
datatype.
import polars as pl
pl.Datetime(time_zone="invalid_time_zone")
# Datetime(time_unit='us', time_zone='invalid_time_zone')
pl.Datetime(time_zone=object())
# Datetime(time_unit='us', time_zone=<object object at 0x7f245331d0d0>)
Installed versions
main
thanks for reporting - the validation currently happens later:
In [9]: pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime(time_zone='cabbage'))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
Cell In[9], line 1
----> 1 pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime(time_zone='cabbage'))
File ~/scratch/.venv/lib/python3.11/site-packages/polars/series/series.py:312, in Series.__init__(self, name, values, dtype, strict, nan_to_null, dtype_if_empty)
309 raise TypeError(msg)
311 if isinstance(values, Sequence):
--> 312 self._s = sequence_to_pyseries(
313 name,
314 values,
315 dtype=dtype,
316 strict=strict,
317 nan_to_null=nan_to_null,
318 )
320 elif values is None:
321 self._s = sequence_to_pyseries(name, [], dtype=dtype)
File ~/scratch/.venv/lib/python3.11/site-packages/polars/_utils/construction/series.py:235, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
225 if values_tz != "UTC" and dtype_tz is None:
226 warnings.warn(
227 "Constructing a Series with time-zone-aware "
228 "datetimes results in a Series with UTC time zone. "
(...)
233 stacklevel=find_stacklevel(),
234 )
--> 235 return s.dt.replace_time_zone(dtype_tz or "UTC")._s
236 return s._s
238 elif (
239 _check_for_numpy(value)
240 and isinstance(value, np.ndarray)
241 and len(value.shape) == 1
242 ):
File ~/scratch/.venv/lib/python3.11/site-packages/polars/series/utils.py:107, in call_expr.<locals>.wrapper(self, *args, **kwargs)
105 expr = getattr(expr, namespace)
106 f = getattr(expr, func.__name__)
--> 107 return s.to_frame().select_seq(f(*args, **kwargs)).to_series()
File ~/scratch/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py:7906, in DataFrame.select_seq(self, *exprs, **named_exprs)
7883 def select_seq(
7884 self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
7885 ) -> DataFrame:
7886 """
7887 Select columns from this DataFrame.
7888
(...)
7904 select
7905 """
-> 7906 return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)
File ~/scratch/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1810, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
1807 if background:
1808 return InProcessQuery(ldf.collect_concurrently())
-> 1810 return wrap_df(ldf.collect())
ComputeError: unable to parse time zone: 'cabbage'. Please check the Time Zone Database for a list of available time zones
Hi @MarcoGorelli, should we add some checks here
https://github.com/pola-rs/polars/blob/2970c5706a5bbc39840ca5b0ffc9d4b3aab502cb/py-polars/polars/datatypes/classes.py#L461-L463
Similar to what's done here
https://github.com/pola-rs/polars/blob/2970c5706a5bbc39840ca5b0ffc9d4b3aab502cb/py-polars/polars/_utils/convert.py#L163-L194
Or should we leave the error as it is? Do you have any thoughts on this? If possible, I'd like to improve it.
Using Pyright/MyPy, you'll get a static type-checking warning if you try and pass in object()
to time_zone
here. The only way this could perhaps be built on (from a type-checking perspective), is maintaining a literal of all possible timezones, and using that instead of str
.
Related to this: it's a bit concerning that the invalid time zone can end up in Rust without it being validated:
import polars as pl
dtype = pl.Datetime(time_zone="invalid_time_zone")
s = pl.Series(dtype=dtype)
print(s)
shape: (0,)
Series: '' [datetime[μs, invalid_time_zone]]
[
]
true, validation only kicks in if there's an actual date in the column. maybe it can happen earlier
hey for testing time zones in polars, can we add
df[col_name].dt.tz
to get a series of time zone strings?
could be more performant than
def tz_from_df(df: pl.DataFrame, date_col_name: str = "date") -> str:
tz = df.schema[date_col_name].time_zone
return tz