cftime
cftime copied to clipboard
Dealing with netcdf time:units with unexpected characters
I've found a delightful edge case that is a little hard to believe. It involves a netcdf time:units
that includes a character outside of the [0-9,-] range. If it's not obvious from the below, the issue is that the time:units = "days since 20O1-1-1"
whereas this should be time:units = "days since 2001-1-1"
(so replacing the rogue "O" (oooh), with the numeral "0" zero).
The file is a 297MiB file downloadable from here
Below is the example reproducing the error:
Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> xr.__version__
'2022.6.0'
>>> import cftime
>>> import cftime as cft
>>> cft.__version__
'1.6.1'
>>> ds = xr.open_dataset("/p/css03/esgf_publish/cmip3/ipcc/data3/sresa2/ice/mo/sic/ingv_echam4/run1/sic_O1.nc")
Traceback (most recent call last):
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 270, in decode_cf_datetime
dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 207, in _decode_datetime_with_pandas
raise OutOfBoundsDatetime(
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Cannot decode times from a non-standard calendar, '360_day', using pandas.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 180, in _decode_cf_datetime_dtype
result = decode_cf_datetime(example_value, units, calendar, use_cftime)
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 272, in decode_cf_datetime
dates = _decode_datetime_with_cftime(
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 201, in _decode_datetime_with_cftime
cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
File "src/cftime/_cftime.pyx", line 549, in cftime._cftime.num2date
File "src/cftime/_cftime.pyx", line 107, in cftime._cftime._dateparse
File "src/cftime/_cftime.pyx", line 750, in cftime._cftime._parse_date
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/api.py", line 531, in open_dataset
backend_ds = backend.open_dataset(
File "~mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 569, in open_dataset
ds = store_entrypoint.open_dataset(
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/backends/store.py", line 29, in open_dataset
vars, attrs, coord_names = conventions.decode_cf_variables(
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/conventions.py", line 521, in decode_cf_variables
new_vars[k] = decode_cf_variable(
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/conventions.py", line 369, in decode_cf_variable
var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 682, in decode
dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
File "~/mambaforge/envs/xcd031spy532mat353/lib/python3.10/site-packages/xarray/coding/times.py", line 190, in _decode_cf_datetime_dtype
raise ValueError(msg)
ValueError: unable to decode time units 'days since 20O1-1-1' with "calendar '360_day'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.
I wonder if a regex check would be useful to implement? This problem tripped me up for a while, and it was not at all obvious that an incorrect character (which looks almost identical, depending on fonts) was the root cause. Testing for a datestring that matches regex r"(?:[0-9][0-9])?[0-9][0-9]-(?:[0-1])?[0-9]-(?:[0-3])?[0-9]"
could be a useful test to catch such a fringe case - and point out the issue obviously in the error message. It seems in the CF Conventions docs that there is little leeway in this format, so using "/"
or alternative MM-DD-YYYY
formats to the standard [YYY]Y-[M]M-[D]D HH:MM:SS.ss [-]0:00
And just because https://github.com/pydata/xarray/discussions/7144
A PR that catches this, and adds more leeway in the formatting (consistent with CF) would be welcome.
@jswhit I am a little confused how the regex following the CF standards at https://github.com/Unidata/cftime/blob/master/src/cftime/_cftime.pyx#L45-L47 isn't catching this, it's an out-of-bounds character
I'm unfamiliar with this library, so would need some pointers to get started
Sorry, can't be of much help - as the comment says this regex was lifted from http://delete.me.uk/2005/03/iso8601.html but apparently that link no longer exists. I know almost nothing about regexes.
Hi - I've looked at this regex in the (dim and distant) past - happy to have take a look here, if that's OK.
The quick test below shows the regex passing in the "letter O" case, but has too many None
s in its matched groups. I'll have a look at whether the regex could/should be updated, or if a check in cpdef _parse_date(datestring)
(https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L750), which is where the regex is used, might be an alternative.
>>> import re
>>> ISO8601_REGEX = re.compile(r"(?P<year>[+-]?[0-9]+)(-(?P<month>[0-9]{1,2})(-(?P<day>[0-9]{1,2})"
r"(((?P<separator1>.)(?P<hour>[0-9]{1,2}):(?P<minute>[0-9]{1,2})(:(?P<second>[0-9]{1,2})(\.(?P<fraction>[0-9]+))?)?)?"
r"((?P<separator2>.?)(?P<timezone>Z|(([-+])([0-9]{2})((:([0-9]{2}))|([0-9]{2}))?)))?)?)?)?"
) # From https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L45-L48
>>> ISO8601_REGEX.match('2001-01-01').groups() # All numbers
('2001',
'-01-01',
'01',
'-01',
'01',
'',
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None)
>>> ISO8601_REGEX.match('20O1-01-01').groups() # Letter O
('20',
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None)
Thanks @davidhassell! I hit some weird duplication issue in more simple regex I was using, it validated correctly, but then seemed to cycle over things again and again - https://regex101.com/ was really helpful in pointing that out to me