kerchunk
kerchunk copied to clipboard
`TypeError` and empty filenames when using re.Pattern functionality of MultiZarrToZarr’s `coo_mapper`
I’ve ran into and addressed two issues while using the regex pattern functionality of coo_mapper
on MultiZarrToZarr
. It’s unclear whether these reflect intentional design decisions so I’m opening an issue to clarify whether a PR implementing my fixes is warranted or not.
These problems were encountered using kerchunk version 0.0.6
on Python 3.8.4
.
Problem statement
I am preprocessing a collection of NetCDFs that were originally standalone rasters (BILs) and thus lack a time
dimension. Fortunately, their dates are expressed as 8-digit strings in the filename, i.e. file_20210915.nc
. Therefore I generate the time dimension using a Regex pattern with the coo_map
setting on MultiZarrToZarr
:
nc_list = glob.glob(input_dir, “*.nc4”)
json_list = []
for nc in nc_list:
fs = fsspec.filesystem('file')
fs_nc = fs.open(nc)
with fs_nc as infile:
json_list.append(SingleHdf5ToZarr(infile, fs_nc.path).translate())
pattern = re.compile(r'([0-9]{8})') # extract 8 digit YYYYYmmdd strings from file names
mzz_opts = dict( identical_dims = [“lat’, “lon”],
coo_map = {"time" : pattern},
concat_dims = [“time”])
mzz = MultiZarrToZarr(json_list, **mzz_opts)
mzz.translate()
This returns the following error
def _get_value(self, index, z, var, fn=None):
"""derive coordinate value(s) for given input dataset
How to map from input to
index: int
Current place in the list of inputs
z: zarr group
Open for the current input
var: str
name of value to extract.
fn: str
filename
"""
selector = self.coo_map[var]
if isinstance(selector, collections.abc.Callable):
o = selector(index, z, var, fn)
elif isinstance(selector, list):
o = selector[index]
elif isinstance(selector, re.Pattern):
> o = selector.match(fn).groups[0]
E TypeError: expected string or bytes-like object
The problem appears to be the use of re.match
instead of re.search
— match always starts from the 0 index of an object whereas my datetime strings are at the end of variable length strings. I can resolve this error by amending combine.py
line 154 from o = selector.match(fn).groups[0]
to o = re.search(selector,fn)[0]
.
More problematically, I found that combine.py
’s fss
property generates an empty list of filenames. Replacing line 116 (self._paths = [None] * len(fo_list)
as below resolves the issue:
if isinstance(self.path[0], collections.abc.Mapping):
self._paths = []
fo_list = self.path
# self._paths = [None] * len(fo_list) # replaced line
for path in self.path: # new line
self._paths.append(path["templates"]["u"]) # new line
The problem appears to be kerchunk dict objects being recognized as collections.abc.Mapping
instances. Reviewing the code, it’s unclear to me if this is intentional or not. If the latter I’m happy to open a PR.
A small collection of sample files used for this operation can be found at https://www.dropbox.com/sh/yzbqqkqampcjcqo/AACKPVnB9bojkF5QVRZihU3za?dl=0.
Hi @rbanick,
It looks like the Kerchunk package you are using is a bit older as the bug from o = selector.match(fn).groups[0]
has already been fixed by @martindurant in https://github.com/fsspec/kerchunk/commit/5cfa8871d86fe165213cc18fef042cf82258a573
Then MultiZarrtoZarr
is intended to be used for combining existing Kerchunk jsons rather than directly from the .nc4 files and so it is neccessary to first compute these using SingleHdf5ToZarr
before passing these jsons to MultiZarrtoZarr
.
Here is a quick example using the data you provided:
https://gist.github.com/peterm790/78ecd58d994d127594166e207ffeb515
It looks like the Kerchunk package you are using is a bit older
0.0.6
(the most recent release) was created before the fix, so I guess this calls for a bugfix release? @martindurant, what do you think?
@keewis good to hear the first issue is cleared up.
@peterm790 sorry my example wasn't clear -- the error occurs when providing a list of JSONs already processed via SingleHdf5ToZarr
. I've amended the code to be clearer.
Is the second issue with an empty list of filenames resolved in the latest Kerchunk release?
@rbanick I'm facing almost the same issue and, according to version 0.9, seems that the match is still in use. Did you succeded? did you created a workaround?
Right now this are my thoughts:
Assuming a path like:
'/tmp/20230101/CGL_TOC_202301010708_Africa.yaml'
the regex that works to catch the date and time within the namefile and is compatible with the re.match is:
' re.compile('^.+CGL_TOC_(.+?)_')
(in your specific case could be something like ^.+file_(.+?). )
This works like a charm but, unfortunately, the string obtained isn't converted to a DateTime64.
The attempt to convert it to a DateTime64 through the coo_dtypes can't work as the value isn't stored in a POSIX format within the file name.
If there is enough interest in this one option could be to add a keyword for the coo_map like 'time_fn:' that gets, through the regex, the time in string format and convert it to a DateTime64. @martindurant What do you think? Is there any other solution that is already in place and I didn't discovered ?
FYI @annefou
@pl-marasco Back in 2022 the replies to this issue indicated the re.match/search
issue was already being addressed and a PR was not necessary. I made my own fork with the fix and have been using that in my production code, so I didn't pay further attention to this issue.
3 versions later re.match
is still in use instead of re.search
so I guess the issue was not addressed. @martindurant is there a reason re.match
remains or shall I submit a PR? My visibility on Kerchunk's architecture primarily comes from breaking it repeatedly so I didn't feel comfortable PRing without input from maintainers.
FYI I also had to patch line 156 (self._paths = [None] * len(fo_list)
) in my fork to address the second issue of empty file path names.
Please yes, make a PR.
@rbanick Thanks for the quick answer on this. Have you also created a method to convert the text to a DateTime64?
Not beyond astype("datetime"). There are many ways, I suppose.
@martindurant I'll make a PR for both issues later today
@pl-marasco I ran the below code in a preprocess_zarr
function I provide to xr.open_dataset
when opening the zarr JSON Kerchunk produces.
Note that the string format I specified was specific to the file name strings I was parsing (from VHI).
dti = pd.to_datetime(ds.time.values, format="%Y%m%d%H%M%S%f")
ds = ds.assign_coords({'time' : dti})
@rbanick thanks, that definitely would work but I was wondering if a specific method that's doing the same within kerchunk.combine.MultiZarrToZarr would make more sense.
This could be trigged by a new "time: re.Pattern"
within coo_map or modifying the "cf:{var}"
and would need the format and the calendar type to be defined.
Note that you can give arbitrary functions to coo_map too, but I understand why you wouldn't want to.