kerchunk icon indicating copy to clipboard operation
kerchunk copied to clipboard

`TypeError` and empty filenames when using re.Pattern functionality of MultiZarrToZarr’s `coo_mapper`

Open rbanick opened this issue 2 years ago • 3 comments

I’ve ran into and addressed two issues while using the regex pattern functionality of coo_mapper on MultiZarrToZarr. It’s unclear whether these reflect intentional design decisions so I’m opening an issue to clarify whether a PR implementing my fixes is warranted or not.

These problems were encountered using kerchunk version 0.0.6 on Python 3.8.4.

Problem statement I am preprocessing a collection of NetCDFs that were originally standalone rasters (BILs) and thus lack a time dimension. Fortunately, their dates are expressed as 8-digit strings in the filename, i.e. file_20210915.nc. Therefore I generate the time dimension using a Regex pattern with the coo_map setting on MultiZarrToZarr:

nc_list = glob.glob(input_dir, “*.nc4”)
json_list = []
for nc in nc_list:
            fs = fsspec.filesystem('file')
            fs_nc = fs.open(nc)
            with fs_nc as infile:
                json_list.append(SingleHdf5ToZarr(infile, fs_nc.path).translate())

pattern = re.compile(r'([0-9]{8})') # extract 8 digit YYYYYmmdd strings from file names
mzz_opts = dict(    identical_dims = [“lat’, “lon”],
                    coo_map = {"time" : pattern},
                    concat_dims = [“time”])
mzz = MultiZarrToZarr(json_list, **mzz_opts)
mzz.translate()

This returns the following error

    def _get_value(self, index, z, var, fn=None):
        """derive coordinate value(s) for given input dataset

        How to map from input to

        index: int
            Current place in the list of inputs
        z: zarr group
            Open for the current input
        var: str
            name of value to extract.
        fn: str
            filename
        """
        selector = self.coo_map[var]
        if isinstance(selector, collections.abc.Callable):
            o = selector(index, z, var, fn)
        elif isinstance(selector, list):
            o = selector[index]
        elif isinstance(selector, re.Pattern):
>           o = selector.match(fn).groups[0]
E           TypeError: expected string or bytes-like object

The problem appears to be the use of re.match instead of re.search — match always starts from the 0 index of an object whereas my datetime strings are at the end of variable length strings. I can resolve this error by amending combine.py line 154 from o = selector.match(fn).groups[0] to o = re.search(selector,fn)[0].

More problematically, I found that combine.py’s fss property generates an empty list of filenames. Replacing line 116 (self._paths = [None] * len(fo_list) as below resolves the issue:

if isinstance(self.path[0], collections.abc.Mapping):
    self._paths = []
    fo_list = self.path
    # self._paths = [None] * len(fo_list) # replaced line
    for path in self.path: # new line
        self._paths.append(path["templates"]["u"]) # new line

The problem appears to be kerchunk dict objects being recognized as collections.abc.Mapping instances. Reviewing the code, it’s unclear to me if this is intentional or not. If the latter I’m happy to open a PR.

A small collection of sample files used for this operation can be found at https://www.dropbox.com/sh/yzbqqkqampcjcqo/AACKPVnB9bojkF5QVRZihU3za?dl=0.

rbanick avatar Jun 29 '22 18:06 rbanick

Hi @rbanick,

It looks like the Kerchunk package you are using is a bit older as the bug from o = selector.match(fn).groups[0] has already been fixed by @martindurant in https://github.com/fsspec/kerchunk/commit/5cfa8871d86fe165213cc18fef042cf82258a573

Then MultiZarrtoZarr is intended to be used for combining existing Kerchunk jsons rather than directly from the .nc4 files and so it is neccessary to first compute these using SingleHdf5ToZarr before passing these jsons to MultiZarrtoZarr.

Here is a quick example using the data you provided:

https://gist.github.com/peterm790/78ecd58d994d127594166e207ffeb515

peterm790 avatar Jun 30 '22 09:06 peterm790

It looks like the Kerchunk package you are using is a bit older

0.0.6 (the most recent release) was created before the fix, so I guess this calls for a bugfix release? @martindurant, what do you think?

keewis avatar Jun 30 '22 10:06 keewis

@keewis good to hear the first issue is cleared up.

@peterm790 sorry my example wasn't clear -- the error occurs when providing a list of JSONs already processed via SingleHdf5ToZarr. I've amended the code to be clearer.

Is the second issue with an empty list of filenames resolved in the latest Kerchunk release?

rbanick avatar Jul 06 '22 14:07 rbanick

@rbanick I'm facing almost the same issue and, according to version 0.9, seems that the match is still in use. Did you succeded? did you created a workaround?

Right now this are my thoughts: Assuming a path like: '/tmp/20230101/CGL_TOC_202301010708_Africa.yaml'

the regex that works to catch the date and time within the namefile and is compatible with the re.match is: ' re.compile('^.+CGL_TOC_(.+?)_') (in your specific case could be something like ^.+file_(.+?). )

This works like a charm but, unfortunately, the string obtained isn't converted to a DateTime64.
The attempt to convert it to a DateTime64 through the coo_dtypes can't work as the value isn't stored in a POSIX format within the file name.

If there is enough interest in this one option could be to add a keyword for the coo_map like 'time_fn:' that gets, through the regex, the time in string format and convert it to a DateTime64. @martindurant What do you think? Is there any other solution that is already in place and I didn't discovered ?

FYI @annefou

pl-marasco avatar Jan 18 '23 09:01 pl-marasco

@pl-marasco Back in 2022 the replies to this issue indicated the re.match/search issue was already being addressed and a PR was not necessary. I made my own fork with the fix and have been using that in my production code, so I didn't pay further attention to this issue.

3 versions later re.match is still in use instead of re.search so I guess the issue was not addressed. @martindurant is there a reason re.match remains or shall I submit a PR? My visibility on Kerchunk's architecture primarily comes from breaking it repeatedly so I didn't feel comfortable PRing without input from maintainers.

FYI I also had to patch line 156 (self._paths = [None] * len(fo_list)) in my fork to address the second issue of empty file path names.

rbanick avatar Jan 18 '23 14:01 rbanick

Please yes, make a PR.

martindurant avatar Jan 18 '23 14:01 martindurant

@rbanick Thanks for the quick answer on this. Have you also created a method to convert the text to a DateTime64?

pl-marasco avatar Jan 18 '23 15:01 pl-marasco

Not beyond astype("datetime"). There are many ways, I suppose.

martindurant avatar Jan 18 '23 15:01 martindurant

@martindurant I'll make a PR for both issues later today

@pl-marasco I ran the below code in a preprocess_zarr function I provide to xr.open_dataset when opening the zarr JSON Kerchunk produces.

Note that the string format I specified was specific to the file name strings I was parsing (from VHI).

dti = pd.to_datetime(ds.time.values, format="%Y%m%d%H%M%S%f")
ds = ds.assign_coords({'time' : dti})

rbanick avatar Jan 18 '23 15:01 rbanick

@rbanick thanks, that definitely would work but I was wondering if a specific method that's doing the same within kerchunk.combine.MultiZarrToZarr would make more sense.

This could be trigged by a new "time: re.Pattern" within coo_map or modifying the "cf:{var}" and would need the format and the calendar type to be defined.

pl-marasco avatar Jan 18 '23 16:01 pl-marasco

Note that you can give arbitrary functions to coo_map too, but I understand why you wouldn't want to.

martindurant avatar Jan 18 '23 16:01 martindurant