
Keeping a list of CMIP6 datasets for which the basic tutorial recipe does not work

naomi-henderson opened this issue on Apr 16 '21 · 17 comments

It would be very helpful if those using, or attempting to use, the CMIP6 recipe would add their list of failures to this issue.

Some of these failures will not be fixable, but some will just need simple pre-processing or added xarray kwargs.

Hopefully we can keep track of them here.

naomi-henderson · Apr 16 '21

Here is a failure:

dataset = 'IPSL-CM6A-LR.abrupt-4xCO2.r1i1p1f1.Lmon.cLeaf.gr'

What happens?

The netCDF files cannot be opened without specifying decode_coords=False, whether in a plain xr.open_dataset() call or in the recipe itself. For example:

import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)  # anonymous access to the esgf-world bucket
file_url = fs_s3.open(input_urls[0], mode='rb')  # input_urls: the recipe's source file URLs
ds = xr.open_dataset(file_url)
ds

gives the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-ad8e2cbfca11> in <module>
      3 
      4 file_url = fs_s3.open(input_urls[0], mode='rb')
----> 5 ds = xr.open_dataset(file_url)
      6 ds

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    555 
    556     with close_on_error(store):
--> 557         ds = maybe_decode_store(store, chunks)
    558 
    559     # Ensure source filename always stored in dataset object (GH issue #2550)

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/api.py in maybe_decode_store(store, chunks)
    451 
    452     def maybe_decode_store(store, chunks):
--> 453         ds = conventions.decode_cf(
    454             store,
    455             mask_and_scale=mask_and_scale,

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    644         raise TypeError("can only decode Dataset or DataStore objects")
    645 
--> 646     vars, attrs, coord_names = decode_cf_variables(
    647         vars,
    648         attrs,

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    524             if "coordinates" in var_attrs:
    525                 coord_str = var_attrs["coordinates"]
--> 526                 var_coord_names = coord_str.split()
    527                 if all(k in variables for k in var_coord_names):
    528                     new_vars[k].encoding["coordinates"] = coord_str

AttributeError: 'Empty' object has no attribute 'split'

Attempted Fix:

Adding xarray_open_kwargs = {'decode_coords': False} to the recipe gets past this problem.
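For reference, here is roughly how that kwarg plugs into the recipe. This is a sketch only: the class and argument names are assumed from the pangeo_forge version shown in the tracebacks below and may differ in other releases.

from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,                        # S3 URLs as above
    sequence_dim='time',                          # dimension to concatenate along
    xarray_open_kwargs={'decode_coords': False},  # skip coordinate decoding
)

But then, caching the input data: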

for input_name in recipe.iter_inputs():
    recipe.cache_input(input_name)

throws another error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-1fbdc3dc8329> in <module>
      4     # if caching the input:  (must also use if metadata_caching?)
      5     for input_name in recipe.iter_inputs():
----> 6         recipe.cache_input(input_name)
      7 
      8     # use recipe to create the zarr store:

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/pangeo_forge/recipe.py in cache_func(input_key)
    339             _copy_btw_filesystems(input_opener, target_opener)
    340             if self._cache_metadata:
--> 341                 self.cache_input_metadata(input_key)
    342 
    343         return cache_func

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/pangeo_forge/recipe.py in cache_input_metadata(self, input_key)
    444             metadata = ds.to_dict(data=False)
    445             mapper = self.metadata_cache.get_mapper()
--> 446             mapper[_input_metadata_fname(input_key)] = json.dumps(metadata).encode("utf-8")
    447 
    448     @contextmanager

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    229         cls is None and indent is None and separators is None and
    230         default is None and not sort_keys and not kw):
--> 231         return _default_encoder.encode(obj)
    232     if cls is None:
    233         cls = JSONEncoder

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/encoder.py in default(self, o)
    177 
    178         """
--> 179         raise TypeError(f'Object of type {o.__class__.__name__} '
    180                         f'is not JSON serializable')
    181 

TypeError: Object of type Empty is not JSON serializable

naomi-henderson · Apr 16 '21

Naomi, do you understand what it is about this dataset that is causing the error?

rabernat · Apr 16 '21

Actually, no, it is puzzling. It looks perfectly normal when using decode_coords=False:

<xarray.Dataset>
Dimensions:      (axis_nbounds: 2, lat: 143, lon: 144, time: 3600)
Coordinates:
  * lat          (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
  * lon          (lon) float32 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
  * time         (time) datetime64[ns] 1850-01-16T12:00:00 1850-02-15 ... NaT
Dimensions without coordinates: axis_nbounds
Data variables:
    time_bounds  (time, axis_nbounds) datetime64[ns] ...
    cLeaf        (time, lat, lon) float32 ...
Attributes: (12/51)
    name:                   /ccc/work/cont003/gencmip6/p86maf/IGCM_OUT/IPSLCM...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2018-05-03T13:17:14Z
    description:            DECK: abrupt-4xCO2
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / CM...
    activity_id:            CMIP
    ...                     ...
    EXPID:                  abrupt-4xCO2
    CMIP6_CV_version:       cv=6.2.3.5-2-g63b123e
    dr2xml_md5sum:          00e1a4f623b35a33620b9828c66bd1c8
    model_version:          6.1.2
    tracking_id:            hdl:21.14100/ac30efbd-7121-4ee5-9fa8-c98255426b74
    history:                Thu Jul 26 18:44:23 2018: ncatted -O -a tracking_...

and it seems fine when downloading the file and then opening the local copy with xr.open_dataset() (without the decode_coords=False)! So the problem occurs only with the S3 copy, when opening through s3fs.S3FileSystem. I will try a direct open, if their opendap server is working.

Here is the link to the file directly from ESGF: cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc

naomi-henderson · Apr 16 '21

... and it opens fine using the opendap URL:

url = 'https://aws-cloudnode.esgfed.org/thredds/dodsC/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
ds = xr.open_dataset(url)

so the only problem is with:

fs_s3 = s3fs.S3FileSystem(anon=True)

s3_url = 's3://esgf-world/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
file_url = fs_s3.open(s3_url, mode='rb')
ds = xr.open_dataset(file_url) 

naomi-henderson · Apr 16 '21

So it sounds like we have found an s3fs bug? pinging @martindurant, our fsspec superhero 🦸

rabernat · Apr 16 '21

Just to help get to the bottom of it, could you try the following?

ds1 = xr.open_dataset(opendap_url, decode_coords=False).load()
ds2 = xr.open_dataset(s3_file, decode_coords=False).load()
ds1.identical(ds2)

This should help surface whatever tiny difference is behind this problem.

rabernat · Apr 16 '21

Logging from s3fs shows nothing more than a bunch of range requests for the given file; I don't see anything immediately suspicious.

martindurant · Apr 16 '21

ds1.identical(ds2) returns False, but the only difference in the basic information is an extra attribute in the opendap example, ds1: DODS_EXTRA.Unlimited_Dimension: time.

import s3fs
import xarray as xr

opendap_url = 'https://aws-cloudnode.esgfed.org/thredds/dodsC/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
fs_s3 = s3fs.S3FileSystem(anon=True)
s3_url = 's3://esgf-world/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
s3_file = fs_s3.open(s3_url, mode='rb')

ds1 = xr.open_dataset(opendap_url, decode_coords=False).load()
ds2 = xr.open_dataset(s3_file, decode_coords=False).load()
ds1.identical(ds2)
False

print(ds1)

<xarray.Dataset>
Dimensions:      (axis_nbounds: 2, lat: 143, lon: 144, time: 3600)
Coordinates:
  * lat          (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
  * lon          (lon) float32 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
  * time         (time) datetime64[ns] 1850-01-16T12:00:00 1850-02-15 ... NaT
Dimensions without coordinates: axis_nbounds
Data variables:
    time_bounds  (time, axis_nbounds) datetime64[ns] 1850-01-01 ... NaT
    cLeaf        (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (12/52)
    name:                            /ccc/work/cont003/gencmip6/p86maf/IGCM_O...
    Conventions:                     CF-1.7 CMIP-6.2
    creation_date:                   2018-05-03T13:17:14Z
    description:                     DECK: abrupt-4xCO2
    title:                           IPSL-CM6A-LR model output prepared for C...
    activity_id:                     CMIP
    ...                              ...
    CMIP6_CV_version:                cv=6.2.3.5-2-g63b123e
    dr2xml_md5sum:                   00e1a4f623b35a33620b9828c66bd1c8
    model_version:                   6.1.2
    tracking_id:                     hdl:21.14100/ac30efbd-7121-4ee5-9fa8-c98...
    history:                         Thu Jul 26 18:44:23 2018: ncatted -O -a ...
    DODS_EXTRA.Unlimited_Dimension:  time
print(ds2)

<xarray.Dataset>
Dimensions:      (axis_nbounds: 2, lat: 143, lon: 144, time: 3600)
Coordinates:
  * lat          (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
  * lon          (lon) float32 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
  * time         (time) datetime64[ns] 1850-01-16T12:00:00 1850-02-15 ... NaT
Dimensions without coordinates: axis_nbounds
Data variables:
    time_bounds  (time, axis_nbounds) datetime64[ns] 1850-01-01 ... NaT
    cLeaf        (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (12/51)
    name:                   /ccc/work/cont003/gencmip6/p86maf/IGCM_OUT/IPSLCM...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2018-05-03T13:17:14Z
    description:            DECK: abrupt-4xCO2
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / CM...
    activity_id:            CMIP
    ...                     ...
    EXPID:                  abrupt-4xCO2
    CMIP6_CV_version:       cv=6.2.3.5-2-g63b123e
    dr2xml_md5sum:          00e1a4f623b35a33620b9828c66bd1c8
    model_version:          6.1.2
    tracking_id:            hdl:21.14100/ac30efbd-7121-4ee5-9fa8-c98255426b74
    history:                Thu Jul 26 18:44:23 2018: ncatted -O -a tracking_...

naomi-henderson · Apr 16 '21

The problem is with the coordinates attribute:

ds2c = ds2.copy()
del ds2c.cLeaf.attrs['coordinates']  # drop the problematic attribute
xr.decode_cf(ds2c)                   # decode_cf now works

rabernat · Apr 16 '21

type(ds2.cLeaf.attrs["coordinates"])
>>> h5py._hl.base.Empty

This looks like an h5py issue actually.

rabernat · Apr 16 '21

I got to the bottom of this in the xarray issue above.

It actually has nothing to do with fsspec. It's a difference between the netCDF4 engine and the h5netcdf engine in how they handle existing but empty attributes.
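To make the difference concrete, here is a minimal sketch that opens a local copy of the file (filename hypothetical) with each engine and inspects the attribute type; the comments reflect what this thread observed for h5netcdf versions before 0.11.

import xarray as xr

# 'local_copy.nc' is a hypothetical local download of the file above
ds_nc4 = xr.open_dataset('local_copy.nc', engine='netcdf4', decode_coords=False)
ds_h5 = xr.open_dataset('local_copy.nc', engine='h5netcdf', decode_coords=False)

print(type(ds_nc4.cLeaf.attrs.get('coordinates')))  # a decodable value; decode_cf works
print(type(ds_h5.cLeaf.attrs.get('coordinates')))   # h5py._hl.base.Empty, which has no .split()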

One way to solve this would be to use copy_input_to_local_file (see #87). Once you have a local file, xarray will default to engine='netcdf4'.
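Sketched against the same hypothetical recipe constructor as in the earlier sketch (the flag name comes from this comment; see #87 for the actual API):

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim='time',
    copy_input_to_local_file=True,  # copy each input to a local temp file before opening
)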

rabernat · Apr 16 '21

That is a fix for the recipe, perhaps, but it also does not work when I just want to xr.open_dataset() an S3 file-like object directly (to find a good chunk size, for example). This is why I was using the opendap urls before (but their opendap server has not been behaving well...). How does copy_input_to_local_file manage this? Sorry, perhaps I am being dense here.

naomi-henderson · Apr 16 '21

Currently only h5netcdf can open files efficiently over http / s3 via fsspec. Your example has uncovered a bug in h5netcdf (https://github.com/pydata/xarray/issues/5172). Therefore, you will not be able to do what you want (open the file remotely) until that bug is fixed.

rabernat · Apr 16 '21

Another workaround would be to use a preprocessing function to drop the weird coordinates attribute. That is what is causing decode_coords to fail.
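A minimal sketch of such a preprocessing function (the hook for attaching it to a recipe varies by version, so only manual application is shown; the function name is hypothetical):

import xarray as xr

def drop_bad_coordinates_attr(ds: xr.Dataset) -> xr.Dataset:
    # Remove the 'coordinates' attribute (here, the malformed empty one
    # on cLeaf) so that xr.decode_cf can run.
    ds = ds.copy()
    for var in ds.variables.values():
        var.attrs.pop('coordinates', None)
    return ds

# applied by hand to a dataset opened with decode_coords=False:
ds_fixed = xr.decode_cf(drop_bad_coordinates_attr(ds2))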

rabernat · Apr 16 '21

Yes, I like that suggestion. It will be easy to document in our future Ag-style database of pre-concatenation issues and fixes.

naomi-henderson · Apr 16 '21

According to this comment, the underlying issue discussed above should be resolved with h5netcdf >= 0.11.0.

Here's a notebook I ran today for dataset = 'IPSL-CM6A-LR.abrupt-4xCO2.r1i1p1f1.Lmon.cLeaf.gr' with h5netcdf == 0.11.0: https://gist.github.com/cisaacstern/8660abfb204412dc447e28c6f50dd279. The full output of xr.show_versions() is provided in cell 19.

@naomi-henderson, does it seem like the above-discussed issue is indeed resolved, or have I overlooked something?

cisaacstern · May 10 '21

@cisaacstern, yes, it looks like this issue is resolved with the newer h5netcdf. Thanks. I would like to keep this issue open so we can continue to collect examples of datasets with issues that can't yet be handled.

Nice work with the CMIP6 tutorial, by the way. The CSV file actually has a column labeled temporal subset, which could be used directly to get your keys argument for ConcatDim.
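For instance, a rough sketch of that idea (the CSV path and exact column name are assumptions; ConcatDim lives in pangeo_forge_recipes.patterns):

import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim

df = pd.read_csv('cmip6_catalog.csv')       # hypothetical catalog CSV
time_keys = df['temporal subset'].tolist()  # e.g. '185001-214912' per row
time_dim = ConcatDim('time', keys=time_keys)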

naomi-henderson · May 12 '21