pangeo-forge-recipes
Keeping a list of CMIP6 datasets for which the basic tutorial recipe does not work
It would be very helpful if those using, or attempting to use, the CMIP6 recipe would add their list of failures to this issue.
Some of these failures will not be fixable, but some will just need simple pre-processing or added xarray kwargs.
Hopefully we can keep track of them here.
Here is a failure:
dataset = 'IPSL-CM6A-LR.abrupt-4xCO2.r1i1p1f1.Lmon.cLeaf.gr'
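(For context, a sketch of how the input_urls used below can be built by listing this dataset's files on the public esgf-world bucket; the glob pattern is inferred from the S3 path quoted later in this thread.)
import s3fs

fs_s3 = s3fs.S3FileSystem(anon=True)  # anonymous access to the public bucket
# path inferred from the s3_url that appears further down in this thread
input_urls = sorted(
    's3://' + path
    for path in fs_s3.glob(
        'esgf-world/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/'
        'r1i1p1f1/Lmon/cLeaf/gr/v20190118/*.nc'
    )
)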
What happens?
The netCDF files cannot be opened without specifying decode_coords=False, either in a simple xr.open_dataset() call or in the recipe itself. For example:
import xarray as xr

file_url = fs_s3.open(input_urls[0], mode='rb')  # file-like object for the first input
ds = xr.open_dataset(file_url)
ds
gives the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-ad8e2cbfca11> in <module>
3
4 file_url = fs_s3.open(input_urls[0], mode='rb')
----> 5 ds = xr.open_dataset(file_url)
6 ds
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
555
556 with close_on_error(store):
--> 557 ds = maybe_decode_store(store, chunks)
558
559 # Ensure source filename always stored in dataset object (GH issue #2550)
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/api.py in maybe_decode_store(store, chunks)
451
452 def maybe_decode_store(store, chunks):
--> 453 ds = conventions.decode_cf(
454 store,
455 mask_and_scale=mask_and_scale,
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
644 raise TypeError("can only decode Dataset or DataStore objects")
645
--> 646 vars, attrs, coord_names = decode_cf_variables(
647 vars,
648 attrs,
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
524 if "coordinates" in var_attrs:
525 coord_str = var_attrs["coordinates"]
--> 526 var_coord_names = coord_str.split()
527 if all(k in variables for k in var_coord_names):
528 new_vars[k].encoding["coordinates"] = coord_str
AttributeError: 'Empty' object has no attribute 'split'
Attempted Fix:
Adding xarray_open_kwargs = {'decode_coords': False} to the recipe gets past this problem.
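For reference, a minimal sketch of where that kwarg goes, assuming the NetCDFtoZarrSequentialRecipe class and parameter names from the tutorial-era pangeo-forge API (treat the exact names as assumptions):
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

# sketch only: class/parameter names assumed from the tutorial API
recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim='time',
    inputs_per_chunk=1,
    xarray_open_kwargs={'decode_coords': False},
)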
But then, caching the input data:
for input_name in recipe.iter_inputs():
recipe.cache_input(input_name)
throws another error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-1fbdc3dc8329> in <module>
4 # if caching the input: (must also use if metadata_caching?)
5 for input_name in recipe.iter_inputs():
----> 6 recipe.cache_input(input_name)
7
8 # use recipe to create the zarr store:
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/pangeo_forge/recipe.py in cache_func(input_key)
339 _copy_btw_filesystems(input_opener, target_opener)
340 if self._cache_metadata:
--> 341 self.cache_input_metadata(input_key)
342
343 return cache_func
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/pangeo_forge/recipe.py in cache_input_metadata(self, input_key)
444 metadata = ds.to_dict(data=False)
445 mapper = self.metadata_cache.get_mapper()
--> 446 mapper[_input_metadata_fname(input_key)] = json.dumps(metadata).encode("utf-8")
447
448 @contextmanager
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
229 cls is None and indent is None and separators is None and
230 default is None and not sort_keys and not kw):
--> 231 return _default_encoder.encode(obj)
232 if cls is None:
233 cls = JSONEncoder
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/json/encoder.py in default(self, o)
177
178 """
--> 179 raise TypeError(f'Object of type {o.__class__.__name__} '
180 f'is not JSON serializable')
181
TypeError: Object of type Empty is not JSON serializable
Naomi, do you understand what it is about this dataset that is causing the error?
Actually, no, it is puzzling. It looks perfectly normal when using decode_coords=False:
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, lat: 143, lon: 144, time: 3600)
Coordinates:
* lat (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
* lon (lon) float32 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
* time (time) datetime64[ns] 1850-01-16T12:00:00 1850-02-15 ... NaT
Dimensions without coordinates: axis_nbounds
Data variables:
time_bounds (time, axis_nbounds) datetime64[ns] ...
cLeaf (time, lat, lon) float32 ...
Attributes: (12/51)
name: /ccc/work/cont003/gencmip6/p86maf/IGCM_OUT/IPSLCM...
Conventions: CF-1.7 CMIP-6.2
creation_date: 2018-05-03T13:17:14Z
description: DECK: abrupt-4xCO2
title: IPSL-CM6A-LR model output prepared for CMIP6 / CM...
activity_id: CMIP
... ...
EXPID: abrupt-4xCO2
CMIP6_CV_version: cv=6.2.3.5-2-g63b123e
dr2xml_md5sum: 00e1a4f623b35a33620b9828c66bd1c8
model_version: 6.1.2
tracking_id: hdl:21.14100/ac30efbd-7121-4ee5-9fa8-c98255426b74
history: Thu Jul 26 18:44:23 2018: ncatted -O -a tracking_...
and it seems fine when downloading the file and then opening the local copy with xr.open_dataset() (without decode_coords=False)! So the problem occurs only with the S3 copy, opened through s3fs.S3FileSystem. I will try a direct open, if their opendap server is working.
Here is the link to the file directly from esgf: cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc
... and it opens fine using the opendap url:
url = 'https://aws-cloudnode.esgfed.org/thredds/dodsC/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
ds = xr.open_dataset(url)
so the only problem is with:
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)
s3_url = 's3://esgf-world/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
file_url = fs_s3.open(s3_url, mode='rb')
ds = xr.open_dataset(file_url)
So it sounds like we have found an s3fs bug? pinging @martindurant, our fsspec superhero 🦸
Just to help get to the bottom of it, could you try the following?
ds1 = xr.open_dataset(opendap_url, decode_coords=False).load()
ds2 = xr.open_dataset(s3_file, decode_coords=False).load()
ds1.identical(ds2)
This should help surface whatever tiny difference is behind this problem.
Logging from s3fs shows nothing more than a bunch of range requests for the given file; I don't see anything immediately suspicious.
ds1.identical(ds2) returns False, but the only difference in the basic information is an extra attribute in the opendap example, ds1: DODS_EXTRA.Unlimited_Dimension: time.
opendap_url = 'https://aws-cloudnode.esgfed.org/thredds/dodsC/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
fs_s3 = s3fs.S3FileSystem(anon=True)
s3_url = 's3://esgf-world/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r1i1p1f1/Lmon/cLeaf/gr/v20190118/cLeaf_Lmon_IPSL-CM6A-LR_abrupt-4xCO2_r1i1p1f1_gr_185001-214912.nc'
s3_file = fs_s3.open(s3_url, mode='rb')
ds1 = xr.open_dataset(opendap_url, decode_coords=False).load()
ds2 = xr.open_dataset(s3_file, decode_coords=False).load()
ds1.identical(ds2)
False
print(ds1)
<xarray.Dataset>
Dimensions:      (axis_nbounds: 2, lat: 143, lon: 144, time: 3600)
Coordinates:
  * lat          (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
  * lon          (lon) float32 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
  * time         (time) datetime64[ns] 1850-01-16T12:00:00 1850-02-15 ... NaT
Dimensions without coordinates: axis_nbounds
Data variables:
    time_bounds  (time, axis_nbounds) datetime64[ns] 1850-01-01 ... NaT
    cLeaf        (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (12/52)
    name:                            /ccc/work/cont003/gencmip6/p86maf/IGCM_O...
    Conventions:                     CF-1.7 CMIP-6.2
    creation_date:                   2018-05-03T13:17:14Z
    description:                     DECK: abrupt-4xCO2
    title:                           IPSL-CM6A-LR model output prepared for C...
    activity_id:                     CMIP
    ...                              ...
    CMIP6_CV_version:                cv=6.2.3.5-2-g63b123e
    dr2xml_md5sum:                   00e1a4f623b35a33620b9828c66bd1c8
    model_version:                   6.1.2
    tracking_id:                     hdl:21.14100/ac30efbd-7121-4ee5-9fa8-c98...
    history:                         Thu Jul 26 18:44:23 2018: ncatted -O -a ...
    DODS_EXTRA.Unlimited_Dimension:  time
print(ds2)
<xarray.Dataset>
Dimensions:      (axis_nbounds: 2, lat: 143, lon: 144, time: 3600)
Coordinates:
  * lat          (lat) float32 -90.0 -88.73 -87.46 -86.2 ... 87.46 88.73 90.0
  * lon          (lon) float32 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
  * time         (time) datetime64[ns] 1850-01-16T12:00:00 1850-02-15 ... NaT
Dimensions without coordinates: axis_nbounds
Data variables:
    time_bounds  (time, axis_nbounds) datetime64[ns] 1850-01-01 ... NaT
    cLeaf        (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (12/51)
    name:              /ccc/work/cont003/gencmip6/p86maf/IGCM_OUT/IPSLCM...
    Conventions:       CF-1.7 CMIP-6.2
    creation_date:     2018-05-03T13:17:14Z
    description:       DECK: abrupt-4xCO2
    title:             IPSL-CM6A-LR model output prepared for CMIP6 / CM...
    activity_id:       CMIP
    ...                ...
    EXPID:             abrupt-4xCO2
    CMIP6_CV_version:  cv=6.2.3.5-2-g63b123e
    dr2xml_md5sum:     00e1a4f623b35a33620b9828c66bd1c8
    model_version:     6.1.2
    tracking_id:       hdl:21.14100/ac30efbd-7121-4ee5-9fa8-c98255426b74
    history:           Thu Jul 26 18:44:23 2018: ncatted -O -a tracking_...
The problem is with the coordinates attribute:
ds2c = ds2.copy()
del ds2c.cLeaf.attrs['coordinates']  # drop the offending attribute
xr.decode_cf(ds2c)                   # now decodes without error

type(ds2.cLeaf.attrs["coordinates"])
>>> h5py._hl.base.Empty
This looks like an h5py issue actually.
I got to the bottom of this in the xarray issue linked below.
It actually has nothing to do with fsspec. It's a difference between the netCDF4 engine and the h5netcdf engine in how they handle existing but empty attributes.
One way to solve this would be to use copy_input_to_local_file (see #87). Once you have a local file, xarray will default to engine='netCDF4'.
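For reference, a sketch of how that might look, assuming copy_input_to_local_file is a boolean option on the recipe constructor per the discussion in #87 (the exact spelling and placement are assumptions):
# assumption: copy_input_to_local_file is a boolean recipe option (see #87)
recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim='time',
    copy_input_to_local_file=True,  # cache each input locally, then open with the default netCDF4 engine
)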
That is a fix for the recipe, perhaps, but when I just want to call xr.open_dataset() on an S3 file-like object (in order to find a good chunk size, for example), it also does not work. This is why I was using the opendap URLs before (but their opendap server has not been behaving well...). How does copy_input_to_local_file manage this? Sorry, perhaps I am being dense here.
Currently only h5netcdf can open files efficiently over http / s3 via fsspec. Your example has uncovered a bug in h5netcdf (https://github.com/pydata/xarray/issues/5172). Therefore, you will not be able to do what you want (open the file remotely) until that bug is fixed.
Another workaround would be to use a preprocessing function to drop the weird coordinates attribute. That is what is causing decode_coords to fail.
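For example, something along these lines, assuming the recipe exposes a preprocessing hook that receives the opened dataset (the hook name process_input below is an assumption; the function itself is plain xarray):
import xarray as xr

def drop_empty_coordinates_attr(ds: xr.Dataset) -> xr.Dataset:
    # remove the empty 'coordinates' attribute that makes decode_coords choke
    for name in ds.variables:
        ds[name].attrs.pop('coordinates', None)
    return ds

# hypothetical wiring: pass the function via the recipe's preprocessing hook
# recipe = NetCDFtoZarrSequentialRecipe(..., process_input=drop_empty_coordinates_attr)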
Yes, I like that suggestion. It will be easy to document in our future Ag-style database of pre-concatenation issues and fixes.
According to this comment, the underlying issue discussed above should be resolved with h5netcdf >= 0.11.0.
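A quick way to confirm the installed version before retrying (nothing pangeo-forge-specific here):
import h5netcdf
print(h5netcdf.__version__)  # want >= 0.11.0 for the empty-attribute fix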
Here's a notebook I ran today for dataset = 'IPSL-CM6A-LR.abrupt-4xCO2.r1i1p1f1.Lmon.cLeaf.gr' with h5netcdf == 0.11.0: https://gist.github.com/cisaacstern/8660abfb204412dc447e28c6f50dd279. The full output of xr.show_versions() is provided in cell 19.
@naomi-henderson, does it seem like the above-discussed issue is indeed resolved, or have I overlooked something?
@cisaacstern , yes, it looks like this issue is resolved with the newer h5netcdf. Thanks. I would like to keep this issue open so we can continue to find examples of datasets with issues which can't be handled yet.
Nice work with the CMIP6 tutorial, by the way. The CSV file actually has a column labeled 'temporal subset' which could be used directly to get your keys argument for ConcatDim.
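For instance, a sketch of that idea with hypothetical names (catalog_csv_url and the dataset_id column are assumptions; 'temporal subset' is the column mentioned above, and the ConcatDim import path may differ by version):
import pandas as pd
from pangeo_forge.patterns import ConcatDim  # import path is an assumption

df = pd.read_csv(catalog_csv_url)  # hypothetical URL of the CMIP6 catalog CSV
rows = df[df['dataset_id'] == 'IPSL-CM6A-LR.abrupt-4xCO2.r1i1p1f1.Lmon.cLeaf.gr']
keys = sorted(rows['temporal subset'])  # e.g. ['185001-214912', ...]
time_dim = ConcatDim('time', keys=keys)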