intake-esm
intake server leads to ERROR: KeyError('xarray')
Description
I tried out the intake-server. In the end, I would like to have a server for some or all of the catalogs at:
https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/
The entry catalog is
dkrz_data-pool_cloudcatalog.yaml
I started the server from the same environment in which I ran the client commands. I installed intake_xarray as well.
Do you have an idea what the problem is? I saw that the remote catalog wants to use something like container: xarray, which I do not really understand. What is a container? Why xarray?
What I Did
cat > temp.yaml <<EOF
description: 'DKRZ master catalog for all data pool catalogs available'
plugins:
  source:
    - module: intake_esm
sources:
  dkrz_cmip6_cloud_zarr:
    args:
      esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
    description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
    driver:
      - intake.open_esm_datastore
EOF
intake-server temp.yaml 1>log 2>&1 &
intake list --full intake://localhost:8898
ERROR: KeyError('xarray')
cat log
2022-02-10 17:33:05,850 - intake - INFO - __main__.py:main:L53 - Creating catalog from:
2022-02-10 17:33:05,850 - intake - INFO - __main__.py:main:L55 - - temp.yaml
2022-02-10 17:33:06,509 - intake - INFO - __main__.py:main:L62 - catalog_args: temp.yaml
2022-02-10 17:33:06,509 - intake - INFO - __main__.py:main:L70 - Listening on localhost:8898
2022-02-10 17:33:06,509 - intake - DEBUG - server.py:__init__:L32 - auth: {'cls': 'intake.auth.base.BaseAuth'}
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L241 - Source POST: {'action': 'open', 'name': 'dkrz_cmip6_cloud_zarr', 'parameters': {}, 'available_plugins': ['yaml_file_cat', 'yaml_files_cat', 'netcdf', 'opendap', 'rasterio', 'remote-xarray', 'xarray_image', 'zarr', 'alias', 'catalog', 'csv', 'intake_remote', 'json', 'jsonl', 'ndzarr', 'numpy', 'textfiles', 'tiled', 'tiled_cat', 'zarr_cat', 'esm_datastore', 'esm_group', 'esm_single_source']}
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L302 - Opening entry <tzis_template catalog with 480 dataset(s) from 480 asset(s)>
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:add:L146 - Adding <tzis_template catalog with 480 dataset(s) from 480 asset(s)> to cache, uuid 329a34f5-4eb0-40de-a4e0-089b3a43e7e2
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L314 - Container: xarray, ID: 329a34f5-4eb0-40de-a4e0-089b3a43e7e2
ipython
import intake
import intake_esm
import intake_xarray
test=intake.open_catalog("intake://localhost:8898")
list(test)
Out[22]: ['dkrz_cmip6_cloud_zarr']
In [24]: test["dkrz_cmip6_cloud_zarr"]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [24], in <module>
----> 1 test["dkrz_cmip6_cloud_zarr"]
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/base.py:436, in Catalog.__getitem__(self, key)
427 """Return a catalog entry by name.
428
429 Can also use attribute syntax, like ``cat.entry_name``, or
(...)
432 cat['name1', 'name2']
433 """
434 if not isinstance(key, list) and key in self:
435 # triggers reload_on_change
--> 436 s = self._get_entry(key)
437 if s.container == 'catalog':
438 s.name = key
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/utils.py:45, in reload_on_change.<locals>.wrapper(self, *args, **kwargs)
42 @functools.wraps(f)
43 def wrapper(self, *args, **kwargs):
44 self.reload()
---> 45 return f(self, *args, **kwargs)
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/base.py:323, in Catalog._get_entry(self, name)
321 ups = [up for name, up in self.user_parameters.items() if name not in up_names]
322 entry._user_parameters = ups + (entry._user_parameters or [])
--> 323 return entry()
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/entry.py:77, in CatalogEntry.__call__(self, persist, **kwargs)
75 raise ValueError('Persist value (%s) not understood' % persist)
76 persist = persist or self._pmode
---> 77 s = self.get(**kwargs)
78 if persist != 'never' and isinstance(s, PersistMixin) and s.has_been_persisted:
79 from ..container.persist import store
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/remote.py:459, in RemoteCatalogEntry.get(self, **user_parameters)
457 http_args['headers'] = self.http_args['headers'].copy()
458 http_args['headers'].update(self.auth.get_headers())
--> 459 return open_remote(
460 self.url, self.name, container=self.container,
461 user_parameters=user_parameters, description=self.description,
462 http_args=http_args,
463 page_size=self._page_size,
464 auth=self.auth,
465 getenv=self.getenv,
466 persist_mode=self.catalog_pmode,
467 getshell=self.getshell)
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/remote.py:515, in open_remote(url, entry, container, user_parameters, description, http_args, page_size, persist_mode, auth, getenv, getshell)
506 if container == 'catalog':
507 response.update({'auth': auth,
508 'getenv': getenv,
509 'getshell': getshell,
(...)
513 # TODO storage_options?
514 })
--> 515 source = container_map[container](url, http_args, **response)
516 source.description = description
517 return source
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake_xarray/xarray_container.py:91, in RemoteXarray.__init__(self, url, headers, **kwargs)
78 """
79 Initialise local xarray, whose dask arrays contain tasks that pull data
80
(...)
88 server.
89 """
90 import xarray as xr
---> 91 super(RemoteXarray, self).__init__(url, headers, **kwargs)
92 self._schema = None
93 self._ds = xr.open_zarr(self.metadata['internal'])
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:44, in RemoteSource.__init__(self, url, headers, name, parameters, metadata, **kwargs)
42 self._source_id = None
43 self.metadata = metadata or {}
---> 44 self._get_source_id()
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:55, in RemoteSource._get_source_id(self)
53 req.raise_for_status()
54 response = msgpack.unpackb(req.content, **unpack_kwargs)
---> 55 self._parse_open_response(response)
File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:58, in RemoteSource._parse_open_response(self, response)
57 def _parse_open_response(self, response):
---> 58 dtype_descr = response['dtype']
59 if isinstance(dtype_descr, list):
60 # Reformat because NumPy needs list of tuples
61 dtype_descr = [tuple(x) for x in response['dtype']]
KeyError: 'dtype'
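To make the two errors easier to reason about, here is a minimal, self-contained sketch of the dispatch pattern visible in the traceback (`container_map[container](url, http_args, **response)` followed by `response['dtype']`). Everything in it is a hypothetical stand-in: `FakeRemoteXarray`, the toy `open_remote`, and the contents of `esm_like_response` are assumptions for illustration, not intake's real classes or the server's actual reply. One plausible reading is that the first `KeyError('xarray')` appears when no remote class is registered under the container name, and the second `KeyError: 'dtype'` appears when a class is registered but the server's response describes a catalog-like object without array fields.

```python
# Hypothetical stand-ins; these are NOT intake's real classes or registry.
container_map = {}


class FakeRemoteXarray:
    """Toy remote container that expects array-like fields in the response."""

    def __init__(self, url, **response):
        self.url = url
        # Mirrors the failing line in the traceback: response['dtype']
        self.dtype = response["dtype"]


def open_remote(url, container, response):
    # Mirrors `container_map[container](url, http_args, **response)`
    return container_map[container](url, **response)


# Assumed shape of a reply for a catalog-like source: no 'dtype' key.
esm_like_response = {"source_id": "329a34f5", "metadata": {}}


def try_open():
    """Return 'ok' or the repr of the KeyError raised while opening."""
    try:
        open_remote("intake://localhost:8898", "xarray", esm_like_response)
    except KeyError as err:
        return repr(err)
    return "ok"


first = try_open()  # nothing registered under 'xarray' yet
container_map["xarray"] = FakeRemoteXarray  # roughly what a plugin import would do
second = try_open()  # class found, but the response lacks 'dtype'

print(first)
print(second)
```

In this toy model, registering the container class only moves the failure one step further, which matches the observation that importing intake_xarray on the client changes the error rather than removing it.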
Version information: output of intake_esm.show_versions()
Paste the output of intake_esm.show_versions() here:
import intake_esm
intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.5.2
dask: 2022.01.1
fastprogress: 0.2.7
fsspec: 2022.01.0
gcsfs: None
intake: 0.6.5
intake_esm: 2021.8.17
netCDF4: 1.5.8
pandas: 1.3.5
requests: 2.27.1
s3fs: None
xarray: 0.21.1
zarr: 2.11.0
Thank you for the reproducible example, @wachsylon! I will look into this and will get back to you
@andersy005 thank you i appreciate it.
@andersy005 Have you found anything? It would be really helpful, as this is a blocker for me. Thanks in advance!
Thank you for your patience, @wachsylon! Unfortunately, I haven't had time to look into the root cause of this issue.
I tried out the intake-server. In the end, I would like to have a server for some or all cats of: swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm
I'm curious.... What are the benefits of exposing these catalogs via an intake-server instead of a regular top-level/main catalog?
I am imagining a top-level YAML file with the following contents. Users should be able to point intake to this main/parent catalog
description: 'DKRZ master catalog for all data pool catalogs available'
plugins:
  source:
    - module: intake_esm
sources:
  dkrz_cmip6_cloud_zarr:
    args:
      esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
    description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
    driver: intake_esm.esm_datastore
  another_catalog:
    args:
      esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
    description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
    driver: intake_esm.esm_datastore
In [18]: import intake
In [19]: cat = intake.open_catalog("temp.yaml")
In [20]: list(cat)
Out[20]: ['dkrz_cmip6_cloud_zarr']
In [21]: esmcat = cat["dkrz_cmip6_cloud_zarr"]
In [22]: esmcat.df.head()
Out[22]:
prefix ... version
0 CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1... ... v20181218
1 CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1... ... v20181218
2 CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1... ... v20181218
3 CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1... ... v20181218
4 CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1... ... v20181218
[5 rows x 12 columns]
@andersy005
The main use case for an intake server would be to return subsets of catalogs in case users cannot handle the memory. Our database for the CMIP6 catalog is about 4 PB, which gives me a .csv.gz list of about 400 MB. If this is loaded entirely, users quickly exceed the available memory.
I know that we could create a hierarchy of catalogs with finer-grained catalogs at the lower levels, but that may not fit many use cases. For example, in CMIP6, users are interested in several MIPs (= activity, e.g. ScenarioMIP or PMIP) at once. We could also wait for a STAC solution, but at MPI-M, intake is and will be used a lot anyway, so I would like to get the intake server to work. By the way, I also could not start an intake server for Pangeo.
If I understood correctly, the intake server can cache the catalog. Therefore, the server only loads the catalog once for many requests, correct? I can set up a VM which users can use to subset the catalog.
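As a rough illustration of the subsetting idea, independent of intake-server, the sketch below streams a gzipped CSV row by row and keeps only the rows for the requested activities, so the full table is never held in memory at once. It uses only the Python standard library. The function name `subset_catalog`, the column name `activity_id`, and the sample rows are assumptions made up for the example; they are not taken from the DKRZ catalog or from intake-esm.

```python
import csv
import gzip
import io


def subset_catalog(csv_gz_bytes, column, wanted):
    """Stream a gzipped CSV and keep only rows whose `column` value is in `wanted`."""
    wanted = set(wanted)
    # gzip.open accepts a file-like object; text mode lets csv read it lazily.
    with gzip.open(io.BytesIO(csv_gz_bytes), mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        return [row for row in reader if row[column] in wanted]


# Tiny made-up stand-in for a large catalog file (values are placeholders).
raw = (
    "activity_id,source_id,version\n"
    "CMIP,model-a,v1\n"
    "ScenarioMIP,model-a,v2\n"
    "PMIP,model-b,v3\n"
)
sample = gzip.compress(raw.encode("utf-8"))

rows = subset_catalog(sample, "activity_id", ["ScenarioMIP", "PMIP"])
print([row["activity_id"] for row in rows])
```

A service running on a VM could apply the same kind of filter before sending results back, so that clients only receive the rows they asked for.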
Thank you for the clarification/details, @wachsylon! I haven't used the intake-server before. From my short experimentation, it appears that there are some things that are missing within intake-esm to allow seamless integration with the intake-server. I'll do my best to find time to look into this today/tomorrow.
Any news on that?
some things that are missing within intake-esm to allow seamless integration with the intake-server
sounds bad :(
Serving intake-esm catalog and assets via intake-server will require rewriting some of the components of intake-esm. Unfortunately, my schedule is too tight and I don't have time to look into this extensively any time soon but I'd be happy to review pull requests if someone is interested in pursuing this...
Ok thanks for letting me know! And thanks a lot for your help!