intake-esm icon indicating copy to clipboard operation
intake-esm copied to clipboard

intake server leads to ERROR: KeyError('xarray')

Open wachsylon opened this issue 3 years ago • 9 comments

Description

I tried out the intake-server. In the end, I would like to have a server for some or all cats of: https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/ The entry catalog is dkrz_data-pool_cloudcatalog.yaml

I started the server from the same environment as from where I did client commands. I installed intake_xarray as well.

Do you have an idea what the problem is? I saw that the remote catalog wants to use sth like container: xarray what I do not really understand. What is a container? Why xarray?

What I Did

cat > temp.yaml <<EOF
description: 'DKRZ master catalog for all data pool catalogs available'
plugins:
  source:
    - module: intake_esm

sources:
  dkrz_cmip6_cloud_zarr:
    args:
      esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
    description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
    driver:
    - intake.open_esm_datastore
>EOF
intake-server temp.yaml 1>log 2>&1 &
intake list --full intake://localhost:8898 
ERROR: KeyError('xarray')

cat log
2022-02-10 17:33:05,850 - intake - INFO - __main__.py:main:L53 - Creating catalog from:
2022-02-10 17:33:05,850 - intake - INFO - __main__.py:main:L55 -   - temp.yaml
2022-02-10 17:33:06,509 - intake - INFO - __main__.py:main:L62 - catalog_args: temp.yaml
2022-02-10 17:33:06,509 - intake - INFO - __main__.py:main:L70 - Listening on localhost:8898
2022-02-10 17:33:06,509 - intake - DEBUG - server.py:__init__:L32 - auth: {'cls': 'intake.auth.base.BaseAuth'}
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L241 - Source POST: {'action': 'open', 'name': 'dkrz_cmip6_cloud_zarr', 'parameters': {}, 'available_plugins': ['yaml_file_cat', 'yaml_files_cat', 'netcdf', 'opendap', 'rasterio', 'remote-xarray', 'xarray_image', 'zarr', 'alias', 'catalog', 'csv', 'intake_remote', 'json', 'jsonl', 'ndzarr', 'numpy', 'textfiles', 'tiled', 'tiled_cat', 'zarr_cat', 'esm_datastore', 'esm_group', 'esm_single_source']}
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L302 - Opening entry <tzis_template catalog with 480 dataset(s) from 480 asset(s)>
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:add:L146 - Adding <tzis_template catalog with 480 dataset(s) from 480 asset(s)> to cache, uuid 329a34f5-4eb0-40de-a4e0-089b3a43e7e2
2022-02-10 17:33:38,289 - intake - DEBUG - server.py:post:L314 - Container: xarray, ID: 329a34f5-4eb0-40de-a4e0-089b3a43e7e2

ipython
import intake
import intake_esm
import intake_xarray
test=intake.open_catalog("intake://localhost:8898")
list(test)
Out[22]: ['dkrz_cmip6_cloud_zarr']

In [24]: test["dkrz_cmip6_cloud_zarr"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [24], in <module>
----> 1 test["dkrz_cmip6_cloud_zarr"]

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/base.py:436, in Catalog.__getitem__(self, key)
    427 """Return a catalog entry by name.
    428 
    429 Can also use attribute syntax, like ``cat.entry_name``, or
   (...)
    432 cat['name1', 'name2']
    433 """
    434 if not isinstance(key, list) and key in self:
    435     # triggers reload_on_change
--> 436     s = self._get_entry(key)
    437     if s.container == 'catalog':
    438         s.name = key

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/utils.py:45, in reload_on_change.<locals>.wrapper(self, *args, **kwargs)
     42 @functools.wraps(f)
     43 def wrapper(self, *args, **kwargs):
     44     self.reload()
---> 45     return f(self, *args, **kwargs)

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/base.py:323, in Catalog._get_entry(self, name)
    321 ups = [up for name, up in self.user_parameters.items() if name not in up_names]
    322 entry._user_parameters = ups + (entry._user_parameters or [])
--> 323 return entry()

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/entry.py:77, in CatalogEntry.__call__(self, persist, **kwargs)
     75     raise ValueError('Persist value (%s) not understood' % persist)
     76 persist = persist or self._pmode
---> 77 s = self.get(**kwargs)
     78 if persist != 'never' and isinstance(s, PersistMixin) and s.has_been_persisted:
     79     from ..container.persist import store

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/remote.py:459, in RemoteCatalogEntry.get(self, **user_parameters)
    457 http_args['headers'] = self.http_args['headers'].copy()
    458 http_args['headers'].update(self.auth.get_headers())
--> 459 return open_remote(
    460     self.url, self.name, container=self.container,
    461     user_parameters=user_parameters, description=self.description,
    462     http_args=http_args,
    463     page_size=self._page_size,
    464     auth=self.auth,
    465     getenv=self.getenv,
    466     persist_mode=self.catalog_pmode,
    467     getshell=self.getshell)

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/catalog/remote.py:515, in open_remote(url, entry, container, user_parameters, description, http_args, page_size, persist_mode, auth, getenv, getshell)
    506     if container == 'catalog':
    507         response.update({'auth': auth,
    508                          'getenv': getenv,
    509                          'getshell': getshell,
   (...)
    513                          # TODO storage_options?
    514                          })
--> 515     source = container_map[container](url, http_args, **response)
    516 source.description = description
    517 return source

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake_xarray/xarray_container.py:91, in RemoteXarray.__init__(self, url, headers, **kwargs)
     78 """
     79 Initialise local xarray, whose dask arrays contain tasks that pull data
     80 
   (...)
     88 server.
     89 """
     90 import xarray as xr
---> 91 super(RemoteXarray, self).__init__(url, headers, **kwargs)
     92 self._schema = None
     93 self._ds = xr.open_zarr(self.metadata['internal'])

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:44, in RemoteSource.__init__(self, url, headers, name, parameters, metadata, **kwargs)
     42 self._source_id = None
     43 self.metadata = metadata or {}
---> 44 self._get_source_id()

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:55, in RemoteSource._get_source_id(self)
     53 req.raise_for_status()
     54 response = msgpack.unpackb(req.content, **unpack_kwargs)
---> 55 self._parse_open_response(response)

File ~/.conda/envs/appmode/lib/python3.10/site-packages/intake/container/base.py:58, in RemoteSource._parse_open_response(self, response)
     57 def _parse_open_response(self, response):
---> 58     dtype_descr = response['dtype']
     59     if isinstance(dtype_descr, list):
     60         # Reformat because NumPy needs list of tuples
     61         dtype_descr = [tuple(x) for x in response['dtype']]

KeyError: 'dtype'


Version information: output of intake_esm.show_versions()

Paste the output of intake_esm.show_versions() here:

import intake_esm

intake_esm.show_versions()

INSTALLED VERSIONS
------------------

cftime: 1.5.2
dask: 2022.01.1
fastprogress: 0.2.7
fsspec: 2022.01.0
gcsfs: None
intake: 0.6.5
intake_esm: 2021.8.17
netCDF4: 1.5.8
pandas: 1.3.5
requests: 2.27.1
s3fs: None
xarray: 0.21.1
zarr: 2.11.0


wachsylon avatar Feb 10 '22 16:02 wachsylon

Thank you for the reproducible example, @wachsylon! I will look into this and will get back to you

andersy005 avatar Feb 10 '22 20:02 andersy005

@andersy005 thank you i appreciate it.

wachsylon avatar Feb 15 '22 14:02 wachsylon

@andersy005 Do you have found sth? It would be really helpful as it is a blocker for me. Thanks in advance!

wachsylon avatar Feb 21 '22 17:02 wachsylon

Thank you for your patience, @wachsylon! Unfortunately, I haven't had time to look into the root cause of this issue.

I tried out the intake-server. In the end, I would like to have a server for some or all cats of: swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm

I'm curious.... What are the benefits of exposing these catalogs via an intake-server instead of a regular top-level/main catalog?

I am imagining a top-level YAML file with the following contents. Users should be able to point intake to this main/parent catalog

description: 'DKRZ master catalog for all data pool catalogs available'
plugins:
  source:
    - module: intake_esm

sources:
  dkrz_cmip6_cloud_zarr:
    args:
      esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
    description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
    driver: intake_esm.esm_datastore

    another_catalog:
      args:
        esmcol_obj: https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/dkrz_cmip6_swift_zarr_fromcloud.json
      description: dkrz cmip6 data on disk saved as netcdf retrieved fromcloud
      driver: intake_esm.esm_datastore

In [18]: import intake

In [19]: cat = intake.open_catalog("temp.yaml")

In [20]: list(cat)
Out[20]: ['dkrz_cmip6_cloud_zarr']

In [21]: esmcat = cat["dkrz_cmip6_cloud_zarr"]

In [22]: esmcat.df.head()
Out[22]: 
                                              prefix  ...    version
0  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
1  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
2  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
3  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218
4  CMIP6.CMIP.AWI.AWI-CM-1-1-MR.historical.r1i1p1...  ...  v20181218

[5 rows x 12 columns]

andersy005 avatar Feb 22 '22 22:02 andersy005

@andersy005 The main use case for an intake server would be to return subsets of catalogs in case users cannot handle the memory. Our data base for the CMIP6 catalog is about 4PB which gives me a .csv.gz list of about 400MB. If this is loaded entirely, users quickly exceed the available memory.

I know that we could create a hierarchy of catalogs and create catalogs on finer level but that may not fit to many use cases as e.g. in CMIP6, users are interested in several MIPs (=activity, e.g. ScenarioMIP or PMIP) at once. We could also wait for a STAC solution but at MPI-M, intake is and will be used a lot anyway so that I would like to get intake server to work. I also could not start an intake server for Pangeo btw.

If I understood correctly, the intake server can cache the catalog. Therefore, the server only loads the catalog for many requests, correct? I can set up a VM which users can use to subset the catalog.

wachsylon avatar Feb 23 '22 13:02 wachsylon

Thank you for the clarification/details, @wachsylon! I haven't used the intake-server before. From my short experimentation, it appears that there are some things that are missing within intake-esm to allow seamless integration with the intake-server I'll do my best to find time to look into this today/tomorrow.

andersy005 avatar Feb 23 '22 14:02 andersy005

Any news on that?

some things that are missing within intake-esm to allow seamless integration with the intake-server

sounds bad :(

wachsylon avatar Mar 01 '22 17:03 wachsylon

Serving intake-esm catalog and assets via intake-server will require rewriting some of the components of intake-esm. Unfortunately, my schedule is too tight and I don't have time to look into this extensively any time soon but I'd be happy to review pull requests if someone is interested in pursuing this...

andersy005 avatar Mar 02 '22 17:03 andersy005

Ok thanks for letting me know! And thanks a lot for your help!

wachsylon avatar Mar 03 '22 12:03 wachsylon