earthkit-data

Issue with some zipped vector data

Open malmans2 opened this issue 10 months ago • 5 comments

What happened?

We are having issues opening some zipped vector data. In the snippet below, earthkit.data works fine with version="rgi_6_0", but it raises an error with version="rgi_7_0".

What are the steps to reproduce the bug?

import earthkit.data

dataset = "insitu-glaciers-extent"
request = {
    "variable": "glacier_area",
    "product_type": "vector",
}

for version in ("rgi_6_0", "rgi_7_0"):
    ds = earthkit.data.from_source("cds", dataset, request | {"version": version})
    try:
        df = ds.to_pandas()
    except Exception as exc:
        print(f"{version = }: {exc!s}")
        raise
    else:
        print(f"{version = }: OK!")

Version

0.12.1

Platform (OS and architecture)

Darwin MacBook-Pro-di-Bopen.local 24.3.0 Darwin Kernel Version 24.3.0: Thu Jan 2 20:24:24 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6030 arm64

Relevant log output

Unknown file type, no reader available. path=/var/folders/z4/9f32__x92kl340wxp0m4hfym0000gp/T/tmp4_wsms_8/cds-023bde4eed526ccb72379966cbf08d5cfebea278f82f2143efc49ec801b34338.d/rgi2000_v70_vector.shp magic=b"\x00\x00'\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00$\xceo\xa4\xe8\x03\x00\x00\x0f\x00\x00\x00#h\xcc$\xea}f\xc0;\x1c]\xa5\xbb\x93S\xc0\x96A\xb5\xc1\txf@\x96\xcc\xb1\xbc" content_type=None
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[1], line 12
     10 ds = earthkit.data.from_source("cds", dataset, request | {"version": version})
     11 try:
---> 12     df = ds.to_pandas()
     13 except Exception as exc:
     14     print(f"{version = }: {exc!s}")

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/core/__init__.py:50, in Base.to_pandas(self, **kwargs)
     47 @abstractmethod
     48 def to_pandas(self, **kwargs):
     49     """Converts into a pandas dataframe"""
---> 50     self._not_implemented()

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/core/__init__.py:155, in Base._not_implemented(self)
    153 if hasattr(self, "path"):
    154     extra = f" on {self.path}"
--> 155 raise NotImplementedError(f"{module}.{name}.{func}(){extra}")

NotImplementedError: earthkit.data.sources.empty.EmptySource.to_pandas()

Accompanying data

No response

Organisation

B-Open/EQC

malmans2, Feb 06 '25 14:02

@malmans2, thank you for reporting this issue. When I try to run it with:

earthkit-data develop, cdsapi 0.7.4

I get the following error for both versions:

HTTPError: 400 Client Error: Bad Request for url: https://cds.climate.copernicus.eu/api/retrieve/v1/processes/insitu-glaciers-extent/execution
invalid request
Request has not produced a valid combination of values, please check your selection.
{'variable': 'glacier_area', 'product_type': 'vector', 'version': 'r'}

sandorkertesz, Feb 10 '25 09:02

Are you sure you are running the exact same snippet I sent you? Looks like your version is incorrect: 'version': 'r'. It should be either "rgi_6_0" or "rgi_7_0".

Maybe you have a bug in your code and you are iterating over a string rather than an iterable of strings?
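One way a request with 'version': 'r' could arise is shown below. This is only an illustration of the general Python pitfall; the thread does not say what the actual mistake was, and the variable names are hypothetical. Parentheses around a single string do not make a tuple, so a loop over them iterates character by character.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
not_a_tuple = ("rgi_6_0")   # this is just the string "rgi_6_0"
a_tuple = ("rgi_6_0",)      # this is a one-element tuple

# The first item a for-loop would see in each case.
first_from_string = next(iter(not_a_tuple))  # a single character
first_from_tuple = next(iter(a_tuple))       # the whole version string

print(repr(first_from_string))
print(repr(first_from_tuple))
```

With the string, the first loop value is the single character 'r', which matches the invalid request shown in the error message.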

malmans2, Feb 10 '25 09:02

I am sorry, you were right: I used the wrong request. Now I am able to reproduce the error.

sandorkertesz, Feb 10 '25 10:02

@malmans2, earthkit-data cannot read the data retrieved for the "rgi_7_0" version because it cannot identify it as a valid shapefile. A shapefile consists of multiple files, and earthkit-data expects three of them to be present, with the following suffixes:

MANDATORY = (".shp", ".shx", ".dbf")

If we look at the contents of the downloaded data after extracting it into a directory, we see this:

260805380 10 Feb 10:52 rgi2000 v70_vector.dbf
145 10 Feb 10:52 rgi2000_v70_vector.prj
1235017544 10 Feb 10:52 rgi2000_v70_vector.shp
2196348 10 Feb 10:52 rgi2000_v70_vector.shx

Here the filename "rgi2000 v70_vector.dbf" is incorrect because it contains a whitespace instead of an underscore: "2000 v70" instead of "2000_v70". As a result, the .dbf file does not share its base name with the .shp and .shx files.
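The effect of the mismatched name can be reproduced with a minimal sketch. It is not the actual earthkit-data implementation: the helper missing_shapefile_parts is hypothetical, and it assumes the check simply looks for files that share the base name of the .shp file. The file names are the ones from the listing above, created as empty placeholders.

```python
from pathlib import Path
import tempfile

MANDATORY = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the mandatory suffixes with no file sharing the .shp base name.

    Hypothetical helper for illustration only.
    """
    shp_path = Path(shp_path)
    return [s for s in MANDATORY if not shp_path.with_suffix(s).exists()]


# Recreate the directory listing with empty placeholder files.
names = (
    "rgi2000 v70_vector.dbf",  # whitespace instead of underscore
    "rgi2000_v70_vector.prj",
    "rgi2000_v70_vector.shp",
    "rgi2000_v70_vector.shx",
)
with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        (Path(tmp) / name).touch()
    missing = missing_shapefile_parts(Path(tmp) / "rgi2000_v70_vector.shp")

print(missing)
```

Under this assumption, only the .dbf component is reported as missing, even though a .dbf file is present in the directory.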

I can see 2 possible solutions to this:

  • fix the filename on the CDS side
  • relax the checks in earthkit-data so that small differences in the shapefile filenames are allowed
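As a rough sketch of what the second option could look like, the snippet below matches shapefile components after normalising whitespace in the base name to underscores. The functions normalise_stem and find_parts_relaxed are hypothetical and are not part of earthkit-data; the normalisation rule is an assumption chosen to fit this particular case.

```python
from pathlib import Path
import re
import tempfile

MANDATORY = (".shp", ".shx", ".dbf")


def normalise_stem(stem):
    """Collapse runs of whitespace into a single underscore and lowercase."""
    return re.sub(r"\s+", "_", stem).lower()


def find_parts_relaxed(shp_path):
    """Map each mandatory suffix to a file whose normalised base name matches.

    Hypothetical helper for illustration only.
    """
    shp_path = Path(shp_path)
    target = normalise_stem(shp_path.stem)
    found = {}
    for candidate in shp_path.parent.iterdir():
        suffix = candidate.suffix.lower()
        if suffix in MANDATORY and normalise_stem(candidate.stem) == target:
            found[suffix] = candidate.name
    return found


names = (
    "rgi2000 v70_vector.dbf",
    "rgi2000_v70_vector.prj",
    "rgi2000_v70_vector.shp",
    "rgi2000_v70_vector.shx",
)
with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        (Path(tmp) / name).touch()
    found = find_parts_relaxed(Path(tmp) / "rgi2000_v70_vector.shp")

print(sorted(found))
```

A relaxed check of this kind may not be sufficient on its own, since the library that actually reads the shapefile would presumably still look for a .dbf file with the same base name as the .shp file. Renaming the mismatched file after extraction might therefore also be needed.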

sandorkertesz, Feb 11 '25 19:02

Got it, thanks for the details. I will inform the EQC evaluator and the CDS technical officer.

malmans2, Feb 12 '25 09:02

@malmans2, I presume this issue can be closed. Please reopen it if there is anything to do on the earthkit side.

sandorkertesz, Jun 16 '25 08:06