Draft: support CMIP6 from Google Cloud

Open hboisgon opened this issue 2 years ago • 5 comments

This PR sets out the steps needed to support CMIP6 data from Google Cloud Storage:

  • [x] new predefined catalog for remote data: cmip6_data
  • [x] support for remote data with a yml meta flag for remote and a new per-source filesystem attribute.
  • [x] use fsspec and upath in resolve_path. Additional dependencies on upath and, for Google Cloud, gcsfs
  • [x] direct support for CMIP6 datasets requires a new harmonise_dims preprocess function
  • [x] RasterDataset zarr driver: possibility to read from several zarr stores. Enable PREPROCESSORS before merging
  • [x] add test
  • [x] instead of remote flag try fixing abs_path method in data_catalog (use uri_validator) or at from_dict level before abs_path
  • [x] for gs: fix in glob maybe look at https://github.com/fsspec/universal_pathlib (s3fs has the same problem)
  • [ ] add docs
  • [ ] optional: add example
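
For the uri_validator item above, a minimal sketch of what such a check could look like, using only the standard library. The function name `is_remote_uri` and the rule it applies (a scheme longer than one character plus a non-empty network location) are assumptions for illustration, not the code in this PR:

```python
from urllib.parse import urlparse


def is_remote_uri(path: str) -> bool:
    """Return True if `path` looks like a remote URI such as gs://bucket/key.

    Hypothetical helper: a single-letter scheme is treated as a Windows
    drive letter (C:\\data), and a missing netloc as a local path.
    """
    parsed = urlparse(str(path))
    return len(parsed.scheme) > 1 and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("gs://cmip6/CMIP6/CMIP", "C:\\data\\era5.nc", "data/era5.nc"):
        print(candidate, is_remote_uri(candidate))
```

With a check like this, abs_path could leave remote URIs untouched and only resolve local relative paths. Note that this sketch also treats `file:///...` URIs as local, since they have an empty netloc.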

hboisgon avatar Nov 29 '22 09:11 hboisgon

@DirkEilander : I implemented some new steps to be able to directly access CMIP6 data from Google Cloud Storage using the pangeo example you sent. It seems to work now but before I go further with testing and documentation, I would really like to have your first comments on it

hboisgon avatar Nov 29 '22 09:11 hboisgon

About universal_pathlib: I tried using UPath only in the resolve_path method (commented out in commit https://github.com/Deltares/hydromt/pull/250/commits/e5e756923e89007c8b9318c7abd9903aec22a52e). I think the main issue is that with some filesystems the first part of the path is not understood correctly (pathlib root/drive/anchor are empty for gcs). Here is what I found while working around it:

  • Unlike fsspec (which is used in the background), UPath uses the pathlib.Path.glob method, which takes the form path.glob("*") instead of glob(path). This means the path needs to be split properly into a base path and a pattern.
  • If '*' wildcards are at the directory levels and not in the filename (e.g. "gs://cmip6/CMIP6/CMIP/{model}/historical/{member}/{timestep}/{variable}/*/*"), there is no easy way to get the last part of the path ("*/*") other than using pathlib.Path.parent and pathlib.Path.parts and joining the pieces back together (splitting the str on * causes problems as well, as parts of the path can become incomplete, e.g. with "/tas/variable_*.py").
  • When using pathlib.Path.parts to recreate the last part of the path, the filesystem is lost and falls back to the OS default (you would get \ instead of / on Windows, which results in no file being found).
  • You can access the specific separator of a filesystem in pathlib with Path._flavour.sep, but this is not part of the official API...
  • Pathlib methods like Path.relative_to would strangely give "gs://cmip6/*/*" instead of just "*/*".

In the end I could still use UPath and joinpath to restore the "gs://" prefix, which is lost when using the fsspec glob method.
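
The two problems described above (splitting a wildcard path without falling back to the OS separator, and restoring the protocol prefix on glob results) can be sketched with plain string handling. Both helper names, `split_glob` and `restore_protocol`, are hypothetical and this is not the code in this PR. It assumes "/" is the separator for every remote filesystem and that glob results come back without the "gs://" prefix:

```python
from typing import List, Tuple

WILDCARDS = "*?["


def split_glob(uri: str) -> Tuple[str, str]:
    """Split a URI into (base, pattern) at the first component with a wildcard.

    Always splits on "/" so the result does not depend on the local OS.
    """
    protocol, sep, rest = uri.partition("://")
    if not sep:  # no protocol present, treat the whole string as the path
        protocol, rest = "", uri
    parts = rest.split("/")
    # Index of the first path component containing a wildcard character.
    first = next(
        (i for i, part in enumerate(parts) if any(c in part for c in WILDCARDS)),
        len(parts),
    )
    prefix = f"{protocol}://" if protocol else ""
    return prefix + "/".join(parts[:first]), "/".join(parts[first:])


def restore_protocol(paths: List[str], protocol: str) -> List[str]:
    """Re-attach 'protocol://' to paths that were returned without it."""
    prefix = f"{protocol}://"
    return [p if p.startswith(prefix) else prefix + p.lstrip("/") for p in paths]
```

For example, `split_glob("gs://cmip6/CMIP6/CMIP/*/historical/*/tas_*.nc")` yields the base `gs://cmip6/CMIP6/CMIP` and the pattern `*/historical/*/tas_*.nc`, which is the shape that a `path.glob(pattern)` style call expects.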

hboisgon avatar Dec 09 '22 06:12 hboisgon

Thanks for the update @hboisgon!

Some thoughts/ suggestions:

  • It'd be nice if we could directly include other filesystems like s3 in this PR
  • Ideally the specific filesystem packages (e.g. gcsfs) should not be a required dependency. We could add a list of optional 'cloud' dependencies. But then we need to make sure that hydromt runs without these dependencies as well. I can do some recoding to make this work if you want me to.
  • The fix with UPath you applied, would that work with normal Path as well?

Other interesting open datasets for our reference:

  • ERA5 (GS) https://cloud.google.com/storage/docs/public-datasets/era5
  • Copernicus DEM (S3) https://registry.opendata.aws/copernicus-dem/

DirkEilander avatar Dec 09 '22 07:12 DirkEilander

Hi @DirkEilander !

Agreed, we can use this PR to also include other filesystems and a bit more data. I also thought it would be nice to have gcsfs and s3fs as optional dependencies. We could add some steps in the try/except part of resolve_path where we import the different packages.
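
A minimal sketch of how such an optional-dependency check could look, assuming a hypothetical helper `check_filesystem_dependency` and a hypothetical protocol-to-package mapping. Only the package names gcsfs and s3fs come from this thread. The rest is illustrative and not the code in this PR:

```python
import importlib.util
from typing import Dict

# Assumed mapping from URI protocol to the optional package that serves it.
CLOUD_PACKAGES: Dict[str, str] = {"gs": "gcsfs", "gcs": "gcsfs", "s3": "s3fs"}


def check_filesystem_dependency(
    protocol: str, packages: Dict[str, str] = CLOUD_PACKAGES
) -> None:
    """Raise a helpful ImportError if the package for `protocol` is missing.

    Protocols without an entry in `packages` (e.g. local files) pass silently.
    """
    package = packages.get(protocol)
    if package is None:
        return
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"Reading '{protocol}://' paths requires the optional package "
            f"'{package}'. Install it to enable this filesystem."
        )
```

Calling this at the start of resolve_path would keep the cloud packages out of the required dependencies while still giving a clear error message when a remote source is requested without them.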

The fix with UPath should work for other filesystems as well. It basically replaces the first part of the path, which is incomplete after going through glob. I assume the same happens with s3, but I haven't tested it yet.

Does not really matter if you or I continue with this PR, whatever you prefer :)

hboisgon avatar Dec 09 '22 07:12 hboisgon

Note: I also tested compatibility with the CST scripts to derive climate statistics, and with hydromt_wflow to prepare precip and temp_pet forcing.

hboisgon avatar Dec 09 '22 08:12 hboisgon

I think this PR is ready, and in the changelog I flagged remote data as experimental. I throw error messages for all unsupported or untested drivers (except DataFrame, but this one should really work out of the box). I tested the two datasets you mentioned, Dirk, plus ESA Worldcover:

  • Copernicus DEM (S3): no *.vrt file is available and individual tiles are stored in separate folders. This makes finding and reading the data very slow, so I skipped it for now.
  • ESA Worldcover (S3): this one does have a *.vrt file for version v100/2020, so it can be used directly :) It is missing for v200/2021, so that version has the same problem as Copernicus DEM.
  • ERA5 (GCS): very easy to read, but the dimensions are (time, values) with longitude and latitude in the coordinates. So it either needs regridding or maybe the rotated grid PR to be useful.

hboisgon avatar Feb 10 '23 06:02 hboisgon