Draft: support CMIP6 from Google Cloud
This PR adds several steps to support CMIP6 data from Google Cloud Storage:
- [x] new predefined catalog for remote data: `cmip6_data`
- [x] support for remote data with a yml meta flag for `remote` and a new per-source `filesystem` attribute
- [x] use fsspec and upath in `resolve_path`; additional dependencies on upath and gcsfs for Google Cloud
- [x] direct support for the CMIP6 dataset requires a new `harmonise_dims` preprocess function
- [x] RasterDataset zarr driver: possibility to read from several zarr stores; enable PREPROCESSORS before merging (see the sketch after this list)
- [x] add test
- [x] instead of the remote flag, try fixing the `abs_path` method in data_catalog (use `uri_validator`) or at `from_dict` level before `abs_path`
- [x] for gs: fix in glob; maybe look at https://github.com/fsspec/universal_pathlib (s3fs has the same problem)
- [ ] add docs
- [ ] optional: add example
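For context, a minimal sketch (not the PR implementation) of what this enables under the hood: opening several CMIP6 zarr stores from the public GCS bucket with anonymous access, harmonising dimension names with a preprocess function, and merging the results. The store paths and the rename mapping are illustrative placeholders, and the actual `harmonise_dims` may do more.

```python
# Minimal sketch, not the PR code: read two CMIP6 zarr stores anonymously from GCS,
# harmonise dimension/coordinate names, then merge into a single dataset.
import gcsfs
import xarray as xr


def harmonise_dims(ds: xr.Dataset) -> xr.Dataset:
    """Illustrative preprocess: rename coordinates to a common convention."""
    rename = {k: v for k, v in {"longitude": "lon", "latitude": "lat"}.items() if k in ds}
    return ds.rename(rename)


fs = gcsfs.GCSFileSystem(token="anon")  # anonymous access to the public cmip6 bucket
stores = [
    # placeholders: real paths follow gs://cmip6/CMIP6/<activity>/<institute>/<model>/...
    "gs://cmip6/CMIP6/CMIP/<institute>/<model>/historical/<member>/day/tas/<grid>/<version>",
    "gs://cmip6/CMIP6/CMIP/<institute>/<model>/historical/<member>/day/pr/<grid>/<version>",
]
datasets = [harmonise_dims(xr.open_zarr(fs.get_mapper(uri), consolidated=True)) for uri in stores]
ds = xr.merge(datasets)
```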
@DirkEilander: I implemented some new steps to directly access CMIP6 data from Google Cloud Storage, using the pangeo example you sent. It seems to work now, but before I go further with testing and documentation I would really like to have your first comments on it.
About universal_pathlib: I tried using UPath only in the `resolve_path` method (commented out in commit https://github.com/Deltares/hydromt/pull/250/commits/e5e756923e89007c8b9318c7abd9903aec22a52e). I think the main issue is that with some filesystems the first part of the path is not understood correctly (the pathlib root/drive/anchor are empty for gcs). This is what I found while working around it:
- Compared to fsspec (which is used in the background), UPath uses the `pathlib.Path.glob` method, which has the form `path.glob("*")` instead of `glob(path)`. This means the path needs to be properly split.
- If `*` wildcards are at the directory levels and not in the filename (e.g. `gs://cmip6/CMIP6/CMIP/{model}/historical/{member}/{timestep}/{variable}/*/*`), there is no easy way to get the last part of the path (`*/*`) other than using `pathlib.Path.parent` and `pathlib.Path.parts` and recollating them (splitting the string on `*` causes problems as well, since parts of the path can become incomplete, e.g. `*/tas/variable_*.py`).
- When using `pathlib.Path.parts` to recreate the last part of the path, the problem is that the filesystem is lost and falls back to the os default (you would get `\` instead of `/` on Windows, which results in no file found).
- You can access the specific `sep` of a filesystem in pathlib with `Path._flavour.sep`, but this is not in the official API...
- Pathlib methods like `Path.relative_to` would strangely give `gs://cmip6/*/*` instead of just `*/*`.

In the end I could still use UPath and `joinpath` to correct for the `gs://` prefix being lost when using the fsspec glob method.
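To make the glob/prefix issue above concrete, here is a minimal, hypothetical sketch (not the code in this PR), assuming anonymous access to the public cmip6 bucket: fsspec's `glob` returns matches without the protocol prefix, and `UPath.joinpath` is used to re-attach it.

```python
# Minimal sketch of the workaround, not the PR implementation.
import fsspec
from upath import UPath

pattern = "gs://cmip6/CMIP6/CMIP/*/*/historical/*/day/tas/*/*"
fs = fsspec.filesystem("gs", token="anon")
hits = fs.glob(pattern)  # e.g. "cmip6/CMIP6/CMIP/...": the "gs://" prefix is gone
root = UPath("gs://cmip6", token="anon")
# drop the bucket name (already part of root) and re-attach the prefix via joinpath
paths = [root.joinpath(*hit.split("/")[1:]) for hit in hits]
```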
Thanks for the update @hboisgon!
Some thoughts/suggestions:
- It'd be nice if we could directly include other filesystems like s3 in this PR
- Ideally the specific filesystem packages (e.g. gcsfs) should not be required dependencies. We could add a list of optional 'cloud' dependencies, but then we need to make sure that hydromt also runs without these dependencies. I can do some recoding to make this work if you want me to.
- The fix with UPath you applied, would that also work with a normal Path?
Other interesting open datasets for our reference:
- ERA5 (GS) https://cloud.google.com/storage/docs/public-datasets/era5
- Copernicus DEM (S3) https://registry.opendata.aws/copernicus-dem/
Hi @DirkEilander!
Agreed, we can use this PR to also include other filesystems and a bit more data. I also thought it would be nice to have gcsfs and s3fs as optional dependencies. We could add some steps in the try/except part of `resolve_path` where we import the different packages.
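A rough sketch of what that could look like (the helper name below is made up for illustration, not part of hydromt):

```python
# Hypothetical helper: import the cloud filesystem packages lazily so that
# hydromt still runs when they are not installed.
def _get_cloud_filesystem(protocol: str, **kwargs):
    if protocol == "gs":
        try:
            import gcsfs
        except ImportError:
            raise ImportError("reading from Google Cloud Storage requires the optional gcsfs package")
        return gcsfs.GCSFileSystem(**kwargs)
    if protocol == "s3":
        try:
            import s3fs
        except ImportError:
            raise ImportError("reading from S3 requires the optional s3fs package")
        return s3fs.S3FileSystem(**kwargs)
    raise ValueError(f"unsupported protocol: {protocol}")
```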
The fix with UPath should work for other filesystems as well. It basically replaces the first part of the path, which is incomplete after going through glob. I assume the same happens with s3, but I didn't test it yet.
Does not really matter if you or I continue with this PR, whatever you prefer :)
Note: also tested compatibility with CST scripts to derive climate statistics, and with hydromt_wflow to prepare precip and temp_pet forcing.
I think this PR is ready; in the changelog I flagged remote data as experimental. I throw error messages for all drivers that are not supported or tested (except DataFrame, but that one should really work out of the box). I tested the two datasets you mentioned, Dirk:
- Copernicus DEM (S3): no *.vrt file is available and individual tiles are stored in separate folders. This makes finding and reading the data very slow, so I skipped it for now.
- ESA Worldcover (S3): this one does have a *.vrt file for version v100/2020, so it can be used directly (see the sketch below) :) It is missing for v200/2021, so that version has the same problem as the Copernicus DEM.
- ERA5 (GCS): very easy to read, but the dimensions are (time, values) with longitude and latitude in the coordinates. So it either needs regridding or maybe the rotated grid PR to be useful.
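For reference, a minimal sketch of reading such a public *.vrt directly from S3 with anonymous access; the bucket layout and object key below are assumptions based on the ESA Worldcover open-data listing and may need checking.

```python
# Sketch only: the S3 key is assumed, not verified against the registry.
import os
import rioxarray

os.environ["AWS_NO_SIGN_REQUEST"] = "YES"  # let GDAL read the public bucket anonymously
da = rioxarray.open_rasterio(
    "s3://esa-worldcover/v100/2020/ESA_WorldCover_10m_2020_v100_Map_AWS.vrt",
    masked=True,
    chunks={"x": 4096, "y": 4096},
)
```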