Fetch data from a Zarr repo
Problem: the CDS store has become really slow to fetch data from for projects covering a small area of the globe (e.g. a mountain range) over many years.
- Solution 1: Optimize CDS queries
- Solution 2: find a Zarr repository of the ERA5 pressure-level and surface data, such as the Google one:
  - https://github.com/google-research/arco-era5
  - https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3
Solution 2: it is easy to connect to the Google Storage Zarr archive. The issue comes when trying to write the subset dataset to local disk: the method below blows up memory and eventually crashes, even for a single day's worth of data (1 MB).
import xarray

ds = xarray.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    chunks={'time': 48, 'latitude': 10, 'longitude': 10},
    storage_options=dict(token='anon'),
    consolidated=True)

ds_plev = ds.sel(
    time=slice('2021-01-01', '2021-01-02'),
    latitude=slice(47, 44),
    longitude=slice(4, 7),
    level=[700, 750, 800, 850, 900, 950, 1000],
)[['geopotential', 'temperature', 'u_component_of_wind',
   'v_component_of_wind', 'specific_humidity']]

sub = ds_plev.load()
sub.to_netcdf('test.nc')
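A likely culprit (my reading, only partly confirmed by Luke's chunking remark below): passing small chunks= to open_zarr asks dask to re-chunk the entire global archive at open time, before any subsetting, which builds an enormous task graph. A minimal sketch of a workaround, keeping everything else the same, is to skip the re-chunking and subset first:

import xarray

# Sketch of a workaround: open lazily without imposing tiny chunks, then subset.
ds = xarray.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    chunks=None,                        # no dask re-chunking of the global arrays
    storage_options=dict(token='anon'),
    consolidated=True)

ds_plev = ds.sel(
    time=slice('2021-01-01', '2021-01-02'),
    latitude=slice(47, 44),
    longitude=slice(4, 7),
    level=[700, 750, 800, 850, 900, 950, 1000],
)[['geopotential', 'temperature', 'u_component_of_wind',
   'v_component_of_wind', 'specific_humidity']]

# Only the stored chunks that intersect the selection are fetched at load time;
# each of those chunks still spans the whole globe (see Luke's mail below), so
# the transfer is bigger than the subset itself, but memory stays bounded.
ds_plev.load().to_netcdf('test.nc')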
Related discussion with Luke that led him to build his tool:
On 3 Dec 2024, at 17:50, Gregor Luke [email protected] wrote:
Hi Joel,
Here it is :) https://gitlab.renkulab.io/pamir/era5-downloader There’s a notebook that shows how to use it. You can also launch a Renku compute session to actually do the downloading: https://renkulab.io/projects/pamir/era5-downloader/sessions/new?autostart=1 This link will launch the session with the max resources that you have access to (2CPU, 8GB RAM, 64 GB disk).
On Renku, it’s also possible to mount an S3 bucket for storage, meaning you could actually download directly onto an S3 bucket.
The script is 10 seconds faster per day-time-step than before! So down from 7 to 5 days. And if you have access to multiple machines (eg renku sessions), or team up with someone you could reduce this quite a bit!
Just to give you an idea of how it works, it fetches the netCDF files rather than using the zarr files.
- downloads netCDF files from Google Cloud Store caching them locally
- read in cached files as xarray Dataset
- delete cached files (otherwise storage balloons quickly)
- write xarray Dataset to netCDF file in the specified path.
Let me know if you run into any bugs. This is something I’ll be using from now on too, so also a useful little tool for me :)
Cheers, Luke
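For reference, a rough sketch of the fetch/cache/read/delete/write pattern Luke describes above. This is not the era5-downloader code itself, and the remote file listing is a placeholder, not a real bucket layout:

import os
import tempfile

import fsspec
import xarray

fs = fsspec.filesystem('gs', token='anon')

# Placeholder only: the actual per-file netCDF layout on Google Cloud Storage
# is whatever era5-downloader points at.
remote_files = ['gs://<bucket>/<prefix>/era5_plev_2021-01-01.nc']

with tempfile.TemporaryDirectory() as cache_dir:
    local_files = []
    for remote in remote_files:
        local = os.path.join(cache_dir, os.path.basename(remote))
        fs.get(remote, local)                      # download and cache locally
        local_files.append(local)

    ds = xarray.open_mfdataset(local_files, combine='by_coords')  # read cached files as one Dataset
    ds.load()                                      # pull into memory before the cache goes away
    ds.to_netcdf('era5_subset.nc')                 # write out as a single netCDF
# the temporary cache directory (and the downloaded files) is deleted here,
# which keeps local storage from ballooning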
On Dec 3, 2024, at 10:11, Joel Caduff-Fiddes [email protected] wrote:
You're a legend Luke - thanks!
J
On 3 Dec 2024, at 10:08, Gregor Luke [email protected] wrote:
Cool :) Glad it helped. I’m actually writing something now that should make it a lot quicker and more memory efficient since it fetches the raw files rather than using the Zarr data. Will send it your way once it’s usable.
Cheers, Luke
On Dec 3, 2024, at 10:00, Joel Caduff-Fiddes [email protected] wrote:
Hi Luke
Thanks so much, that really cleared it up - I think it has most definitely solved my problem!
I was just confused why my seemingly small request was not running. But I can deal with single-day downloads, and actually I can get all my data in 7 days, right? Which is pretty damn good. This would have taken easily a month or more on the previous CDS and some indeterminable amount of time on the new one!
Getting it running now….
Cheers! Joel
On 2 Dec 2024, at 17:50, Gregor Luke [email protected] wrote:
Hi Joel,
The main problem, as you suspected, is in part the levels but also how they chunked the data. The chunks are as follows {time=1, levels=38, lat=720, lon=1440}, meaning that you have to download the entire globe even if you want a single pixel… it's ridiculous. This means that instead of your request being 31 MB for a single day, it's actually 1.3 GB in the backend. I've attached a notebook that you can run that demonstrates this. I could get this to download quickly by using multiple VMs that had large download bandwidth, meaning I could go from taking 7 days to 1 day. But it downloads a huge amount of data just for a small domain.
Anyhoo, I hope this helps in some way even though it probably doesn’t solve your issue.
Cheers, Luke
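A quick way to confirm the chunk layout Luke quotes above (a sketch; the exact numbers printed are whatever the store actually reports):

import xarray

# Open lazily and inspect the on-disk chunking recorded in the encoding.
ds = xarray.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    chunks=None,
    storage_options=dict(token='anon'),
    consolidated=True)

print(ds['temperature'].dims)                    # dimension order of the stored array
print(ds['temperature'].encoding.get('chunks'))  # on-disk chunk shape, cf. Luke's numbers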
On Dec 2, 2024, at 15:46, Joel Caduff-Fiddes [email protected] wrote:
Hey Luke,
Hope all good with you. I'm currently trying to deal with the whole CDS screw-up for ERA5 downloads and came across Google's little stash :)
Actually it is probably Google and others that brought the CDS infrastructure down!
Anyways I implemented something like below from here:
ERA5 data | Cloud Storage | Google Cloud - cloud.google.com
This works great for surface levels (1 month of hourly data for around 7 variables takes 100 s) but not for an equivalent volume of data on pressure levels (i.e. instead of 7 variables I have 1 variable on 7 pressure levels, so the volume of data is the same).
So I wondered if I am doing something fundamentally wrong with the .sel call on levels (code is below the surface code): it either times out or is killed on memory (I have 32 GB on the machine). I guess it's because all the data (i.e. 36 pressure levels) needs loading into memory before the .sel function works.
Tamara mentioned you also access the Google dataset, so if you have any insights into how you do this or why surface and pressure-level retrievals seem to behave differently even when coming from the same Zarr file - that would be amazing!
Cheers
Joel
Btw I have actually lost my switch account due to WSL's decision to delete the subscription, so I will just get Evan or someone to upload my datasets.
I've been using this for downloading big datasets (whole Alps / 50 y) successfully - it took about 10 days I think.
@ArcticSnow an update on a download job, what I'm doing:
Luke's code: https://pypi.org/project/era5-downloader/0.1.5/
Large area: HKH
So far 5 years in 18 h (full hourly data, all surface and pressure-level data) - scales to around 10 days for 20 years. I think this is better than CDS, although I haven't tried it recently.
Good to know. That's quite a bit faster than CDS indeed. What I recall is that it downloads from the netCDF store and not the Zarr, right? I may reconsider it then. It would be nice to pull data directly from the Zarr store and not use netCDF at all.