atlite
Enable download of large (spatial extent) cutouts from ERA5 via cdsapi.
Closes #221.
Change proposed in this Pull Request
Split download of ERA5 into monthly downloads (currently: annual downloads) to prevent too-large downloads from ERA5 CDSAPI.
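The idea of slicing the retrieval period into calendar months can be sketched as follows. This is illustrative only; `monthly_slices` is a hypothetical helper, not atlite's actual implementation:

```python
import pandas as pd

def monthly_slices(start, end):
    """Split the period [start, end] into (month_start, month_end) pairs,
    one per calendar month, so that each CDS request stays small."""
    month_starts = pd.date_range(start, end, freq="MS")  # "MS" = month start
    return [(m, m + pd.offsets.MonthEnd(0)) for m in month_starts]

# One request per month instead of a single annual request
slices = monthly_slices("2013-01-01", "2013-12-31")
```

Each of the twelve slices would then be submitted to the CDS API as an independent request.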
TODO
- [x] Add month indicator to progress prompts.
Description
Motivation and Context
See #221.
How Has This Been Tested?
Locally by downloading a large cutout.
Type of change
- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [n/a] Breaking change (fix or feature that would cause existing functionality to change)
Checklist
- [x] I tested my contribution locally and it seems to work fine.
- [x] I locally ran `pytest` inside the repository and no unexpected problems came up.
- [x] I have adjusted the docstrings in the code appropriately.
- [x] I have documented the effects of my code changes in the documentation `doc/`.
- [n/a] I have added newly introduced dependencies to the `environment.yaml` file.
- [x] I have added a note to the release notes `doc/release_notes.rst`.
- [x] I have used `pre-commit run --all` to lint/format/check my contribution.
How does that interact with queuing at CDSAPI? Does that increase the chances of getting stuck in the request in month 9 or so?
I don't know.
The downloads for the larger cutouts worked relatively smoothly (1-2 hours), but the number of requests is 12x higher for a normal year, so the chances might be higher. On the other hand, since the downloaded slices are smaller I would not expect major performance changes. Probably acceptable, since you're not downloading cutouts on an everyday basis.
I don't know enough about the internals of the ERA5 climate store and I don't think we should optimise our retrieval routines for it as long as we haven't received any complaints for bad performance.
Alright. I did not encounter any issues downloading large datasets. Seems to work nicely @FabianHofmann .
What would be helpful is a message indicating which month/year combination is currently being downloaded. Do you have an idea how to implement this easily, @FabianHofmann?
Then I'd suggest @davide-f tries to download his cutout as well and if that works without issues then we can merge.
@euronion Super! Thank you very much. Currently I am a bit busy with other stuff and unfortunately cannot keep the machine running while waiting a long time for Copernicus to serve the request. As soon as I have free resources, I'll test it. Thank you!
Great. For the logging I would suggest going with e.g. "2013-01" instead of "2013" only. See https://github.com/PyPSA/atlite/blob/3c7b4b8e4fc24f7cad2718abe7ead97d67f16550/atlite/datasets/era5.py#L309 which could be changed into

`timestr = f"{request['year']}-{request['month']}"`

and replaced accordingly in https://github.com/PyPSA/atlite/blob/3c7b4b8e4fc24f7cad2718abe7ead97d67f16550/atlite/datasets/era5.py#L311
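For reference, a variant with valid quoting that also zero-pads the month (so January renders as "2013-01" rather than "2013-1") could look like this. The `request` dict below is a made-up stand-in for whatever atlite actually passes to the CDS API:

```python
# Hypothetical CDS request dict with integer year/month entries
request = {"year": 2013, "month": 1, "variable": ["runoff"]}

# Single quotes inside the double-quoted f-string avoid the syntax error;
# :02d zero-pads the month to two digits
timestr = f"{request['year']}-{request['month']:02d}"
```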
As discussed with @euronion, I'll wait for his latest updates by the end of the week (estimate), and I'll run the model for the entire world.
As a comment, the "number of slices", currently one per month, could be a parameter as well. Anyway, we can keep the current implementation and see if it works for the world, fingers crossed.
@davide-f You're good to give it a try!
Regarding your comment:
I had a look at the code, and if I get the intention behind the comment correctly (optimising the retrieval), then rather than exposing a parameter it might be easier to implement a heuristic which calculates the number of points being retrieved (`np.prod([len(v) for k, v in request.items()])`) and adjusts the slicing automatically such that the request safely stays below the size at which the CDSAPI breaks.
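A minimal sketch of such a heuristic follows. Everything here is an assumption for illustration: `MAX_POINTS` is a made-up threshold (not a documented CDSAPI limit), and request values are assumed to be list-like:

```python
import numpy as np

MAX_POINTS = 120_000  # assumed safe request size, NOT a documented CDS limit

def request_points(request):
    """Estimate the number of points a CDS request retrieves as the
    product of the lengths of its list-valued entries."""
    return int(np.prod([len(v) for v in request.values()
                        if isinstance(v, (list, tuple))]))

def needs_split(request):
    """True if the estimated request size exceeds the assumed safe limit."""
    return request_points(request) > MAX_POINTS

# Example: one variable for a full year of daily fields -> 1 * 12 * 31 * 1 points
req = {"year": ["2013"],
       "month": [f"{m:02d}" for m in range(1, 13)],
       "day": [f"{d:02d}" for d in range(1, 32)],
       "variable": ["runoff"]}
```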
If it works for you @davide-f and the time it takes is acceptable (please report it as well if you can) then I'd stay away from overoptimising this aspect and just keep the monthly retrieval.
@euronion the branch is running :) I'll track it and update you as I have news. Just as a comment, I had to do a few tests that were interrupted; since Copernicus reduces the priority of a user's requests the more that user has been using the service, this may lead to a slight overestimation of the total expected time, though I don't think it is an issue.
I totally agree on seeing whether the monthly retrieval works fine and what its expected time is. I fear it may take a very long time, though. I'll notify you as I have news :)
I confirm that the first 1-month chunk has been downloaded. I'll be waiting for the entire procedure to end and let you know :)
@euronion The procedure for the world (±180° lat/lon) completed successfully in 5 to 12 hours (I ran it twice) and produced an output file of 380 GB (large, but we are speaking of a lot of data); see the settings below.
```yaml
atlite:
  nprocesses: 4
  cutouts:
    # geographical bounds automatically determined from countries input
    world-2013-era5:
      module: era5
      dx: 0.3  # cutout resolution
      dy: 0.3  # cutout resolution
      # The customization options below are dealt with in an automated way depending on
      # the snapshots and the selected countries. See 'build_cutout.py'
      time: ["2013-01-01", "2014-01-01"]  # specify different weather year (~40 years available)
      x: [-180., 180.]  # manually set cutout range
      y: [-180., 180.]  # manually set cutout range
```
As a recommendation, if interested in silencing a warning: the following `PerformanceWarning` was raised:

```
/home/davidef/miniconda3/envs/pypsa-africa/lib/python3.10/site-packages/xarray/core/indexing.py:1228: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]
To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]
```
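Following the warning's own suggestion, the preparation step could be wrapped in the dask option that accepts the large chunks and silences the warning. This is a sketch; the body of the `with` block stands in for whatever call actually triggers the slicing (e.g. the cutout preparation):

```python
import dask

# Accept large chunks to silence xarray's PerformanceWarning;
# replace `pass` with the actual workload, e.g. cutout.prepare()
with dask.config.set(**{"array.slicing.split_large_chunks": False}):
    pass  # cutout.prepare()
```

Setting the option to `True` instead would split the large chunks, trading the warning for potentially many more tasks.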
The output also makes sense; however, it has some weird white bands. I don't think this is related to this PR, though. What do you think?
As discussed, for efficiency purposes it may be interesting to make the number of chunks into which the download is split configurable. Since the monthly approach worked at world scale, we could specify the number of chunks as a number between 1 and 12 and divide the blocks by months, e.g. 4 chunks: months 1-3, 4-6, 7-9 and 10-12. For small amounts of data it may be more efficient to download everything in one go; for Africa or Europe, for example, there is no need to split the data. Yet this is a detail as long as it works.
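The proposed parameter could be sketched like this. `month_chunks` is hypothetical and not part of atlite; it simply groups the twelve months into contiguous blocks:

```python
def month_chunks(n_chunks):
    """Split the 12 months into contiguous groups of roughly equal size,
    for n_chunks between 1 and 12 (e.g. 4 -> [1-3], [4-6], [7-9], [10-12])."""
    size = -(-12 // n_chunks)  # ceiling division
    months = list(range(1, 13))
    return [months[i:i + size] for i in range(0, 12, size)]
```

With `n_chunks=1` the whole year is retrieved in one go (the current annual behaviour); with `n_chunks=12` the retrieval matches this PR's monthly slicing.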
- [ ] Think about heuristic to download in smaller/larger chunks depending on data geographical scope to download
- [x] Add note to documentation on how to compress cutouts
I attempted to compress cutouts during/after creation using the `zlib` integration of `xarray`, but without much success: the compressed cutouts unfortunately always increased in size (rather than decreasing). Using native netCDF tools, compression of cutouts to 30-50% of their size is possible without impacting `atlite` performance. I want to add notes on this to the documentation with this PR, as this allows for larger cutouts. I would have preferred a solution where compression is done by `atlite` directly, but it seems like that does not work well using `xarray`.
Codecov Report
Patch coverage: 91.66% and project coverage change: -0.09% :warning:
Comparison is base (f9bd7fd) 72.83% compared to head (d9f3bff) 72.74%.
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
- Coverage   72.83%   72.74%   -0.09%
==========================================
  Files          19       19
  Lines        1590     1596       +6
  Branches      227      270      +43
==========================================
+ Hits         1158     1161       +3
- Misses        362      363       +1
- Partials       70       72       +2
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| atlite/datasets/era5.py | 88.23% <88.88%> (-1.70%) | :arrow_down: |
| atlite/data.py | 86.36% <100.00%> (+0.31%) | :arrow_up: |
:umbrella: View full report in Codecov by Sentry.
@davide-f If you wish to reduce the file size you can follow the instructions in the updated doc:
https://github.com/PyPSA/atlite/blob/230aa8a5b1b21bff8f03d23631f01e6ebf5d83b3/examples/create_cutout.ipynb
Should save ~50% :)
Month indicator has been added; the info prompt during creation now looks like this, indicating the month currently being retrieved:

```
2022-09-06 14:14:27,779 INFO CDS: Downloading variables
	* runoff (2012-12)
```
I suggest we offload the heuristic into a separate issue and tackle it if necessary. ATM I think it would be a nice but unnecessary feature.
RTR @FabianHofmann would you?
No idea why the CI keeps failing (no issues locally), nor why it is still running the old CI.yaml with Python 3.8 instead of 3.11.