
Enable download of large (spatial extent) cutouts from ERA5 via cdsapi.

euronion opened this pull request 2 years ago • 17 comments

Closes #221 .

Change proposed in this Pull Request

Split download of ERA5 into monthly downloads (currently: annual downloads) to prevent too-large downloads from ERA5 CDSAPI.
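
For illustration only (this is not the code added by the PR), the split boils down to issuing one CDS request per month instead of one per year; the year/month keys below follow the usual cdsapi convention:

import pandas as pd

# Hypothetical sketch: turn one yearly retrieval period into twelve monthly slices.
months = pd.period_range("2013-01", "2013-12", freq="M")
monthly_requests = [
    {"year": str(m.year), "month": f"{m.month:02d}"}  # one request per month
    for m in months
]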

TODO

  • [x] Add month indicator to progress prompts.

Description

Motivation and Context

See #221 .

How Has This Been Tested?

Locally by downloading a large cutout.

Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [n/a] Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • [x] I tested my contribution locally and it seems to work fine.
  • [x] I locally ran pytest inside the repository and no unexpected problems came up.
  • [x] I have adjusted the docstrings in the code appropriately.
  • [x] I have documented the effects of my code changes in the documentation doc/.
  • [n/a] I have added newly introduced dependencies to environment.yaml file.
  • [x] I have added a note to release notes doc/release_notes.rst.
  • [x] I have used pre-commit run --all-files to lint/format/check my contribution

euronion avatar May 16 '22 12:05 euronion

How does that interact with queuing at CDSAPI? Does that increase the chances of getting stuck in the request in month 9 or so?

fneum avatar May 17 '22 20:05 fneum

I don't know.

The downloads for the larger cutouts worked relatively smoothly (1-2 hours), but the number of requests is 12x higher for a normal year, so the chances might be higher. On the other hand, since the downloaded slices are smaller I would not expect major performance changes. Probably acceptable, since you're not downloading cutouts on an everyday basis.

I don't know enough about the internals of the ERA5 climate store and I don't think we should optimise our retrieval routines for it as long as we haven't received any complaints for bad performance.

euronion avatar May 17 '22 20:05 euronion

Alright. I did not encounter any issues downloading large datasets. Seems to work nicely @FabianHofmann .

What would be helpful is a message indicating which month/year combination is currently being downloaded. Do you have an idea how to easily implement this, @FabianHofmann?

Then I'd suggest @davide-f tries to download his cutout as well and if that works without issues then we can merge.

euronion avatar May 31 '22 12:05 euronion

@euronion Super! Thank you very much. Currently I am a bit busy with other things and unfortunately cannot keep the machine running while waiting a long time for the Copernicus analysis. As soon as I have free resources, I'll test it. Thank you!

davide-f avatar May 31 '22 14:05 davide-f

Great. For the logging I would suggest going with e.g. "2013-01" instead of "2013" only. See https://github.com/PyPSA/atlite/blob/3c7b4b8e4fc24f7cad2718abe7ead97d67f16550/atlite/datasets/era5.py#L309 which could be changed into

timestr = f"{request['year']}-{request['month']}"

and replaced accordingly in https://github.com/PyPSA/atlite/blob/3c7b4b8e4fc24f7cad2718abe7ead97d67f16550/atlite/datasets/era5.py#L311
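
Put together, a rough sketch of the two changed lines; the surrounding names (logger, variable) are assumed from context, not quoted from era5.py:

# Sketch only; the exact surrounding code in era5.py is assumed.
timestr = f"{request['year']}-{request['month']}"
logger.info(f"CDS: Downloading variables\n\t * {variable} ({timestr})")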

FabianHofmann avatar May 31 '22 20:05 FabianHofmann

As discussed with @euronion, I'll wait for his latest updates by the end of the week (estimate), and I'll run the model for the entire world.

As a comment, the "number of slices", currently one per month, could also be made a parameter. Anyway, we could keep the current implementation and see if it works for the world, fingers crossed.

davide-f avatar Jun 15 '22 22:06 davide-f

@davide-f You're good to give it a try!

Regarding your comment: I had a look at the code and, if I understand the intention correctly (optimising the retrieval), it might be easier to implement a heuristic which calculates the number of points being retrieved (np.prod([len(v) for k,v in request.items()])) and automatically adjusts the chunking so that the request stays safely below the size at which the CDS API breaks, rather than exposing a parameter for it (a rough sketch is given at the end of this comment).

If it works for you @davide-f and the time it takes is acceptable (please report it as well if you can) then I'd stay away from overoptimising this aspect and just keep the monthly retrieval.
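
For the record, a minimal sketch of such a heuristic, assuming a placeholder size limit and list-valued request entries (neither is an official CDS API value):

import numpy as np

# Hypothetical heuristic: estimate the number of points in a yearly request and
# halve the number of months per chunk until the per-chunk size is below an
# assumed safe limit. The limit is a placeholder, not an official CDS API value.
ASSUMED_SAFE_LIMIT = 120_000

def request_size(request):
    # Points implied by all list-valued entries (e.g. time, latitude, longitude).
    return np.prod([len(v) for v in request.values() if isinstance(v, list)])

def months_per_chunk(yearly_request):
    n = 12
    while n > 1 and request_size(yearly_request) * n / 12 > ASSUMED_SAFE_LIMIT:
        n //= 2
    return n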

euronion avatar Jun 17 '22 07:06 euronion

@euronion the branch is running :) I'll track it and update you as I have news. Just as a comment, I had to do a few tests that were interrupted; since Copernicus reduces the priority of a user's requests the more that user uses the service, this may lead to a slight overestimation of the total expected time, though I don't think it is an issue.

I totally agree on checking whether the monthly retrieval works fine and what its expected runtime is. I fear that it may take a very long time, though. I'll notify you as soon as I have news :)

davide-f avatar Jun 18 '22 06:06 davide-f

I confirm that the first 1-month chunk has been downloaded. I'll be waiting for the entire procedure to end and let you know :)

davide-f avatar Jun 18 '22 09:06 davide-f

@euronion The procedure for the world (±180° lat/lon) completed successfully in 5 to 12 hours (I ran it twice) and produced an output file of 380 GB (large, but we are speaking of a lot of data); see the settings below.

atlite:
  nprocesses: 4
  cutouts:
    # geographical bounds automatically determined from countries input
    world-2013-era5:
      module: era5
      dx: 0.3  # cutout resolution
      dy: 0.3  # cutout resolution
      # Below customization options are dealt in an automated way depending on
      # the snapshots and the selected countries. See 'build_cutout.py'
      time: ["2013-01-01", "2014-01-01"]  # specify different weather year (~40 years available)
      x: [-180., 180.]  # manual set cutout range
      y: [-180., 180.]    # manual set cutout range
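
For reference, a roughly equivalent direct atlite call could look like the sketch below; the latitude range is clamped to the valid -90 to 90 span, and the call is illustrative rather than a copy of the pypsa-africa workflow:

import atlite

cutout = atlite.Cutout(
    "world-2013-era5.nc",
    module="era5",
    x=slice(-180.0, 180.0),
    y=slice(-90.0, 90.0),  # ERA5 latitude only spans -90 to 90
    dx=0.3,
    dy=0.3,
    time=slice("2013-01-01", "2014-01-01"),
)
cutout.prepare()  # triggers the (now monthly) CDS API downloads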

As a side note, in case you want to silence some warnings, the following PerformanceWarning was raised:

/home/davidef/miniconda3/envs/pypsa-africa/lib/python3.10/site-packages/xarray/core/indexing.py:1228: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]
To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]
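
If you want to silence it explicitly, the dask option from the warning can be set around the offending operation; the cutout.prepare() call below is only a placeholder for whatever triggers the slicing:

import dask

# Accept the large chunks and silence the PerformanceWarning for this block only.
with dask.config.set(**{"array.slicing.split_large_chunks": False}):
    cutout.prepare()  # placeholder for the operation producing the warning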

The output also makes sense; however, it has some weird white bands, though I don't think this is related to this PR. What do you think? (see the attached output image)

davide-f avatar Jun 20 '22 21:06 davide-f

As discussed, for efficiency purposes it may be interesting to let the user decide the number of chunks into which the download is divided. Since it worked at world scale, we could specify the number of chunks as a number between 1 and 12 and divide the blocks by months, e.g. 4 chunks: months 1-3, 4-6, 7-9 and 10-12 (see the sketch below). For small downloads it may be more efficient to fetch everything in one go; for Africa or Europe, for example, there is no need to split the data. Yet this is a detail, as long as it works.
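
A possible sketch of such a parameter; the name n_chunks and the grouping logic are made up for illustration:

import numpy as np

def month_groups(n_chunks):
    # Split the 12 months into n_chunks roughly equal, consecutive groups.
    return [group.tolist() for group in np.array_split(np.arange(1, 13), n_chunks)]

# month_groups(4)  -> [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# month_groups(1)  -> one group covering the whole year
# month_groups(12) -> the current monthly behaviour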

davide-f avatar Jun 20 '22 21:06 davide-f

  • [ ] Think about heuristic to download in smaller/larger chunks depending on data geographical scope to download
  • [x] Add note to documentation on how to compress cutouts

I attempted to compress cutouts during/after creation, but without much success. Using the zlib integration of xarray, the compressed cutouts unfortunately always increased in size (rather than decreasing). Using native netCDF tools, compression of cutouts to 30-50% of their size is possible without impact on atlite performance. I want to add notes on this to the documentation with this PR, as this allows for larger cutouts.

I would have preferred a solution where compression is done by atlite directly, but it seems like that does not work well using xarray.
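
For reference, a sketch of the kind of post-processing step meant here, calling the netCDF nccopy tool (deflate level 4 plus shuffling) from Python; the file names are placeholders and nccopy has to be installed separately:

import subprocess

# Illustrative only: compress an existing cutout with the external "nccopy" tool.
subprocess.run(
    ["nccopy", "-d4", "-s", "world-2013-era5.nc", "world-2013-era5-compressed.nc"],
    check=True,
)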

euronion avatar Jul 15 '22 11:07 euronion

Codecov Report

Patch coverage: 91.66% and project coverage change: -0.09 :warning:

Comparison is base (f9bd7fd) 72.83% compared to head (d9f3bff) 72.74%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
- Coverage   72.83%   72.74%   -0.09%     
==========================================
  Files          19       19              
  Lines        1590     1596       +6     
  Branches      227      270      +43     
==========================================
+ Hits         1158     1161       +3     
- Misses        362      363       +1     
- Partials       70       72       +2     
Impacted Files Coverage Δ
atlite/datasets/era5.py 88.23% <88.88%> (-1.70%) :arrow_down:
atlite/data.py 86.36% <100.00%> (+0.31%) :arrow_up:


codecov-commenter avatar Sep 06 '22 10:09 codecov-commenter

@davide-f If you wish to reduce the file size you can follow the instructions in the updated doc:

https://github.com/PyPSA/atlite/blob/230aa8a5b1b21bff8f03d23631f01e6ebf5d83b3/examples/create_cutout.ipynb

Should save ~50% :)

euronion avatar Sep 06 '22 11:09 euronion

The month indicator has been added, e.g. the info prompt during creation now looks like this, indicating the month currently being retrieved:

2022-09-06 14:14:27,779 INFO CDS: Downloading variables
         * runoff (2012-12)

euronion avatar Sep 06 '22 12:09 euronion

I suggest we offload the heuristic into a separate issue and tackle it if necessary. ATM I think it would be a nice but unnecessary feature.

euronion avatar Sep 06 '22 12:09 euronion

RTR @FabianHofmann would you?

euronion avatar Sep 06 '22 12:09 euronion

No idea why the CI keeps failing (no issues locally) and why it is still running the old CI.yaml with Python 3.8 instead of 3.11.

euronion avatar Apr 04 '23 09:04 euronion