Add read_netcdf function
Please confirm that this PR has done the following:
- [x] Tests Added
- [ ] Documentation Added
- [ ] Name of contributors Added to AUTHORS.rst
- [ ] Description in RELEASE_NOTES.md Added
Adding to RELEASE_NOTES.md (remove section after adding to RELEASE_NOTES.md)
Please add a single line in the release notes similar to the following:
- (#XX)[http://link-to-pr.com] Add read_netcdf() function
Description of PR
This PR adds a read_netcdf() function to read netcdf data (.nc). Note that this function currently does not support sub-annual data and meta indicators have the exact indices from META_IDX (currently ‘model’ and ‘scenario’).
Thanks a lot for your comments! To answer your question about aggregated Calliope netCDF, the current output from Calliope (“standard” netCDF) has five dimensions: nodes, techs, carriers, timesteps, and costs. They contain variables according to the model’s specifications, e.g., flow cap, flow in, flow out, flow export, etc.
Aggregated Calliope netCDF (the input data for pyam.read_netcdf) are aggregated or disintegrated from standard Calliope netCDF files into variables compatible with the IAMC standard, such as Primary Energy | Coal. They have four dimensions: model, scenario, region, and time. Unit is an attribute of each variable. In the future, Calliope will be updated to also produce aggregated netCDF as an optional output.
I will be on vacation for the next two weeks and will address any further comments from you when I'm back. Best regards, Linh
Sorry for the delay in the review, due to your and my vacation and then too many other ongoing activities...
Pinging @coroa @Rlamboll @znicholls, @LinhHo started implementation of a read_netcdf method in pyam. Could you take a look at whether the nc file that she is using is structured in a way that is useful for the IAMC-format domain (including datetime support)?
There's lots of ways to do something like this with netCDF (similar to CSV, lots of options). The trick is more about just picking something that works.
If all you're doing is reading data, then it doesn't really matter. I would probably give the method a more specifc name. E.g. read_netcdf_caliope, because you don't have an equivalent write_netcdf function that makes it more obvious what format is expected.
There are probably some performance things you could improve in the PR, but if you test that it can read data with time columns and test that it can read data with year columns, you're probably good to go and you can fix performance later.
If you want to also support reading more arbitrary netCDF and writing netCDF, read this
The first thing to understand is that there isn't a standard netCDF format. People do crazy things, so supporting general netCDF isn't really a thing. If you have a write_netcdf method though, it's easier to be like, "We support reading netCDF written with write_netcdf or equivalents thereof.".
The thing we found difficult in scmdata was supporting arbitrary metadata columns when writing. That makes your life much trickier. If you don't care about capturing the meta table, that would make things simpler.
The other thing that can be tricky is the sparsity of data. If you want to store lots of model-scenario combinations, but not all models have run all scenarios, depending on how you format the data, you can end up with a lot of nans which can be annoying for memory or data storage/handling.
If you want the full details, our docs go into this sparsity issue here: https://scmdata.readthedocs.io/en/latest/notebooks/netcdf.html
In terms of how we implemented it, handling all our use cases was tricky (see https://github.com/openscm/scmdata/blob/main/src/scmdata/netcdf.py and https://github.com/openscm/scmdata/blob/main/src/scmdata/_xarray.py). You could do something much simpler if you make more restrictions than we did.
Thanks for the comments! Indeed, there is no standard netCDF format. This function aims to read a netCDF format that is compatible with the IAMC convention and can be read easily into pyam.IamDataFrame format. I wrote a description of how the netCDF input file should be structured for the function read_netcdf to work properly, please see this Jupyter notebook. Since anyone can prepare their data in this netCDF format to use read_netcdf, and there is nothing in it specifically for Calliope, I would say read_netcdf_calliope is not necessary.
@danielhuppmann please let me know if putting the Jupyter notebook in the docs/tutorials is appropriate. I was not sure where to put the documentation. Thank you!
Awesome, thanks @LinhHo !!
Looks like you found a very inconvenient corner case...
https://docs.xarray.dev/en/stable/user-guide/io.html
I suggest the following: replace all 'None's in string-columns in meta by "NaN" when writing to netcdf, and then replace them again when reading from file.
@LinhHo, I think with the quickfix I proposed above (replacing "nan"-strings by np.nan and a little bit of clean-up (like adding a readme, removing the print statement), this should be good to go, right?
@danielhuppmann Yes, I think so. Thanks a lot for all the comments! I have just updated the release note and cleaned up the code. Could you have an overall look again if we can finalise this PR, please? By readme did you mean a documentation like this Jupyter notebook? https://github.com/LinhHo/pyam/blob/netcdf/docs/tutorials/read_netcdf.ipynb
readme did you mean a documentation like this Jupyter notebook?
Sorry, I meant release notes. Which you have already added but in the wrong place.
Tests are failing because the lock file isn't up to date and the release-note ist still in the wrong place - I made https://github.com/LinhHo/pyam/pull/1 to fix this for you. Please review and approach that PR, then tests for this PR should pass.
I took the liberty of pushing an update of the lock file directly to your branch
Looks like tests are failing because matplotlib published a new release, so the plots generated in the tests do not exactly match the expected figures. I will take care of that next week.
Codecov Report
Attention: Patch coverage is 81.81818% with 8 lines in your changes missing coverage. Please review.
Project coverage is 95.0%. Comparing base (
ddbb88e) to head (6b73a54). Report is 46 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pyam/netcdf.py | 79.4% | 8 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #879 +/- ##
======================================
Coverage 95.0% 95.0%
======================================
Files 64 65 +1
Lines 6134 6278 +144
======================================
+ Hits 5828 5965 +137
- Misses 306 313 +7
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
🚀 New features to boost your workflow:
- ❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
I realized that if I fix the matplotlib expected-plots in a separate branch, that would again create merge conflicts on poetry.lock. So I added the expected plots to this branch.