Improve performance of IamDataFrame initialization (phase 2)

Open danielhuppmann opened this issue 2 years ago • 4 comments

Please confirm that this PR has done the following:

  • ~Tests Added~
  • ~Documentation Added~
  • ~Name of contributors Added to AUTHORS.rst~
  • [x] Description in RELEASE_NOTES.md Added

Description of PR

This PR is a follow-up to #579, implementing further performance improvements.

danielhuppmann avatar Sep 15 '21 17:09 danielhuppmann

Codecov Report

Merging #580 (179e813) into main (151e330) will increase coverage by 0.0%. The diff coverage is 94.4%.

@@          Coverage Diff          @@
##            main    #580   +/-   ##
=====================================
  Coverage   93.7%   93.7%           
=====================================
  Files         50      50           
  Lines       5339    5348    +9     
=====================================
+ Hits        5004    5013    +9     
  Misses       335     335           
| Impacted Files | Coverage | Δ |
|---|---|---|
| pyam/plotting.py | 92.9% <50.0%> | (+<0.1%) :arrow_up: |
| pyam/utils.py | 91.8% <95.4%> | (+<0.1%) :arrow_up: |
| pyam/core.py | 94.3% <100.0%> | (ø) |
| pyam/index.py | 98.0% <100.0%> | (ø) |
| pyam/logging.py | 64.8% <100.0%> | (+5.4%) :arrow_up: |
| pyam/time.py | 96.0% <100.0%> | (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 151e330...179e813.

codecov[bot] avatar Sep 15 '21 17:09 codecov[bot]

Summarizing bilateral discussions with (and manual benchmarking done by) @phackstock: this PR again shows some improvement in memory usage. One interesting observation is that the initial commit here, https://github.com/danielhuppmann/pyam/commit/d652297fc39c9176be661bb508d97d42d02b795b, which uses df.set_index(.., append=True), performs much worse than either the previous implementation or the "manual" adding-to-index using the pyam.index module...
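For readers unfamiliar with the two approaches compared above, here is a minimal sketch using plain pandas. The toy data and the `append_level` helper are hypothetical and only illustrate the idea; this is not the code from the commit or from the pyam.index module, and it says nothing about which variant is faster.

```python
import pandas as pd

# Toy long-format data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "model": ["m1", "m1", "m2"],
        "scenario": ["s1", "s1", "s1"],
        "year": [2010, 2020, 2010],
        "value": [1.0, 2.0, 3.0],
    }
).set_index(["model", "scenario"])

# Option A: let pandas move a column into the index as an extra level.
via_set_index = df.set_index("year", append=True)


# Option B: build the extended MultiIndex explicitly from the existing levels.
def append_level(index, values, name):
    """Hypothetical helper: return `index` with one extra level holding `values`."""
    arrays = [index.get_level_values(i) for i in range(index.nlevels)]
    arrays.append(pd.Index(values, name=name))
    return pd.MultiIndex.from_arrays(arrays, names=[*index.names, name])


manual = df.drop(columns="year")
manual.index = append_level(df.index, df["year"].to_numpy(), "year")
```

Both variants end up with the same three-level index, so any difference between them would be in time and memory, which has to be measured on realistic data.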

danielhuppmann avatar Sep 16 '21 12:09 danielhuppmann

Running benchmarks with pytest-monitor and memory-profiler on the IAMC 1.5°C scenario ensemble data for all regions (~80MB, xlsx) shows that this PR increases runtime by ~20% but reduces memory use by 30%... Not quite sure if that is a worthwhile trade-off, or if it can be improved...
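As a rough illustration of how such a time-versus-memory comparison can be made, the sketch below uses only the standard library (`time.perf_counter` and `tracemalloc`) rather than pytest-monitor and memory-profiler, which were used for the numbers above. The `measure` helper and the `build_list` workload are hypothetical stand-ins; in practice the workload would be the IamDataFrame initialization.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_bytes) for a single call of `func`."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes since start()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def build_list(n):
    # Stand-in workload (assumption): replace with the code under test.
    return [i * 2 for i in range(n)]


result, elapsed, peak = measure(build_list, 100_000)
print(f"{elapsed:.4f} s, peak {peak / 1e6:.1f} MB")
```

Note that `tracemalloc` only sees allocations it is able to trace and adds overhead of its own, so the absolute numbers will not match those of a process-level profiler; the sketch is meant for relative comparisons between two implementations on the same input.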

danielhuppmann avatar Sep 29 '21 10:09 danielhuppmann

Regarding your question, @danielhuppmann, my vote would be in favor of saving memory, even if the price for that is a longer execution time. My reasoning is that a longer execution time only means more waiting for the user, while memory savings can decide whether a user is able to open a data set at all. Ideally, if you're working in a Jupyter notebook, you read the data only once and keep it in memory anyway, so I don't think an increase in execution time is that big of a deal.

phackstock avatar Oct 06 '21 09:10 phackstock

Closing in favor of #729 and #730.

danielhuppmann avatar Mar 06 '23 06:03 danielhuppmann