Improve performance of IamDataFrame initialization (phase 2)

Open danielhuppmann opened this issue 2 years ago • 4 comments

Please confirm that this PR has done the following:

~Tests Added~
~Documentation Added~
~Name of contributors Added to AUTHORS.rst~
[x] Description in RELEASE_NOTES.md Added

Description of PR

This PR is a follow-up to #579, implementing further performance improvements.

Sep 15 '21 17:09 danielhuppmann

Codecov Report

Merging #580 (179e813) into main (151e330) will increase coverage by 0.0%. The diff coverage is 94.4%.

@@          Coverage Diff          @@
##            main    #580   +/-   ##
=====================================
  Coverage   93.7%   93.7%           
=====================================
  Files         50      50           
  Lines       5339    5348    +9     
=====================================
+ Hits        5004    5013    +9     
  Misses       335     335

| Impacted Files | Coverage Δ | |
|---|---|---|
| pyam/plotting.py | `92.9% <50.0%> (+<0.1%)` | :arrow_up: |
| pyam/utils.py | `91.8% <95.4%> (+<0.1%)` | :arrow_up: |
| pyam/core.py | `94.3% <100.0%> (ø)` | |
| pyam/index.py | `98.0% <100.0%> (ø)` | |
| pyam/logging.py | `64.8% <100.0%> (+5.4%)` | :arrow_up: |
| pyam/time.py | `96.0% <100.0%> (ø)` | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 151e330...179e813.

Sep 15 '21 17:09 codecov[bot]

Summarizing bilateral discussions with @phackstock (who also did manual benchmarking): this PR again shows some improvement in memory usage. One interesting observation is that the initial commit here, https://github.com/danielhuppmann/pyam/commit/d652297fc39c9176be661bb508d97d42d02b795b, which uses df.set_index(.., append=True), performs much worse than either the previous implementation or the "manual" adding-to-index approach using the pyam.index module...
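
For readers unfamiliar with the two approaches compared above, here is a minimal pandas-only sketch. It is not the code from this PR or from the pyam.index module: the toy data and the helper name `append_level` are hypothetical, and it only illustrates the general difference between `set_index(..., append=True)` and extending an existing `MultiIndex` by hand.

```python
import pandas as pd

# Toy data (assumed for illustration), loosely modelled on IAMC-style columns.
df = pd.DataFrame(
    {
        "model": ["m1", "m1", "m2"],
        "scenario": ["s1", "s2", "s1"],
        "unit": ["EJ/yr", "Mt CO2/yr", "EJ/yr"],
        "value": [1.0, 2.0, 3.0],
    }
).set_index(["model", "scenario"])

# Option A: let pandas move the column into the index as a new level.
appended = df.set_index("unit", append=True)


def append_level(index, values, name):
    """Hypothetical helper: add one level to an existing MultiIndex.

    The existing levels and codes are reused as-is; only the new
    column is factorized into integer codes and unique values.
    """
    codes, uniques = pd.factorize(values)
    return pd.MultiIndex(
        levels=list(index.levels) + [uniques],
        codes=list(index.codes) + [codes],
        names=list(index.names) + [name],
    )


# Option B: build the extended index manually and assign it.
manual = df.drop(columns="unit")
manual.index = append_level(df.index, df["unit"], "unit")
```

Both options yield an index with the same entries. Whether one is faster or lighter on memory depends on the data and the pandas version, so it needs to be measured rather than assumed.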

Sep 16 '21 12:09 danielhuppmann

Benchmarking with pytest-monitor and memory-profiler on the IAMC 1.5°C scenario ensemble data for all regions (~80MB, xlsx) shows that this PR increases runtime by ~20% but reduces memory use by 30%... Not quite sure whether that is a worthwhile trade-off, or whether it can be improved...
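
A rough way to reproduce this kind of time-versus-memory comparison without extra dependencies is sketched below. It deliberately swaps in the standard-library `time.perf_counter` and `tracemalloc` instead of pytest-monitor and memory-profiler, so the numbers are not comparable to the ones quoted above. The helper name `measure` and the toy workload are assumptions made for illustration.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Hypothetical helper: run func once and report time and peak memory.

    Returns (result, elapsed_seconds, peak_bytes). The peak covers only
    allocations that tracemalloc can see while func is running.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    return result, elapsed, peak


def build_list(n):
    # Toy workload standing in for an expensive initialization.
    return list(range(n))


result, elapsed, peak = measure(build_list, 100_000)
print(f"elapsed: {elapsed:.4f} s, peak traced memory: {peak / 1e6:.2f} MB")
```

Tracing allocations slows the code down, so for a fair runtime figure it is probably better to time a separate run with tracing switched off.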

Sep 29 '21 10:09 danielhuppmann

Regarding your question, @danielhuppmann: my vote would be in favor of saving memory, even if the price is a longer execution time. My reasoning is that a longer execution time only means more waiting for the user, while memory savings can decide whether a user is able to open a data set at all. Ideally, if you're working in a Jupyter notebook, you read the data only once and keep it in memory anyway, so I don't think an increase in execution time is that big of a deal.

Oct 06 '21 09:10 phackstock

Closing in favor of #729 and #730.

Mar 06 '23 06:03 danielhuppmann
