jwst
jwst copied to clipboard
JP-3690: Switch from ModelContainer to ModelLibrary for image3 pipeline
Resolves JP-3690 Resolves JP-3619 Resolves JP-3620 Resolves JP-3621, see JP-3707 for additional work on resample memory usage beyond the scope of this ticket. Resolves JP-3498 Work is under the epic JP-3602.
Fixes #8649 Fixes #8478 Fixes #8479 Fixes #8480 Fixes #8164
This PR is part of a larger effort to improve memory usage throughout the pipeline. Here, the AbstractModelLibrary class in stpipe is subclassed for JWST, and the pipeline steps that form the calwebb_image3 pipeline are updated to use ModelLibrary instead of ModelContainer. This facilitates ensuring that the pipeline step runs the same way whether models are loaded into memory or saved to disk.
When ModelLibrary is used with on_disk=True, memory usage is lower, both for individual pipeline steps and for calwebb_spec3 as a whole. An analysis of memory usage for each step can be found in the linked tickets JP-3619, JP-3620, and JP-3621. An analysis of the memory usage as a whole, when run on a large dataset, can be found in JP-3690.
Checklist for PR authors (skip items if you don't have permissions or they are not applicable)
- [x] added entry in
CHANGES.rstwithin the relevant release section - [x] updated or added relevant tests
- [x] updated relevant documentation
- [x] added relevant milestone
- [x] added relevant label(s)
- [x] ran regression tests, post a link to the Jenkins job below. How to run regression tests on a PR
- [ ] All comments are resolved
- [ ] Make sure the JIRA ticket is resolved properly
Codecov Report
Attention: Patch coverage is 89.09396% with 65 lines in your changes missing coverage. Please review.
Project coverage is 60.77%. Comparing base (
04217a2) to head (364cd60).
Additional details and impacted files
@@ Coverage Diff @@
## master #8683 +/- ##
==========================================
+ Coverage 60.63% 60.77% +0.13%
==========================================
Files 372 373 +1
Lines 38375 38596 +221
==========================================
+ Hits 23270 23458 +188
- Misses 15105 15138 +33
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Initial regtests started here just to find (probably many) failures.
more regtests here for same reason, after lots of changes to memory allocation
Latest round of regression tests here. These unfortunately include 50 failures that are all related to the delivery of new MIRI reference files and other MIRI changes on the INS side. None of the failures from the previous regtest run are still present, so I'm pretty confident this PR is passing all regtests now and is ready for review
I've finished viewing and commenting on all the files in this PR. Overall I think the changes look great and the comments are mostly minor. Thanks for taking on this work! I'll follow up with comments but please let me know if there's more I can do to help to move this PR along.
I ran a 972 member association (~100GB input data) through calwebb_image3 using this PR (an "on disk" library was used as input and a slightly older commit 26e543627d9ed8e28af6475f0f3fe7f8ab7862cf). The pipeline succeeded and the recorded memory usage (using memray) was as follows:
I will add this information and a link to the memory profile to the jira ticket.
The peak memory usage was 50GB and this is largely due to the context array generated during resample. Importantly for this PR, at no point does the pipeline load all input data into memory.
testing just test_nircam_image.py here, because it takes forever on my local machine
Most recent regtests for test_nircam_image only are finally passing after fixing file naming issues. The same fix should apply for the three MIRI image3 failures here. I'm going to wait on another full regtest run until after the next round of comments.
Thanks again for all your work on this. I went through the files and left a new set of comments. It looks very close to done.
New round of regression tests started here after responding to comments from Brett's second review.
edit: file name issues recurred, apparently the small bug setting pool and table names in ModelLibrary was not the cause. Need to put back in the manual setting of model.meta.filename. Starting another round here
Another round of regtests after all the nircam image and miri image failures are fixed is here. Fingers crossed that the only failures are the same ones that show up on the nightly runs
I believe that all the regression tests are unrelated
Why there is still usage for the ModelContainer? Is ModelLibrary not completely replacing ModelContainer? Is it only for backward compatibility to support ModelContainer inputs?
Why there is still usage for the
ModelContainer? IsModelLibrarynot completely replacingModelContainer? Is it only for backward compatibility to supportModelContainerinputs?
We decided to only replace container with library in the calwebb_image3 pipeline for now, where the memory performance gains are most important. There are other complications with replacing this for the spectroscopic modes too, e.g., the SourceModelContainer class that uses ModelContainer as a parent class needs to be replaced also
Another round of regression tests have been started here after fixes per Melanie's review.
Edit: lots of failing tests, but they match last night's nightly run