#1934: Add parameter to control minimal retention of historical LB data
Closes #1934
Pipelines results
PR tests (gcc-12, ubuntu, mpich)
Build for a8ff97d36b85071978643867781882e607239951 (2024-01-30 17:22:21 UTC)
Compilation - successful
Testing - passed
PR tests (clang-3.9, ubuntu, mpich)
Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-10, ubuntu, openmpi, no LB)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-5, ubuntu, mpich)
Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-6, ubuntu, mpich)
Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-9, ubuntu, mpich, zoltan)
Build for 4e93971c59f15f80c77d8b82c997b2c6d470784a (2023-03-28 10:16:37 UTC)
Compilation - successful
Testing - passed
PR tests (clang-5.0, ubuntu, mpich)
Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-7, ubuntu, mpich, trace runtime, LB)
Build for 4e93971c59f15f80c77d8b82c997b2c6d470784a (2023-03-28 10:16:37 UTC)
Compilation - successful
Testing - passed
PR tests (nvidia cuda 11.0, ubuntu, mpich)
Build for dec4a3e60a54b3a4c2e041a04abdfacaf5ba3f29 (2023-04-04 10:47:19 UTC)
Compilation - successful
Testing - passed
PR tests (nvidia cuda 10.1, ubuntu, mpich)
Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)
Compilation - successful
Testing - passed
PR tests (clang-9, ubuntu, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (clang-13, alpine, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-8, ubuntu, mpich, address sanitizer)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (clang-11, ubuntu, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (clang-12, ubuntu, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (clang-13, ubuntu, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (intel icpx, ubuntu, mpich)
Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)
Compilation - successful
Testing - passed
PR tests (clang-14, ubuntu, mpich)
Build for a8ff97d36b85071978643867781882e607239951 (2024-01-30 17:22:21 UTC)
Compilation - successful
Testing - passed
PR tests (clang-10, ubuntu, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-11, ubuntu, mpich, json schema test)
Build for 4e93971c59f15f80c77d8b82c997b2c6d470784a (2023-03-28 10:16:37 UTC)
Compilation - successful
Testing - passed
PR tests (intel icpc, ubuntu, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhibited by limit max-total-size
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size
remark #11074: Inlining inhi%0D%0A%0D%0A%0D%0A ==> And there is more. Read log. <==
PR tests (nvidia cuda 11.2, ubuntu, mpich)
Build for dec4a3e60a54b3a4c2e041a04abdfacaf5ba3f29 (2023-04-04 10:47:19 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
detected during:
instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(221): here
instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]"
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here
/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
/vt/examples/callback/callback.cc(147): here
/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
/vt/examples/callback/callback.cc(153): here
/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
/vt/examples/callback/callback.cc(147): here
/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
/vt/examples/callback/callback.cc(153%0D%0A%0D%0A%0D%0A ==> And there is more. Read log. <==
PR tests (gcc-12, ubuntu, mpich, verbose)
Build for 4be442ca9e84aeb3e33567309fcdac69e886ef44 (2024-06-12 13:05:01 UTC)
Compilation - successful
Testing - passed
PR tests (intel icpx, ubuntu, mpich, verbose)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (clang-14, ubuntu, mpich, verbose)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
TT { std::declval<CC>() }
^
detected during:
instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
TT { std::declval<CC>() }
^
detected during:
instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315
/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
Testing - passed
PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)
Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)
Compilation - successful
Testing - passed
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 84.93%. Comparing base (
7be56f2) to head (d074668). Report is 495 commits behind head on develop.
:exclamation: Current head d074668 differs from pull request most recent head c6d265f
Please upload reports for the commit c6d265f to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## develop #1996 +/- ##
===========================================
- Coverage 85.48% 84.93% -0.55%
===========================================
Files 722 723 +1
Lines 25907 25662 -245
===========================================
- Hits 22146 21796 -350
- Misses 3761 3866 +105
| Files | Coverage Δ | |
|---|---|---|
| src/vt/configs/arguments/app_config.h | 100.00% <ø> (ø) |
|
| src/vt/configs/arguments/args.cc | 94.57% <100.00%> (ø) |
|
| .../vt/vrt/collection/balance/lb_invoke/lb_manager.cc | 80.96% <100.00%> (+0.05%) |
:arrow_up: |
| ...c/vt/vrt/collection/balance/lb_invoke/lb_manager.h | 100.00% <100.00%> (ø) |
|
| src/vt/vrt/collection/balance/node_lb_data.cc | 84.45% <100.00%> (+0.10%) |
:arrow_up: |
| src/vt/vrt/collection/balance/node_lb_data.h | 100.00% <100.00%> (ø) |
|
| tests/unit/collection/test_lb_data_retention.cc | 100.00% <100.00%> (ø) |
Other than the minor interface thing I noted, I'd approve this.
@PhilMiller Sorry for the delay, I missed your message.
From what I understand the scenario that you described is the only one which can have some weird behaviors. LB data is being cleaned after each phase so that will occur rather frequently. So if there is more data than model needs then it will not be retained for long.
The only way to retain that data even if it is not needed is to use the configuration LBType::NoLB, in this case LB will not be run so also the data will not be removed.
The reason I see the potential for retention is that the cleaning removes data from a particular past phase at a fixed offset from the current phase. if the model's requested retention falls, then there will be a range of phases whose data gets skipped over.
@PhilMiller I added the test cases which you mentioned but I wanted to ask you about the checks which should be in those cases.
Currently I'm checking the amount of phases LB data in theNodeLBData() and also amount of data in TestCol.
The first one is working fine, the data is removed as expected.
The second check is failing, the amount of LB data in the TestCol is not changed. Should that be the case? Or I should look for some way to remove that data also, maybe by adding additional code in theCollection() for example?
What are the steps to change the Load Model? Currently I'm only doing the theLBManager()->setLoadModel(model_1_phase);. Is that an only required step or I missing something here?
Calling setLoadModel should be all that's necessary.
Is cleanup not happening at all for TestCol, or is it just a phase behind or something like that? If the cleanup would happen with some finite delay or some finite overage relative to theNodeLBData(), I'd be fine with loosening the test to reflect that.
It may be acceptable (for now, at least) if TestCol were to always retain at most (one more than) the maximum number of retained steps ever requested. The case we really want to avoid is unbounded data growth. As long as you can show that with the logic and tests, I'll support moving this functionality forward.
There's maybe an edge case of "extra data retained per call to setLoadModel()" if an application were to do something pathological, like alternately set models that wanted 10 and 20 steps of data retained repeatedly. Maybe we can accept something like that for now, with just an open issue to address it later.
That all said, if the data is/were retained in an ordered map, then perhaps instead of looping over keys we think need to be deleted, we could take a somewhat different approach to cleanup of (roughly)
for (auto i = data.begin(); *i < current_phase - retention; /* nothing */)
i = data.erase(i);
That is, just delete anything the map may have that's older than what we're interested in.
Calling
setLoadModelshould be all that's necessary.Is cleanup not happening at all for
TestCol, or is it just a phase behind or something like that? If the cleanup would happen with some finite delay or some finite overage relative totheNodeLBData(), I'd be fine with loosening the test to reflect that.It may be acceptable (for now, at least) if
TestColwere to always retain at most (one more than) the maximum number of retained steps ever requested. The case we really want to avoid is unbounded data growth. As long as you can show that with the logic and tests, I'll support moving this functionality forward.There's maybe an edge case of "extra data retained per call to
setLoadModel()" if an application were to do something pathological, like alternately set models that wanted 10 and 20 steps of data retained repeatedly. Maybe we can accept something like that for now, with just an open issue to address it later.That all said, if the data is/were retained in an ordered
map, then perhaps instead of looping over keys we think need to be deleted, we could take a somewhat different approach to cleanup of (roughly)for (auto i = data.begin(); *i < current_phase - retention; /* nothing */) i = data.erase(i);That is, just delete anything the map may have that's older than what we're interested in.
Cleanup is not happening for TestCol, it will retain the maximum number of phases ever requested by the model.
Cache data in node is stored in unordered_map so I kept the previous implementation, I cleaned it a little bit and move it to node_lb_data.
@PhilMiller I updated the implementation to use lower_bound. This PR can be reviewed again.
@nlslatt I updated the tests to better reflect how many and which phases are being kept.
Test test_lbdata_retention_last2 - has vt_lb_data_retention lower than what model needs.
Test test_lbdata_config_retention_higher - has vt_lb_data_retention higher than what model needs.
@nlslatt, @lifflander - I rebased this PR and fixed all conflicts. PR is ready for review.
Things to do:
- [x] Prepare benchmark for existing implementation
- [x] Prepare different implementation based on
std::vectorand prepare benchmark for it.