vt #1934: Add parameter to control minimal retention of historical LB data

Closes #1934

Oct 17 '22 12:10 thearusable

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for a8ff97d36b85071978643867781882e607239951 (2024-01-30 17:22:21 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-3.9, ubuntu, mpich)

Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-5, ubuntu, mpich)

Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-6, ubuntu, mpich)

Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-9, ubuntu, mpich, zoltan)

Build for 4e93971c59f15f80c77d8b82c997b2c6d470784a (2023-03-28 10:16:37 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-5.0, ubuntu, mpich)

Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-7, ubuntu, mpich, trace runtime, LB)

Build for 4e93971c59f15f80c77d8b82c997b2c6d470784a (2023-03-28 10:16:37 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (nvidia cuda 11.0, ubuntu, mpich)

Build for dec4a3e60a54b3a4c2e041a04abdfacaf5ba3f29 (2023-04-04 10:47:19 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (nvidia cuda 10.1, ubuntu, mpich)

Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-9, ubuntu, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-13, alpine, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-11, ubuntu, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-12, ubuntu, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-13, ubuntu, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (intel icpx, ubuntu, mpich)

Build for 0b5ee2ed09bb3dcc0b0f0b617f0bbfcddb8623b1 (2022-11-11 13:45:39 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-14, ubuntu, mpich)

Build for a8ff97d36b85071978643867781882e607239951 (2024-01-30 17:22:21 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-10, ubuntu, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-11, ubuntu, mpich, json schema test)

Build for 4e93971c59f15f80c77d8b82c997b2c6d470784a (2023-03-28 10:16:37 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (intel icpc, ubuntu, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhi%0D%0A%0D%0A%0D%0A ==> And there is more. Read log. <==

Build log

PR tests (nvidia cuda 11.2, ubuntu, mpich)

Build for dec4a3e60a54b3a4c2e041a04abdfacaf5ba3f29 (2023-04-04 10:47:19 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(221): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153%0D%0A%0D%0A%0D%0A ==> And there is more. Read log. <==

Build log

PR tests (gcc-12, ubuntu, mpich, verbose)

Build for 4be442ca9e84aeb3e33567309fcdac69e886ef44 (2024-06-12 13:05:01 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (intel icpx, ubuntu, mpich, verbose)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (clang-14, ubuntu, mpich, verbose)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
    auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
               ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

Testing - passed

Build log

PR tests (gcc-12, ubuntu, mpich, verbose, kokkos)

Build for 0b2489f731f4bb250d62e4ea96e167f683f88dbf (2024-09-26 16:27:09 UTC)

Compilation - successful

Testing - passed

Build log

Oct 17 '22 13:10 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.93%. Comparing base (7be56f2) to head (d074668). Report is 495 commits behind head on develop.

:exclamation: Current head d074668 differs from pull request most recent head c6d265f

Please upload reports for the commit c6d265f to get more accurate results.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1996      +/-   ##
===========================================
- Coverage    85.48%   84.93%   -0.55%     
===========================================
  Files          722      723       +1     
  Lines        25907    25662     -245     
===========================================
- Hits         22146    21796     -350     
- Misses        3761     3866     +105

Files	Coverage Δ
src/vt/configs/arguments/app_config.h	`100.00% <ø> (ø)`
src/vt/configs/arguments/args.cc	`94.57% <100.00%> (ø)`
.../vt/vrt/collection/balance/lb_invoke/lb_manager.cc	`80.96% <100.00%> (+0.05%)`	:arrow_up:
...c/vt/vrt/collection/balance/lb_invoke/lb_manager.h	`100.00% <100.00%> (ø)`
src/vt/vrt/collection/balance/node_lb_data.cc	`84.45% <100.00%> (+0.10%)`	:arrow_up:
src/vt/vrt/collection/balance/node_lb_data.h	`100.00% <100.00%> (ø)`
tests/unit/collection/test_lb_data_retention.cc	`100.00% <100.00%> (ø)`

... and 271 files with indirect coverage changes

Oct 25 '22 15:10 codecov[bot]

Other than the minor interface thing I noted, I'd approve this.

Oct 26 '22 23:10 PhilMiller

@PhilMiller Sorry for the delay, I missed your message.

From what I understand the scenario that you described is the only one which can have some weird behaviors. LB data is being cleaned after each phase so that will occur rather frequently. So if there is more data than model needs then it will not be retained for long. The only way to retain that data even if it is not needed is to use the configuration LBType::NoLB, in this case LB will not be run so also the data will not be removed.

Nov 08 '22 12:11 thearusable

The reason I see the potential for retention is that the cleaning removes data from a particular past phase at a fixed offset from the current phase. if the model's requested retention falls, then there will be a range of phases whose data gets skipped over.

Nov 11 '22 18:11 PhilMiller

@PhilMiller I added the test cases which you mentioned but I wanted to ask you about the checks which should be in those cases. Currently I'm checking the amount of phases LB data in theNodeLBData() and also amount of data in TestCol. The first one is working fine, the data is removed as expected. The second check is failing, the amount of LB data in the TestCol is not changed. Should that be the case? Or I should look for some way to remove that data also, maybe by adding additional code in theCollection() for example?

What are the steps to change the Load Model? Currently I'm only doing the theLBManager()->setLoadModel(model_1_phase);. Is that an only required step or I missing something here?

Dec 13 '22 11:12 thearusable

Calling setLoadModel should be all that's necessary.

Is cleanup not happening at all for TestCol, or is it just a phase behind or something like that? If the cleanup would happen with some finite delay or some finite overage relative to theNodeLBData(), I'd be fine with loosening the test to reflect that.

It may be acceptable (for now, at least) if TestCol were to always retain at most (one more than) the maximum number of retained steps ever requested. The case we really want to avoid is unbounded data growth. As long as you can show that with the logic and tests, I'll support moving this functionality forward.

There's maybe an edge case of "extra data retained per call to setLoadModel()" if an application were to do something pathological, like alternately set models that wanted 10 and 20 steps of data retained repeatedly. Maybe we can accept something like that for now, with just an open issue to address it later.

That all said, if the data is/were retained in an ordered map, then perhaps instead of looping over keys we think need to be deleted, we could take a somewhat different approach to cleanup of (roughly)

for (auto i = data.begin(); *i < current_phase - retention; /* nothing */)
  i = data.erase(i);

That is, just delete anything the map may have that's older than what we're interested in.

Dec 21 '22 17:12 PhilMiller

Calling setLoadModel should be all that's necessary.

Is cleanup not happening at all for TestCol, or is it just a phase behind or something like that? If the cleanup would happen with some finite delay or some finite overage relative to theNodeLBData(), I'd be fine with loosening the test to reflect that.

It may be acceptable (for now, at least) if TestCol were to always retain at most (one more than) the maximum number of retained steps ever requested. The case we really want to avoid is unbounded data growth. As long as you can show that with the logic and tests, I'll support moving this functionality forward.

There's maybe an edge case of "extra data retained per call to setLoadModel()" if an application were to do something pathological, like alternately set models that wanted 10 and 20 steps of data retained repeatedly. Maybe we can accept something like that for now, with just an open issue to address it later.

That all said, if the data is/were retained in an ordered map, then perhaps instead of looping over keys we think need to be deleted, we could take a somewhat different approach to cleanup of (roughly)
for (auto i = data.begin(); *i < current_phase - retention; /* nothing */)
  i = data.erase(i);
That is, just delete anything the map may have that's older than what we're interested in.

Cleanup is not happening for TestCol, it will retain the maximum number of phases ever requested by the model. Cache data in node is stored in unordered_map so I kept the previous implementation, I cleaned it a little bit and move it to node_lb_data.

Dec 29 '22 12:12 thearusable

@PhilMiller I updated the implementation to use lower_bound. This PR can be reviewed again.

May 23 '23 15:05 thearusable

@nlslatt I updated the tests to better reflect how many and which phases are being kept.

Test test_lbdata_retention_last2 - has vt_lb_data_retention lower than what model needs. Test test_lbdata_config_retention_higher - has vt_lb_data_retention higher than what model needs.

Jun 06 '23 10:06 thearusable

@nlslatt, @lifflander - I rebased this PR and fixed all conflicts. PR is ready for review.

Oct 19 '23 13:10 thearusable

Things to do:

[x] Prepare benchmark for existing implementation
[x] Prepare different implementation based on std::vector and prepare benchmark for it.

Aug 13 '24 16:08 thearusable