
MueLu: SubFactoryMonitor and StackedTimers don't work well together

Open jhux2 opened this issue 2 years ago • 7 comments

Bug Report

@trilinos/muelu

MueLu's SubFactoryMonitors don't appear to play well with StackedTimer reporting.

[EDIT] This is on Crusher (AMD), so a platform-specific issue is possible.

Excerpt from a StackedTimer summary based on SubFactoryMonitors. Notice that all the time is in the last Remainder and that the SubFactoryMonitors (SFM) don't nest correctly.

|   |   |   |   MueLu: Ifpack2Smoother: Setup Smoother (total): 1.17603 - 99.9032% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: Get matrix from current level (sub, total): 1.072e-06 - 9.11542e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: Call 'SetupChebyshev' (sub, total): 6.11e-07 - 5.19545e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: Cast matrix to Tpetra::RowMatrix (sub, total): 5.11e-07 - 4.34513e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: Estimate max eigenvalue (sub, total): 2.11e-07 - 1.79417e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: Get lumped diagonal (sub, total): 5.71e-07 - 4.85532e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: SetPrecParameters: 0.0138927 - 1.18133% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: Preconditioner init (sub, total): 9.21e-07 - 7.83144e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: Preconditioner compute (sub, total): 3e-07 - 2.55096e-05% [1]
|   |   |   |   |   Ifpack2::Chebyshev::compute: 0.0239956 - 2.04039% [1]
|   |   |   |   |   |   Ifpack2: powerMethodWithInitGuess: 0.0191033 - 79.6119% [1]
|   |   |   |   |   |   Remainder: 0.00489225 - 20.3881%
|   |   |   |   |   MueLu: Ifpack2Smoother: Determine lambdaMax (sub, total): 7.21e-07 - 6.1308e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: toggle setup boolean (sub, total): 2.91e-07 - 2.47443e-05% [1]
|   |   |   |   |   MueLu: Ifpack2Smoother: print description (sub, total): 3.01e-07 - 2.55946e-05% [1]
|   |   |   |   |   Remainder: 1.13814 - 96.7778%
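For anyone reading these summaries: the numbers in the excerpt are consistent with a Remainder line being the parent timer's total minus the sum of its direct children. Below is a minimal sketch that checks this against the excerpt above. The names `untimedRemainder` and `kSfmChildren` are made up for illustration and are not part of Teuchos.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not Teuchos API): the part of a parent timer's total
// that is not covered by any of its direct child timers.
double untimedRemainder(double parentTotal,
                        const std::vector<double>& childTotals) {
  assert(parentTotal >= 0.0);
  const double covered =
      std::accumulate(childTotals.begin(), childTotals.end(), 0.0);
  return parentTotal - covered;
}

// Totals of the direct children of "Setup Smoother" in the
// SubFactoryMonitor excerpt, in the order they are printed.
const std::vector<double> kSfmChildren = {
    1.072e-06, 6.11e-07, 5.11e-07, 2.11e-07, 5.71e-07, 0.0138927,
    9.21e-07,  3e-07,    0.0239956, 7.21e-07, 2.91e-07, 3.01e-07};

// untimedRemainder(1.17603, kSfmChildren) is about 1.13814 s, roughly 96.8%
// of the parent, which agrees with the last "Remainder" line of the excerpt.
```

So almost none of the 1.17603 s in "Setup Smoother" is attributed to any named sub-timer in this run.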

Excerpt from a StackedTimer summary for the same code as above, but with the SubFactoryMonitors replaced by raw Teuchos::Timers. Notice that the time is now attributed correctly and the remainder is small. Notice that the timers are now nested correctly.

|   |   |   |   MueLu: Ifpack2Smoother: Setup Smoother (total): 1.20185 - 99.909% [1]
|   |   |   |   |   Get matrix from current level: 3.938e-06 - 0.000327663% [1]
|   |   |   |   |   get non-const ref to param list: 6.81e-07 - 5.66629e-05% [1]
|   |   |   |   |   Calll "SetupChebyshev": 1.2018 - 99.9959% [1]
|   |   |   |   |   |   Cast matrix to Tpetra::RowMatrix: 7.11e-07 - 5.91615e-05% [1]
|   |   |   |   |   |   Estimate max eigenvalue: 1.1635 - 96.8137% [1]
|   |   |   |   |   |   |   Get lumped diagonal: 1.16347 - 99.9969% [1]
|   |   |   |   |   |   |   Remainder: 3.5739e-05 - 0.00307167%
|   |   |   |   |   |   MueLu: Ifpack2Smoother: SetPrecParameters: 0.0142344 - 1.18443% [1]
|   |   |   |   |   |   Preconditioner init: 3.697e-06 - 0.000307623% [1]
|   |   |   |   |   |   Preconditioner compute: 0.0239759 - 1.99501% [1]
|   |   |   |   |   |   |   Ifpack2::Chebyshev::compute: 0.0239725 - 99.9861% [1]
|   |   |   |   |   |   |   |   Ifpack2: powerMethodWithInitGuess: 0.0191433 - 79.8551% [1]
|   |   |   |   |   |   |   |   Remainder: 0.00482924 - 20.1449%
|   |   |   |   |   |   |   Remainder: 3.337e-06 - 0.0139182%
|   |   |   |   |   |   Determine lambdaMax: 4.8824e-05 - 0.00406259% [1]
|   |   |   |   |   |   Remainder: 2.8696e-05 - 0.00238776%
|   |   |   |   |   toggle setup boolean: 8.11e-07 - 6.74796e-05% [1]
|   |   |   |   |   print description: 2.2614e-05 - 0.00188161% [1]
|   |   |   |   |   Remainder: 2.1712e-05 - 0.00180656%
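To make the difference between the two excerpts concrete, here is a toy model of a stacked timer with RAII scope guards. It is a sketch under stated assumptions: `ToyStackedTimer`, `ScopedTimer`, `nestedRun`, and `flatRun` are hypothetical, the clock is fake, and none of this is the Teuchos or MueLu implementation. It is also not a diagnosis of this bug. It only shows that guards kept alive across the work give a nested report, while guards that end before the work give flat siblings with near-zero times and a large parent remainder.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a stacked timer; NOT the Teuchos::StackedTimer API.
struct ToyStackedTimer {
  struct Node {
    std::string name;
    int parent;    // index of the enclosing timer, -1 at top level
    double start;  // fake-clock time at start()
    double total;  // elapsed fake-clock time, set by stop()
  };
  std::vector<Node> nodes;
  int current = -1;  // innermost running timer
  double now = 0.0;  // fake clock, advanced only by work()

  void start(const std::string& name) {
    nodes.push_back({name, current, now, 0.0});
    current = static_cast<int>(nodes.size()) - 1;
  }
  void stop() {
    assert(current >= 0 && "stop() without matching start()");
    nodes[current].total = now - nodes[current].start;
    current = nodes[current].parent;
  }
  void work(double seconds) { now += seconds; }
  int depth(int i) const {
    int d = 0;
    while (nodes[i].parent >= 0) { i = nodes[i].parent; ++d; }
    return d;
  }
};

// RAII guard: starts a timer on construction, stops it on destruction.
class ScopedTimer {
 public:
  ScopedTimer(ToyStackedTimer& t, const std::string& name) : t_(t) {
    t_.start(name);
  }
  ~ScopedTimer() { t_.stop(); }
  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  ToyStackedTimer& t_;
};

// Guards stay alive across the work: timers nest and the time is attributed.
ToyStackedTimer nestedRun() {
  ToyStackedTimer t;
  {
    ScopedTimer setup(t, "Setup Smoother");
    {
      ScopedTimer cheb(t, "SetupChebyshev");
      {
        ScopedTimer eig(t, "Estimate max eigenvalue");
        t.work(1.0);
      }
    }
  }
  return t;
}

// Guards end before the work: flat siblings, zero time, all in the remainder.
ToyStackedTimer flatRun() {
  ToyStackedTimer t;
  {
    ScopedTimer setup(t, "Setup Smoother");
    { ScopedTimer cheb(t, "SetupChebyshev"); }
    { ScopedTimer eig(t, "Estimate max eigenvalue"); }
    t.work(1.0);
  }
  return t;
}
```

In `nestedRun()` the "Estimate max eigenvalue" node sits two levels below "Setup Smoother" and holds the full second. In `flatRun()` it is a direct child with zero time, and the whole second ends up unattributed in the parent.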

jhux2 avatar Feb 23 '23 20:02 jhux2

Automatic mention of the @trilinos/muelu team

github-actions[bot] avatar Feb 23 '23 20:02 github-actions[bot]

That's unfortunate. Does that mean we need to fix SubFactoryMonitors somehow, or is it more difficult than that?

GrahamBenHarper avatar Feb 23 '23 20:02 GrahamBenHarper

Hopefully just SubFactoryMonitor itself.

jhux2 avatar Feb 23 '23 20:02 jhux2

@cgcgcg had some suggestions/questions:

  1. Is this a single-rank phenomenon? No, it also happens on multi-rank MPI jobs.
  2. Do the nightly performance runs show the same issue? No.
  3. Try running MueLu's driver from the performance build. It exhibits the same weird behavior.

Items 2) and 3) indicate this may be an environment issue.

jhux2 avatar Feb 23 '23 22:02 jhux2

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

github-actions[bot] avatar Feb 24 '24 12:02 github-actions[bot]

@lucbv Have you also seen this on Frontier?

jhux2 avatar Feb 24 '24 17:02 jhux2

@jhux2 I recently fixed something, maybe that was the reason? https://github.com/trilinos/Trilinos/pull/12753/commits/1f2cc316e4804138a9341518ca96f73e2c059b0a

cgcgcg avatar Feb 24 '24 18:02 cgcgcg