Trilinos
Teuchos: Add option to barrier/fence timers
Question
@trilinos/teuchos
Teuchos timer regions can sometimes be misleading when an MPI rank has to wait for other ranks, or when CUDA_LAUNCH_BLOCKING is off and there is no fence to stop a rank from continuing. This is based on some runs I performed while looking at #6079. Is it possible to add an option to insert barriers/fences within a timed region? There are probably multiple ways of implementing this, and there are probably issues, so it would be good to get some discussion. Maybe something like this:
#include <mpi.h>
#include <Kokkos_Core.hpp>

Kokkos::fence();    // finish any previously launched device kernels
MPI_Barrier(comm);  // line up all ranks before starting the clock
Kokkos::Timer timer;
func();             // the work being timed
Kokkos::fence();    // wait for func()'s kernels before reading the clock
double time = timer.seconds();
I could see something similar being done for the stacked timer start/stop, with the fence/barrier wrapped in a runtime or compile-time option; a rough sketch is below.
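For the stacked-timer case, the wrapper might look roughly like this (a sketch only: Teuchos::StackedTimer::start() is real, but startSynced and the doSync flag are hypothetical names invented here):

#include <string>
#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <Teuchos_StackedTimer.hpp>

// Hypothetical wrapper: optionally fence/barrier before delegating to the
// real StackedTimer::start(); a matching stopSynced() would do the same
// before stop().
void startSynced(Teuchos::StackedTimer& timer, const std::string& name,
                 MPI_Comm comm, bool doSync) {
  if (doSync) {
    Kokkos::fence();    // drain any outstanding device kernels
    MPI_Barrier(comm);  // line the ranks up before the clock starts
  }
  timer.start(name);
}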
This is preventing us from turning off CUDA_LAUNCH_BLOCKING completely.
If Kokkos profiling is enabled, Teuchos timers include Kokkos profile regions, which will fence.
Perfect, that might be all I need to stop using CUDA_LAUNCH_BLOCKING. Any ideas for misleading timers when an MPI rank has to wait for other ranks? It makes it difficult to see which region has issues.
Also, is Kokkos profiling a runtime option?
Any ideas for misleading timers when an mpi rank has to wait for other ranks? It makes it difficult to see which region has issues.
@jewatkins You could either look at min/max/avg or, if using stacked timers, at the timer histogram. Would that be sufficient for your needs?
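For reference, a rough sketch of pulling those statistics out; TimeMonitor::summarize and StackedTimer::report exist, but the OutputOptions field names here are from memory and may differ across Teuchos versions:

#include <iostream>
#include <Teuchos_DefaultComm.hpp>
#include <Teuchos_StackedTimer.hpp>
#include <Teuchos_TimeMonitor.hpp>

void reportTimers(const Teuchos::RCP<Teuchos::StackedTimer>& timer) {
  // Flat timers: prints min/mean/max over ranks for every counter.
  Teuchos::TimeMonitor::summarize(std::cout);

  // Stacked timers: request min/max and a per-rank histogram.
  auto comm = Teuchos::DefaultComm<int>::getComm();
  Teuchos::StackedTimer::OutputOptions opts;
  opts.output_minmax = true;
  opts.output_histogram = true;
  opts.num_histogram = 5;
  timer->report(std::cout, comm, opts);
}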
If there is load imbalance among MPI processes, I was wondering how that will show up. We've seen load imbalance in compute (e.g., local SuperLU with an irregular mesh), but because we did not have an MPI_Barrier before the solve, couldn't the imbalance also show up in the solve (e.g., in GMRES's orthogonalization)?
@jhux2 That's usually what I do. If a particular timer region has an MPI barrier and I see that the min is small but the average is large, then I assume (although that's probably not always true) that one or more ranks are waiting for another rank to "catch up". As @iyamazaki mentioned, there might be a timer region with load imbalance, but it's not clear which one it could be (or whether the imbalance is even captured in a timer region). I think a well-placed MPI barrier before a timer region, as in the sketch below, at least rules out specific regions as the issue. It would be good to have that as a runtime option instead of having to manually add the barrier and rebuild.
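A minimal sketch of that manual pattern, using the standard Teuchos::TimeMonitor API (the counter name and func() are placeholders):

#include <mpi.h>
#include <Teuchos_TimeMonitor.hpp>

void func();  // placeholder for the work being timed

void timedRegion(MPI_Comm comm) {
  static Teuchos::RCP<Teuchos::Time> t =
      Teuchos::TimeMonitor::getNewCounter("My Region");
  MPI_Barrier(comm);  // time spent waiting on slow ranks lands *before* t
  Teuchos::TimeMonitor mon(*t);  // starts t; stops when mon leaves scope
  func();
}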
Also, is Kokkos profiling a runtime option?
Yes. You just need to set the right environment variable to point to the Kokkos profiling library of your choice.
See: https://github.com/kokkos/kokkos-tools/wiki
@jewatkins Fwiw, MueLu has a runtime option to add barriers to its timers.
Also, is Kokkos profiling a runtime option?
Yes. You just need to set the right environment variable to point to the Kokkos profiling library of your choice.
See: https://github.com/kokkos/kokkos-tools/wiki
Ah okay, I was thinking of a runtime option to just turn on timer fencing, without having to build/use a Kokkos profiling tool that otherwise wouldn't be used. Victor suggested this already. I think I'm 0/2 so far on making this a thing :(
@jewatkins Fwiw, MueLu has a runtime option to add barriers to its timers.
Very nice, I'll have to try that. If it's in Teuchos, maybe we can avoid having to do that for all packages?
@jewatkins - there are runtime functions you can call to set the profiler in Kokkos, to avoid having to export environment variables; see the sketch below. We do this in EMPIRE to set our own profiling tools. I can show you this after our meeting later today.
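Presumably something along these lines, using the callback setters in Kokkos::Tools::Experimental (a sketch; these setters exist in recent Kokkos versions, though the available callbacks vary by version):

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Register profiling callbacks at runtime instead of exporting an
    // environment variable pointing at a profiling library.
    Kokkos::Tools::Experimental::set_push_region_callback(
        [](const char* name) { std::printf("enter region: %s\n", name); });
    Kokkos::Tools::Experimental::set_pop_region_callback(
        []() { std::printf("exit region\n"); });

    Kokkos::Profiling::pushRegion("demo");
    // ... timed work ...
    Kokkos::Profiling::popRegion();
  }
  Kokkos::finalize();
  return 0;
}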
There is also a Teuchos::SyncTimeMonitor, but I don't think there is a global option to swap that in for the regular time monitor.
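A sketch of using SyncTimeMonitor directly, assuming its constructor still takes a timer and a communicator; it barriers the communicator in its destructor, so the wait is charged to this region rather than the next:

#include <Teuchos_DefaultComm.hpp>
#include <Teuchos_TimeMonitor.hpp>

void func();  // placeholder for the work being timed

void timedSolve() {
  auto comm = Teuchos::DefaultComm<int>::getComm();
  Teuchos::RCP<Teuchos::Time> t =
      Teuchos::TimeMonitor::getNewCounter("My Solve");
  {
    // Like TimeMonitor, but barriers comm before stopping the timer, so
    // ranks that finish early wait here instead of in the next region.
    Teuchos::SyncTimeMonitor mon(*t, comm.ptr());
    func();
  }
}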
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open, please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This issue was closed after 395 days of inactivity.