optionally wait passively when the progress loop has been idle for a while
Add the new mpi_poll_when_idle and mpi_poll_threshold MCA parameters to control whether the progress loop should poll() when idle and after how long such polling should start.
The default is not to poll when idle.
Thanks to Paul Kapinos for bringing this to our attention.
Signed-off-by: Gilles Gouaillardet [email protected]
@jsquyres can you please review this ?
I'm no expert on poll, but I am curious to know how calling poll with a NULL argument will impact the event library.
@jsquyres i merged both mechanisms per your comment
mpirun --mca mpi_yield_when_idle true ...
will simply sched_yield(). In order to start sleeping after a while, just do
mpirun --mca mpi_yield_when_idle true --mca mpi_sleep_when_idle_threshold <value> ...
with a positive value.
@rhc54 i am not sure i fully understand your concern.
as far as i understand poll(NULL, 0, 1) is just a way to sleep for 1 millisecond.
usleep() could be used (select() could even be used here), but i am not sure it is available on all platforms.
or are you saying we should return from opal_progress() asap and use libevent timeout instead ?
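For reference, a minimal sketch of the sleep-via-poll idiom mentioned above; this is plain POSIX usage, not code taken from this PR:

```c
#include <poll.h>

/* Sleep for roughly 1 millisecond without watching any file descriptor:
 * passing no pollfd entries (NULL, 0) makes poll() act purely as a
 * millisecond-granularity timeout. */
static inline void idle_sleep_1ms(void)
{
    (void) poll(NULL, 0, 1);
}
```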
Your understanding matches my own - it was your comments in this PR that caused my confusion. You seemed to imply that somehow we were using the threshold to start polling file descriptors, but that isn't what you were doing at all - you're just sleeping to cause the scheduler to kick us out. It was very confusing.
ok, i will do some rewording: use sleep instead of poll, and add a note on how sleep is implemented.
@jsquyres i share the same vision.
just to be clear, what should be the default (e.g. spin only vs spin then yield then poll)? should it depend on whether we oversubscribe or not?
you have a good point with respect to MPI_Test, and i guess the same applies to MPI_Iprobe and MPI_Improbe (are we missing any others?)
@ggouaillardet Yes, perhaps the demarcation line should be exiting the MPI library. I.e., when MPI_TEST (or MPI_IPROBE or MPI_ISEND or ...) returns, the counters -- or whatever measures of "contiguous" we use -- should be reset.
As for what the default should be, I'm not sure. I have dim recollections of some vendor MPI touting the power efficiencies of doing spin-then-yield by default a while ago (which is a dubious claim at best -- if your program is doing nothing for long enough that you frequently get into "MPI can yield the processor without harming performance" scenarios, then your program is not efficient to begin with, and any power efficiencies gained by spin-then-yield probably mean that you're only wasting less energy than you were before).
This is probably a topic best discussed by the community -- others may have direct experience with this kind of thing. @bosilca @gpapaure @jjhursey @edgargabriel @artpol84 @jladd-mlnx @rhc54 ...etc. -- anyone have an opinion here?
I can't speak to the performance issue, but I have seen a vendor make that claim. We have repeatedly gotten questions raised on the mailing list when users are "surprised" to see 100% cpu utilization, and several of us have gone to significant lengths to explain why it isn't an issue over the years.
I'd say just make it the default to idle so we quit having to explain it, but I'm not by any means sold on that position.
- This patch delays the low-priority callbacks (which contribute to counting the events) by pausing before giving them a chance to trigger.
- It also fails to protect itself in multithreaded scenarios, unlike most of the surrounding code.
- Why are the MCA parameters at the MPI level, with accessors added to OPAL so that we can force a non-coordinated behavior on the progress engine? Why not have everything at a consistent level, down in OPAL?
@ggouaillardet I have seen better performance when using nanosleep to put the current thread to sleep. Might be worth looking at that vs poll. The Linux implementation seems pretty fast.
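For comparison, a minimal sketch of the nanosleep()-based variant suggested above (again generic POSIX, not code from this PR):

```c
#include <time.h>

/* Sleep for the requested number of microseconds using nanosleep(),
 * which on Linux tends to have low overhead and, unlike poll(NULL, 0, 1),
 * allows sub-millisecond pauses. */
static inline void idle_nanosleep(long usec)
{
    struct timespec ts = {
        .tv_sec  = usec / 1000000,
        .tv_nsec = (usec % 1000000) * 1000
    };
    (void) nanosleep(&ts, NULL);
}
```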
@ggouaillardet what was the original issue? Can you provide the reference? We have seen issues with Slurm daemons being starved of CPU by spinning in the direct modex case: some processes may have all they need and go off computing, while remote procs are still asking for the EP and the local PMIx server is delayed in responding because it is pushed aside by the app procs.
@artpol84 please refer to the thread starting at https://www.mail-archive.com/[email protected]//msg20407.html
long story short, if we while (...) sched_yield();, then top reports 100% CPU usage even though the system remains very responsive, since the MPI app spends its time yielding. The goal of this PR is to (virtually) stop CPU usage when nothing is happening.
in the case of paraview with MPI (which is an interactive program), MPI tasks spend most of their time in sched_yield(), which can be confusing when top reports 100% usage.
Can one of the admins verify this patch?
@bosilca do we want this for v5.0?
The idea is still of interest, but this PR is stale. If I summarize, we want a straightforward approach: after X unsuccessful polls we start yielding, and then after another Y unsuccessful polls we nanosleep for a duration Z.
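A minimal sketch of that spin/yield/sleep progression; the names and threshold values below are illustrative, not existing Open MPI symbols, and per the earlier discussion the idle counter would also be reset whenever an MPI call such as MPI_Test returns to the user:

```c
#include <sched.h>
#include <time.h>

/* Illustrative thresholds: X unsuccessful progress iterations before
 * yielding, Y more before sleeping, and a Z-nanosecond sleep. */
#define IDLE_YIELD_THRESHOLD  1000    /* X */
#define IDLE_SLEEP_THRESHOLD  10000   /* Y, counted after X */
#define IDLE_SLEEP_NSEC       100000  /* Z = 0.1 ms */

static int idle_count = 0;  /* would also be reset on every return from the MPI layer */

static void backoff_when_idle(int events_handled)
{
    if (events_handled > 0) {
        idle_count = 0;               /* work was done: stay in spin mode */
        return;
    }
    idle_count++;
    if (idle_count > IDLE_YIELD_THRESHOLD + IDLE_SLEEP_THRESHOLD) {
        struct timespec ts = { .tv_sec = 0, .tv_nsec = IDLE_SLEEP_NSEC };
        nanosleep(&ts, NULL);         /* passive wait: CPU usage drops */
    } else if (idle_count > IDLE_YIELD_THRESHOLD) {
        sched_yield();                /* give the core to other runnable tasks */
    }
    /* else: keep spinning for lowest latency */
}
```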
@ggouaillardet You have several PRs that date back 3-6 years - would it make sense for you to triage them and close the ones not worth rebasing, fixing, resubmitting for review, and finally committing?