NA OFI: hg_test* hangs using verbs;ofi_rxm with concurrent handles

Open carns opened this issue 6 years ago • 8 comments

Examples include hg_test_rpc, hg_test_rpc_lat, and hg_test_write_bw, all of which work correctly until the phase of the test that uses multiple handles or concurrent RPCs. This is with libfabric 1.6 and the mlx5_0 interface.

They still hang when #231 is applied.

carns avatar Aug 06 '18 19:08 carns

Current Mercury origin/master runs all of these tests to completion when built against current libfabric origin/master (without applying any additional PRs).

Need to test this further. What about 1.6.1? What do we do with #231 (maybe we should clean up in the other direction and remove some special cases?).

carns avatar Aug 07 '18 21:08 carns

Since support for verbs RDM has been dropped, I'd be inclined to keep only one of them...

soumagne avatar Aug 07 '18 21:08 soumagne

Agreed re: eliminating the extra verbs path. I just opened #236 to track that separately and closed #231 since that isn't the right approach.

We can continue to use this issue to track the deadlock specifically.

carns avatar Aug 08 '18 00:08 carns

I get the same deadlock when using libfabric 1.6.1 built from a spack package. I'll try creating a spack package that pulls from libfabric master to see if that makes a difference, or whether some other difference in my manual libfabric build accounted for the change.

carns avatar Aug 08 '18 13:08 carns

That observation about the libfabric version was a false alarm; I also get the hangs if I build libfabric with a modified spack package that pulls from origin/master.

The difference was probably an unrelated coincidence, like my manual build being built with debugging symbols.

Need to go back and debug properly.

carns avatar Aug 08 '18 13:08 carns

I was able to trigger the hang while running the example (in particular, the hg_test_read_bw example with #237 applied) with debugging symbols.

The server has one thread running a progress loop while the rest wait on a condition variable. This is probably normal unless an operation has gotten lost and didn't signal completion.
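For readers unfamiliar with that pattern, here is a minimal sketch of it in plain pthreads. It is not Mercury code: `completion_state`, `progress_loop`, `wait_for`, and `run_demo` are hypothetical names invented for illustration. One thread plays the progress loop and signals each completion; the waiter blocks on the condition variable until the expected count is reached. If a completion were never signalled, the waiter would stay blocked in `pthread_cond_wait()`, which would look much like the state described above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a server's shared completion state. */
struct completion_state {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             completed;   /* operations finished so far */
};

struct progress_args {
    struct completion_state *state;
    int                      total;   /* operations to complete */
};

/* Progress thread: "completes" each operation and signals the waiters.
 * An operation that got lost would never reach the broadcast below. */
static void *progress_loop(void *arg)
{
    struct progress_args *pa = arg;
    for (int i = 0; i < pa->total; i++) {
        pthread_mutex_lock(&pa->state->lock);
        pa->state->completed++;
        pthread_cond_broadcast(&pa->state->cond);
        pthread_mutex_unlock(&pa->state->lock);
    }
    return NULL;
}

/* Waiter: blocks until at least `target` operations have completed. */
static void wait_for(struct completion_state *s, int target)
{
    pthread_mutex_lock(&s->lock);
    while (s->completed < target)
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
}

/* Runs one progress thread, waits for it, and returns the final count
 * (or -1 if the thread could not be created). */
static int run_demo(int total)
{
    assert(total >= 0);

    struct completion_state s;
    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.cond, NULL);
    s.completed = 0;

    struct progress_args pa = { &s, total };
    pthread_t tid;
    if (pthread_create(&tid, NULL, progress_loop, &pa) != 0)
        return -1;

    wait_for(&s, total);
    pthread_join(tid, NULL);

    int done = s.completed;
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return done;
}
```

Calling `run_demo(4)` returns once all four simulated operations have been signalled. Having the progress thread skip one of them would leave `wait_for()` blocked indefinitely.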

The client is in an HG_Progress() call with an unusually long timeout, which stems from test_read_bw.c calling hg_request_wait() with HG_MAX_IDLE_TIME. I tried recompiling with smaller values ranging from 0 to 1000 in the hg_request_wait() timeout argument so that I could observe data structures while it re-enters the HG_Progress call. Lower timeout values cause the test program to hit this error, though:

# HG -- Error -- /home/carns/working/src/mochi/mercury/src/mercury_core.c:4717
 # HG_Core_forward(): Not safe to use HG core handle, handle is still in use, refcount: 2
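One plausible reading of that error (an assumption, not something confirmed in this thread) is that with a short timeout the wait returns before the operation has completed, and the test then reuses a handle that is still in flight. The sketch below illustrates the general idea with plain pthreads. It is not the `hg_request` implementation, and `struct request`, `request_init`, `request_complete`, and `request_wait_ms` are hypothetical names. The point is only that a timed wait has two outcomes, and the caller has to check which one occurred before reusing any resource tied to the request.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical request object; not part of the Mercury API. */
struct request {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             completed;
};

static void request_init(struct request *r)
{
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->cond, NULL);
    r->completed = 0;
}

/* Called by whoever finishes the operation (e.g. a completion callback). */
static void request_complete(struct request *r)
{
    pthread_mutex_lock(&r->lock);
    r->completed = 1;
    pthread_cond_signal(&r->cond);
    pthread_mutex_unlock(&r->lock);
}

/* Waits up to timeout_ms. Returns 1 if the request completed,
 * 0 if the wait timed out with the request still outstanding. */
static int request_wait_ms(struct request *r, unsigned int timeout_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&r->lock);
    int rc = 0;
    while (!r->completed && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&r->cond, &r->lock, &deadline);
    int done = r->completed;
    pthread_mutex_unlock(&r->lock);
    return done;
}
```

A caller that wants to inspect state between waits would loop on `request_wait_ms()` and only reuse the associated handle once it returns 1; treating a timeout as completion is what this sketch is meant to warn against.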

Are there any test programs that stress concurrency without using the hg_request api? I'd like to narrow it down to something simpler. If not, I can set up margo runs with multiple OS threads (the benchmarks are capable of this, I just don't normally run it that way) to try a different but not necessarily simpler code path.

carns avatar Aug 09 '18 02:08 carns

Actually the margo version of this (using margo-p2p-bw and varying the -c (concurrency) and -T (number of os threads)) is fairly easy to run. I can't reproduce the hang with it, though. I tried 3 cases, where the concurrency is less than, equal to, or greater than the number of os threads, and none of them exhibited a problem.

I'm marking this as a "minor" issue now, because as best I can tell it is only triggered with the hg_request api. Not sure how widely used that is outside of the test programs right now.

carns avatar Aug 09 '18 02:08 carns

Hmm ok thanks, I will have a look at the hg_request stuff. No, all the tests use that for now...

soumagne avatar Aug 09 '18 04:08 soumagne