gloo

Cannot run with SoftRoCE

Open Luo-Liang opened this issue 6 years ago • 2 comments

Hello!

Gloo runs fine with RoCE, but it gets stuck with SoftRoCE.

It should just run out of the box, but it looks like it cannot get past send. The rendezvous seems fine.

Do you have any ideas?

Aug 10 '18 03:08 Luo-Liang

Do you have more information about the setup, such as environment details or stack traces?

To my knowledge, we have never tried running this with SoftRoCE.

Aug 11 '18 07:08 pietern

Hi! Thanks for responding!

Let me start with the stack trace. Please let me know if any other information is needed to narrow this down.

Starting arguments:

sudo ./benchmark -s 2 -r 0 -h 172.31.8.100 -t ibverbs --ib-device=rxe0,rxe2,rxe3,rxe1 --elements 104857600 --iteration-time 10 allreduce_halving_doubling

sudo ./benchmark -s 2 -r 1 -h 172.31.8.100 -t ibverbs --tcp-device=rxe1,rxe2,rxe3,rxe0 --elements 104857600 --iteration-time 10 allreduce_halving_doubling

Thread 3 (Thread 0x7f42311f8700 (LWP 23394)):
#0  0x00007f42342a29f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x56464212f750) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x56464212f700, cond=0x56464212f728)
    at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x56464212f728, mutex=0x56464212f700)
    at pthread_cond_wait.c:655
#3  0x00007f4234a1f83c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x000056464176d7dc in gloo::benchmark::RunnerThread::spawn (this=0x56464212f6f0)
    at /home/ubuntu/gloo/gloo/benchmark/runner.cc:428
#5  0x00007f4234a2561f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f423429c6db in start_thread (arg=0x7f42311f8700) at pthread_create.c:463
#7  0x00007f4233c2788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f42319f9700 (LWP 23393)):
#0  0x00007f4233c1abf9 in __GI___poll (fds=fds@entry=0x7f42319f8ca8, nfds=nfds@entry=1,
    timeout=timeout@entry=10) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00005646417ab702 in poll (__timeout=10, __nfds=1, __fds=0x7f42319f8ca8)
    at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  gloo::transport::ibverbs::Device::loop (this=0x56464212eb70)
    at /home/ubuntu/gloo/gloo/transport/ibverbs/device.cc:197
#3  0x00007f4234a2561f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f423429c6db in start_thread (arg=0x7f42319f9700) at pthread_create.c:463
#5  0x00007f4233c2788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f4234924740 (LWP 23392)):
#0  0x00007f42342a2f85 in futex_abstimed_wait_cancelable (private=<optimized out>,
    abstime=0x7fffe8c08ed0, expected=0, futex_word=0x56464212ef10)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  __pthread_cond_wait_common (abstime=0x7fffe8c08ed0, mutex=0x56464212eec0,
    cond=0x56464212eee8) at pthread_cond_wait.c:539
#2  __pthread_cond_timedwait (cond=0x56464212eee8, mutex=0x56464212eec0,
    abstime=0x7fffe8c08ed0) at pthread_cond_wait.c:667
#3  0x00005646417bb41d in __gthread_cond_timedwait (__abs_timeout=0x7fffe8c08ed0,
    __mutex=<optimized out>, __cond=0x56464212eee8)
    at /usr/include/x86_64-linux-gnu/c++/7/bits/gthr-default.h:871
#4  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x56464212eee8)
    at /usr/include/c++/7/condition_variable:166
#5  std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x56464212eee8)
    at /usr/include/c++/7/condition_variable:106
#6  std::condition_variable::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> >, gloo::transport::ibverbs::Pair::getMemoryRegion(int)::<lambda()> > (__p=..., __atime=..., __lock=..., this=0x56464212eee8)
    at /usr/include/c++/7/condition_variable:129
#7  std::condition_variable::wait_for<long int, std::ratio<1, 1000>, gloo::transport::ibverbs::Pair::getMemoryRegion(int)::<lambda()> > (__p=..., __rtime=..., __lock=...,
    this=0x56464212eee8) at /usr/include/c++/7/condition_variable:145
#8  gloo::transport::ibverbs::Pair::getMemoryRegion (this=this@entry=0x56464212ee30,
    slot=0) at /home/ubuntu/gloo/gloo/transport/ibverbs/pair.cc:279
#9  0x00005646417bb903 in gloo::transport::ibverbs::Pair::send (this=0x56464212ee30,
    buffer=buffer@entry=0x564642132560, offset=offset@entry=0, length=length@entry=32,
    roffset=roffset@entry=0) at /home/ubuntu/gloo/gloo/transport/ibverbs/pair.cc:487
#10 0x00005646417ceb1d in gloo::transport::ibverbs::Buffer::send (this=0x564642132560,
    offset=0, length=32, roffset=0)
    at /home/ubuntu/gloo/gloo/transport/ibverbs/buffer.cc:193
#11 0x000056464177d754 in gloo::rendezvous::ContextFactory::makeContext (
    this=0x564642132210,
    dev=std::shared_ptr<gloo::transport::Device> (use count 4, weak count 1) = {...})
    at /home/ubuntu/gloo/gloo/rendezvous/context.cc:170
#12 0x000056464176db0b in gloo::benchmark::Runner::newContext (
    this=this@entry=0x7fffe8c09fb0) at /home/ubuntu/gloo/gloo/benchmark/runner.cc:191
#13 0x0000564641770484 in gloo::benchmark::Runner::Runner (this=0x7fffe8c09fb0,
    options=...) at /home/ubuntu/gloo/gloo/benchmark/runner.cc:107
#14 0x000056464172f558 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/gloo/gloo/benchmark/main.cc:425

Aug 13 '18 04:08 Luo-Liang