
Zmq occasionally hangs when ps app exits.

Open BaiGang opened this issue 10 years ago • 4 comments

I've been intensively using ps-lite, specifically difacto in wormhole based on ps-lite. It occasionally happens that one difacto.dmlc process (which should be the scheduler) hangs after the workers/servers have finished all the iterations. I reproduced this issue both in local and YARN modes.

And the thread stack dump shows:

(gdb) bt
#0  0x00000031da4df343 in poll () from /lib64/libc.so.6
#1  0x000000000051cfca in zmq::signaler_t::wait (this=0x136eaf8, timeout_=-1) at src/signaler.cpp:218
#2  0x0000000000515be0 in zmq::mailbox_t::recv (this=0x136ea98, cmd_=0x7fffbbeb4670, timeout_=-1) at src/mailbox.cpp:80
#3  0x000000000050f03c in zmq::ctx_t::terminate (this=0x136ea00) at src/ctx.cpp:167
#4  0x000000000046ba14 in ps::Van::~Van (this=0x7b7e28, __in_chrg=<value optimized out>) at src/system/van.cc:24
#5  0x0000000000461aef in ps::Manager::~Manager (this=0x7b7cf8, __in_chrg=<value optimized out>) at src/system/manager.cc:16
#6  0x0000000000466b00 in ps::Postoffice::~Postoffice (this=0x7b7c40, __in_chrg=<value optimized out>) at src/system/postoffice.cc:8
#7  0x00000031da435e22 in exit () from /lib64/libc.so.6
#8  0x00000031da41ed24 in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000409b51 in _start ()
(gdb) info threads
  3 Thread 0x7ff3e4ee3700 (LWP 14957)  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
  2 Thread 0x7ff3e44e2700 (LWP 14958)  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
* 1 Thread 0x7ff3e4f66780 (LWP 14948)  0x00000031da4df343 in poll () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ff3e44e2700 (LWP 14958))]#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
#1  0x000000000051518b in zmq::epoll_t::loop (this=0x1373200) at src/epoll.cpp:156
#2  0x00000000005277ee in thread_routine (arg_=0x1373280) at src/thread.cpp:96
#3  0x00000031dac079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000031da4e8b6d in clone () from /lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7ff3e4ee3700 (LWP 14957))]#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
#1  0x000000000051518b in zmq::epoll_t::loop (this=0x136ee30) at src/epoll.cpp:156
#2  0x00000000005277ee in thread_routine (arg_=0x136eeb0) at src/thread.cpp:96
#3  0x00000031dac079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000031da4e8b6d in clone () from /lib64/libc.so.6

Can you guys look into this? @mli @tqchen

BaiGang avatar Nov 24 '15 11:11 BaiGang

zmq::ctx_t::terminate() calls term_mailbox.recv() with a timeout of -1, which disables the timeout, so the call blocks until a command arrives:

        //  Wait till reaper thread closes all the sockets.
        command_t cmd;
        int rc = term_mailbox.recv (&cmd, -1);

But I don't think this is the root cause.
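
To illustrate why a receive with the timeout disabled can hang forever, here is a minimal sketch using only the C++ standard library. `ToyMailbox` and its methods are hypothetical names made up for this example; this is not libzmq's `zmq::mailbox_t`, only a toy with the same "negative timeout means wait indefinitely" convention seen in the backtrace.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>

// Toy stand-in for a command mailbox (hypothetical, NOT libzmq code).
class ToyMailbox {
 public:
  // Enqueue a command and wake one waiting receiver.
  void send(int cmd) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(cmd);
    }
    cv_.notify_one();
  }

  // timeout_ms < 0: block until a command arrives (like timeout_=-1 in the trace).
  // timeout_ms >= 0: return false if nothing arrives within the timeout.
  bool recv(int* cmd, int timeout_ms) {
    std::unique_lock<std::mutex> lock(mu_);
    auto ready = [this] { return !queue_.empty(); };
    if (timeout_ms < 0) {
      cv_.wait(lock, ready);  // never returns if nobody ever calls send()
    } else if (!cv_.wait_for(lock, std::chrono::milliseconds(timeout_ms), ready)) {
      return false;  // timed out with an empty queue
    }
    assert(!queue_.empty());
    *cmd = queue_.front();
    queue_.pop();
    return true;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<int> queue_;
};
```

With this toy, `recv(&cmd, 20)` on an empty mailbox returns false after about 20 ms, whereas `recv(&cmd, -1)` on an empty mailbox would block until some other thread calls `send()`. In the backtrace above, the main thread is in the second situation: it waits for a command that has not arrived.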

BaiGang avatar Nov 24 '15 11:11 BaiGang

Thanks for reporting. I'll look into whether other zmq users have run into a similar problem.

mli avatar Nov 24 '15 14:11 mli

Thanks for responding.

This happens while the program is destroying a static object during exit(). It isn't fatal, though, since all iterations have finished and the resulting model has already been saved to the filesystem.
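
One common way to keep blocking teardown out of static destruction is to give the object an explicit, idempotent stop method and call it before main() returns, so the destructor that runs inside exit() has nothing left to wait on. The sketch below only illustrates that pattern under stated assumptions: `ToyVan`, `Stop()`, and the event log are hypothetical names for this example and are not ps-lite's actual API, and this is not a confirmed fix for the hang.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an object that owns network resources.
class ToyVan {
 public:
  // log must outlive this object; it records teardown events for inspection.
  explicit ToyVan(std::vector<std::string>* log) : log_(log) {
    assert(log != nullptr);
  }

  // Explicit, idempotent shutdown. Real code would close sockets here,
  // at a point where other threads are still in a known state.
  void Stop() {
    if (stopped_) return;
    stopped_ = true;
    log_->push_back("stop");
  }

  // The destructor is only a fallback. If Stop() already ran, it does nothing,
  // so destruction during exit() cannot block on teardown.
  ~ToyVan() { Stop(); }

  bool stopped() const { return stopped_; }

 private:
  std::vector<std::string>* log_;
  bool stopped_ = false;
};
```

If the application calls `Stop()` at the end of its work, the log holds exactly one "stop" entry and the later destructor call is a no-op. If `Stop()` is never called, the destructor still performs the teardown, which is the situation the backtrace shows for a static object.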

BaiGang avatar Nov 30 '15 13:11 BaiGang

I've seen something similar when running on YARN. I didn't dig into it in detail, but the basic symptom is that the YARN job finishes fitting but never exits.

neggert avatar Nov 30 '15 15:11 neggert