zmq occasionally hangs when a ps-lite app exits.
I've been using ps-lite intensively, specifically difacto in wormhole, which is built on ps-lite. Occasionally one difacto.dmlc process (which should be the scheduler) hangs after the workers and servers have finished all iterations. I have reproduced this in both local and YARN modes.
The stack dump of the hung process shows:
(gdb) bt
#0 0x00000031da4df343 in poll () from /lib64/libc.so.6
#1 0x000000000051cfca in zmq::signaler_t::wait (this=0x136eaf8, timeout_=-1) at src/signaler.cpp:218
#2 0x0000000000515be0 in zmq::mailbox_t::recv (this=0x136ea98, cmd_=0x7fffbbeb4670, timeout_=-1) at src/mailbox.cpp:80
#3 0x000000000050f03c in zmq::ctx_t::terminate (this=0x136ea00) at src/ctx.cpp:167
#4 0x000000000046ba14 in ps::Van::~Van (this=0x7b7e28, __in_chrg=<value optimized out>) at src/system/van.cc:24
#5 0x0000000000461aef in ps::Manager::~Manager (this=0x7b7cf8, __in_chrg=<value optimized out>) at src/system/manager.cc:16
#6 0x0000000000466b00 in ps::Postoffice::~Postoffice (this=0x7b7c40, __in_chrg=<value optimized out>) at src/system/postoffice.cc:8
#7 0x00000031da435e22 in exit () from /lib64/libc.so.6
#8 0x00000031da41ed24 in __libc_start_main () from /lib64/libc.so.6
#9 0x0000000000409b51 in _start ()
(gdb) info threads
3 Thread 0x7ff3e4ee3700 (LWP 14957) 0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
2 Thread 0x7ff3e44e2700 (LWP 14958) 0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
* 1 Thread 0x7ff3e4f66780 (LWP 14948) 0x00000031da4df343 in poll () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ff3e44e2700 (LWP 14958))]#0 0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
#1 0x000000000051518b in zmq::epoll_t::loop (this=0x1373200) at src/epoll.cpp:156
#2 0x00000000005277ee in thread_routine (arg_=0x1373280) at src/thread.cpp:96
#3 0x00000031dac079d1 in start_thread () from /lib64/libpthread.so.0
#4 0x00000031da4e8b6d in clone () from /lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7ff3e4ee3700 (LWP 14957))]#0 0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x00000031da4e9163 in epoll_wait () from /lib64/libc.so.6
#1 0x000000000051518b in zmq::epoll_t::loop (this=0x136ee30) at src/epoll.cpp:156
#2 0x00000000005277ee in thread_routine (arg_=0x136eeb0) at src/thread.cpp:96
#3 0x00000031dac079d1 in start_thread () from /lib64/libpthread.so.0
#4 0x00000031da4e8b6d in clone () from /lib64/libc.so.6
Can you guys look into this? @mli @tqchen
zmq::ctx_t::terminate() calls term_mailbox.recv() with a timeout of -1, i.e. it waits indefinitely:
// Wait till reaper thread closes all the sockets.
command_t cmd;
int rc = term_mailbox.recv (&cmd, -1);
But I don't think this is the root cause.
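To illustrate why a receive with timeout -1 blocks forever when the expected command never arrives, here is a minimal standalone sketch. The Mailbox class is a hypothetical stand-in built only on the C++ standard library; it is not libzmq's mailbox_t, and the integer command type and function names are assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>

// Hypothetical stand-in for a command mailbox (NOT libzmq's mailbox_t).
class Mailbox {
 public:
  // Enqueue a command and wake one waiting receiver.
  void send(int cmd) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(cmd);
    }
    cv_.notify_one();
  }

  // Returns true and fills *cmd if a command arrived in time.
  // timeout_ms < 0 waits forever, mirroring the -1 passed in terminate().
  bool recv(int* cmd, int timeout_ms) {
    assert(cmd != nullptr);
    std::unique_lock<std::mutex> lock(mu_);
    auto ready = [this] { return !queue_.empty(); };
    if (timeout_ms < 0) {
      cv_.wait(lock, ready);  // blocks until someone calls send()
    } else if (!cv_.wait_for(lock, std::chrono::milliseconds(timeout_ms),
                             ready)) {
      return false;  // timed out, nothing was sent
    }
    *cmd = queue_.front();
    queue_.pop();
    return true;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<int> queue_;
};

// With a bounded timeout the caller gets control back when nobody sends;
// with -1 the same call would never return.
bool recv_times_out_when_nobody_sends() {
  Mailbox box;
  int cmd = 0;
  return !box.recv(&cmd, 50);
}
```

In this sketch a bounded timeout only makes the hang visible to the caller. It does not explain why the sender never sends, which matches the point that the -1 timeout is probably not the root cause.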
Thanks for reporting. I'll check whether other zmq users have run into a similar problem.
Thanks for responding.
The hang happens while the program is destroying a static object during exit(). It isn't critical, though, since all iterations have completed and the resulting model has already been saved to the filesystem.
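One common way to keep blocking teardown out of static destructors is to shut the object down explicitly before main returns. The sketch below shows that pattern with a hypothetical Network singleton; the class, its methods, and the background loop are assumptions for illustration, not ps-lite's Postoffice or Van.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical process-wide object that owns a background I/O thread.
class Network {
 public:
  static Network& Get() {
    static Network instance;  // destroyed during exit(), like any static
    return instance;
  }

  // Start the background thread once; later calls are no-ops.
  void Start() {
    if (running_.exchange(true)) return;
    io_thread_ = std::thread([this] {
      while (running_.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  // Stop and join the background thread. Safe to call more than once.
  void Shutdown() {
    running_.store(false);
    if (io_thread_.joinable()) io_thread_.join();
  }

  bool IsRunning() const { return running_.load(); }

  // If Shutdown() was already called, the destructor has nothing to wait for.
  ~Network() { Shutdown(); }

 private:
  Network() = default;
  std::atomic<bool> running_{false};
  std::thread io_thread_;
};

// Intended call order: tear down explicitly while the rest of the process
// is still alive, instead of relying on static destruction order.
void RunAndShutdown() {
  Network::Get().Start();
  // ... application work would go here ...
  Network::Get().Shutdown();
  assert(!Network::Get().IsRunning());
}
```

Calling RunAndShutdown() at the end of main means the static destructor later finds nothing left to join. Whether the same ordering would avoid the hang reported here is not established by this thread.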
I've seen something similar when running on YARN. I didn't dig into it in detail, but the basic symptom is that the YARN job finishes fitting but never exits.