
MistController fails to start other components

Open leszko opened this issue 3 years ago • 3 comments

There is a race condition that makes execution of the livepeer/catalyst container flaky. As a result, sometimes when catalyst is started there are no logs at all and the only running process is MistController.

Steps to reproduce

1. Start multiple livepeer/catalyst containers and find the ones with no logs

for i in $(seq 0 100); do CID=$(docker run -d --rm livepeer/catalyst); sleep 5; LOGS=$(docker logs $CID); [[ -z "$LOGS" ]] || docker kill $CID; done

This command starts the containers one by one and kills every one that produces logs, leaving only those with no logs. Then check the running containers; if there are none, retry the command above. In my case, there was always at least one running container left.

$ docker ps
CONTAINER ID   IMAGE                          COMMAND                  CREATED         STATUS         PORTS                                    NAMES
79e3ec38e0c6   livepeer/catalyst              "/usr/bin/MistContro…"   8 minutes ago   Up 8 minutes   1935/tcp, 4242/tcp, 8080/tcp, 8889/udp   sleepy_poincare

2. Exec into the container and check the running processes

You should find that the only running process is MistController.

$ docker exec -it 79e3ec38e0c6 sh
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  16080  4340 ?        Ssl  13:47   0:00 /usr/bin/MistController -c /etc/livepeer/catalyst.json
root        14  0.1  0.0   2060   496 pts/0    Ss   13:57   0:00 sh
root        20  0.0  0.0   5480  2384 pts/0    R+   13:57   0:00 ps aux

leszko avatar Jul 29 '22 13:07 leszko

Interesting! Can you get a gdb backtrace of the "stuck" controller?

Thulinma avatar Jul 29 '22 22:07 Thulinma

There are always 2 MistController processes running. Here are the backtraces.

# gdb MistController 26335
...
MistController: No such file or directory.
Attaching to process 26335
[New LWP 26368]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
0x00007f0f39222c7f in __GI___wait4 (pid=7, stat_loc=0x7ffcc30467cc, options=0, usage=0x0)
    at ../sysdeps/unix/sysv/linux/wait4.c:27
27      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.

(gdb) backtrace
#0  0x00007f0f39222c7f in __GI___wait4 (pid=7, stat_loc=0x7ffcc30467cc, options=0, usage=0x0)
    at ../sysdeps/unix/sysv/linux/wait4.c:27
#1  0x000055fb50a46b9a in main (argc=3, argv=0x7ffcc30469e8)
    at /tmp/drone-q2IcpPwEcM8EB3pN/drone/src/src/controller/controller.cpp:677
# gdb MistController 26367
...
For help, type "help".
Type "apropos word" to search for commands related to "word"...
MistController: No such file or directory.
Attaching to process 26367
[New LWP 26369]
[New LWP 26370]
[New LWP 26371]
[New LWP 26605]
[New LWP 26606]
[New LWP 26607]
[New LWP 26609]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
0x00007f0f3969b4ff in __libc_accept (fd=0, addr=..., addr@entry=..., len=len@entry=0x7ffcc3045b48)
    at ../sysdeps/unix/sysv/linux/accept.c:26
26      ../sysdeps/unix/sysv/linux/accept.c: No such file or directory.

(gdb) backtrace
#0  0x00007f0f3969b4ff in __libc_accept (fd=0, addr=..., addr@entry=..., len=len@entry=0x7ffcc3045b48)
    at ../sysdeps/unix/sysv/linux/accept.c:26
#1  0x000055fb50af2995 in Socket::Server::accept (this=this@entry=0x7ffcc3045e50, 
    nonblock=nonblock@entry=false) at /tmp/drone-q2IcpPwEcM8EB3pN/drone/src/lib/socket.cpp:1512
#2  0x000055fb50aa5199 in Util::Config::threadServer (this=this@entry=0x55fb50ed81a0 <Controller::conf>, 
    server_socket=..., callback=callback@entry=
    0x55fb50a92db0 <Controller::handleAPIConnection(Socket::Connection&)>)
    at /tmp/drone-q2IcpPwEcM8EB3pN/drone/src/lib/config.cpp:457
#3  0x000055fb50aa55e6 in Util::Config::serveThreadedSocket (this=0x55fb50ed81a0 <Controller::conf>, 
    callback=0x55fb50a92db0 <Controller::handleAPIConnection(Socket::Connection&)>)
    at /tmp/drone-q2IcpPwEcM8EB3pN/drone/src/lib/config.cpp:521
#4  0x000055fb50a4c5e4 in main_loop (argc=<optimized out>, argv=<optimized out>)
    at /tmp/drone-q2IcpPwEcM8EB3pN/drone/src/src/controller/controller.cpp:589
#5  0x000055fb50a46d05 in main (argc=3, argv=0x7ffcc30469e8)
    at /tmp/drone-q2IcpPwEcM8EB3pN/drone/src/src/controller/controller.cpp:668

leszko avatar Aug 01 '22 14:08 leszko

@Thulinma any news on this issue? Do you know where it might come from?

leszko avatar Aug 18 '22 12:08 leszko