
ejabberd container shuts down after SIGTERM signal

etranger7 opened this issue 1 year ago • 6 comments

Environment

  • ejabberd version: 24.07
  • Erlang version: Erlang (SMP,ASYNC_THREADS) (BEAM) emulator version 14.2.1
  • OS: Linux (Debian 12)
  • Installed from: https://hub.docker.com/r/ejabberd/ecs/ (although the same happens with ghcr.io/processone/ejabberd)

Configuration

We have our custom config but also tried running the container without providing any config file and in both cases we encountered this issue. So I don't want to crowd this space with my config unless necessary.

Errors

Nothing in the error.log within an hour of the crash.

Bug description

The container shuts down gracefully after receiving a SIGTERM signal. I couldn't find anything in the container or Docker daemon logs that would trigger this. However, journalctl returned the entries below right when the container received the SIGTERM signal, both times that I investigated.

Sep 24 22:32:54 vps-d97c0e90 sshd[453805]: Timeout, client not responding from user debian [IP removed] port 55848                                
Sep 24 22:32:54 vps-d97c0e90 sshd[453798]: pam_unix(sshd:session): session closed for user debian                                                  
Sep 24 22:32:54 vps-d97c0e90 systemd-logind[537]: Session 26 logged out. Waiting for processes to exit.                                            
Sep 24 22:32:54 vps-d97c0e90 su[453813]: pam_unix(su:session): session closed for user root                                                        
Sep 24 22:32:54 vps-d97c0e90 sudo[453811]: pam_unix(sudo:session): session closed for user root                                                    
Sep 24 22:32:55 vps-d97c0e90 systemd[1]: docker-142eb815f597a29d4844c61ce33199ed669f3cfad01bbf4c1010c422010dd8ac.scope: Deactivated successfully.  
Sep 24 22:32:55 vps-d97c0e90 systemd[1]: docker-142eb815f597a29d4844c61ce33199ed669f3cfad01bbf4c1010c422010dd8ac.scope: Consumed 30.948s CPU time

The time between container start and this occurrence ranges from 20 minutes to 22 hours. I'm submitting the issue here because it happens with both versions of the container.

Any ideas on how the above could be related to ejabberd shutdown would be much appreciated.

etranger7 avatar Sep 25 '24 14:09 etranger7

possibly same as #4054 (systemd watchdog related), but you haven't provided ANY logs from ejabberd so who knows

nosnilmot avatar Sep 25 '24 16:09 nosnilmot

Thank you for your reply @nosnilmot

I'm not sure how much I need to provide, but here are the logs from around the time the SIGTERM is received:

2024-09-24 22:32:37.972712+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_filter_send: mod_client_state:filter_other/1
2024-09-24 22:32:37.972882+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_handle_send: mod_push:c2s_stanza/3
2024-09-24 22:32:37.972938+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_handle_send: mod_push_keepalive:c2s_stanza/3
2024-09-24 22:32:37.973013+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_handle_send: mod_stream_mgmt:c2s_handle_send/3
2024-09-24 22:32:38.049730+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_handle_recv: mod_stream_mgmt:c2s_handle_recv/3
2024-09-24 22:32:38.049909+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_authenticated_packet: mod_client_state:c2s_authenticated_packet/2
2024-09-24 22:32:38.050029+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_authenticated_packet: mod_stream_mgmt:c2s_authenticated_packet/2
2024-09-24 22:32:38.050118+00:00 [debug] <0.915.0>@mod_stream_mgmt:check_h_attribute/2:531 [email protected]/129656514319119432402978 acknowledged 44 of 44 stanzas
2024-09-24 22:32:50.280564+00:00 [info] <0.561.0>@ejabberd_listener:accept/7:346 (<0.987.0>) Accepted connection 5.39.15.152:15556 -> 172.18.0.2:5281
2024-09-24 22:32:50.280678+00:00 [debug] <0.987.0>@ejabberd_http:init/3:130 S: [{[<<"api">>],mod_http_api}]

2024-09-24 22:32:54.344417+00:00 [notice] <0.70.0> SIGTERM received - shutting down

2024-09-24 22:32:54.526859+00:00 [debug] <0.198.0>@ejabberd_systemd:cast_notification/1:188 Closing NOTIFY_SOCKET
2024-09-24 22:32:54.526182+00:00 [notice] <0.70.0> SIGTERM received - shutting down

2024-09-24 22:32:54.526999+00:00 [debug] <0.529.0>@ejabberd_systemd:handle_cast/2:107 No NOTIFY_SOCKET, dropping STOPPING=1 notification
2024-09-24 22:32:54.527283+00:00 [info] <0.198.0>@ejabberd_listener:stop_listener/3:478 Stop accepting TCP connections at [::]:5222 for ejabberd_c2s
2024-09-24 22:32:54.527608+00:00 [info] <0.198.0>@ejabberd_listener:stop_listener/3:478 Stop accepting TLS connections at 0.0.0.0:5269 for ejabberd_s2s_in
2024-09-24 22:32:54.527814+00:00 [info] <0.198.0>@ejabberd_listener:stop_listener/3:478 Stop accepting TCP connections at 0.0.0.0:5280 for ejabberd_http
2024-09-24 22:32:54.528000+00:00 [info] <0.198.0>@ejabberd_listener:stop_listener/3:478 Stop accepting TCP connections at 0.0.0.0:5281 for ejabberd_http
2024-09-24 22:32:54.528182+00:00 [info] <0.198.0>@ejabberd_listener:stop_listener/3:478 Stop accepting TLS connections at 0.0.0.0:5443 for ejabberd_http
2024-09-24 22:32:54.528728+00:00 [debug] <0.962.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_handle_send: mod_push:c2s_stanza/3
2024-09-24 22:32:54.528748+00:00 [debug] <0.744.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_handle_send: mod_push:c2s_stanza/3
2024-09-24 22:32:54.528867+00:00 [debug] <0.915.0>@ejabberd_hooks:safe_apply/4:451 Running hook c2s_handle_send: mod_push:c2s_stanza/3

and then ejabberd follows a shutdown sequence stopping things, (there are hundreds of lines in between the above logs and the ones below) and ending with:

2024-09-24 22:32:54.563754+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_stream_mgmt at sub.domain.com
2024-09-24 22:32:54.563910+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_s2s_dialback at sub.domain.com
2024-09-24 22:32:54.567675+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_http_api at sub.domain.com
2024-09-24 22:32:54.567929+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_sip at sub.domain.com
2024-09-24 22:32:54.568053+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_http_upload at sub.domain.com
2024-09-24 22:32:54.568415+00:00 [debug] <0.623.0>@mod_http_upload:terminate/2:511 Stopping HTTP upload process for sub.domain.com: shutdown
2024-09-24 22:32:54.568789+00:00 [debug] <0.623.0>@ejabberd_router:unregister_route/2:204 Route unregistered: upload.sub.domain.com
2024-09-24 22:32:54.568988+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_roster at sub.domain.com
2024-09-24 22:32:54.569162+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_client_state at sub.domain.com
2024-09-24 22:32:54.569470+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_register at sub.domain.com
2024-09-24 22:32:54.569575+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_version at sub.domain.com
2024-09-24 22:32:54.569629+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_pubsub at sub.domain.com
2024-09-24 22:32:54.570256+00:00 [debug] <0.616.0>@ejabberd_router:unregister_route/2:204 Route unregistered: pubsub.sub.domain.com
2024-09-24 22:32:54.570508+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_caps at sub.domain.com
2024-09-24 22:32:54.571081+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_admin_extra at sub.domain.com
2024-09-24 22:32:54.572236+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_offline at sub.domain.com
2024-09-24 22:32:54.572528+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_announce at sub.domain.com
2024-09-24 22:32:54.572850+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_mam at sub.domain.com
2024-09-24 22:32:54.573474+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_ping at sub.domain.com
2024-09-24 22:32:54.573795+00:00 [debug] <0.198.0>@gen_mod:stop_module_keep_config/2:319 Stopping mod_adhoc at sub.domain.com
2024-09-24 22:32:54.574677+00:00 [debug] <0.590.0>@ejabberd_router:unregister_route/2:204 Route unregistered: sub.domain.com
2024-09-24 22:32:54.581445+00:00 [debug] <0.529.0>@ejabberd_systemd:terminate/2:133 Terminating ejabberd_systemd (shutdown)
2024-09-24 22:32:54.581583+00:00 [debug] <0.529.0>@ejabberd_systemd:cancel_timer/1:177 Cancelling watchdog timer
2024-09-24 22:32:54.581832+00:00 [info] <0.198.0>@ejabberd_app:stop/1:112 ejabberd 24.7.0 is stopped in the node ejabberd@xmpp

etranger7 avatar Sep 25 '24 17:09 etranger7

I do not think that systemd is involved here, because you said you run it in a container, whether under Docker, Podman, or any other orchestrator. The SIGTERM would be coming from the respective container orchestrator. In Kubernetes, for example, this happens if you restart a Deployment, a container fails its LivenessProbe, etc.

So from my point of view, the SIGTERM was triggered by the container orchestrator, and I am not yet sure whether this issue is really related to the container image here.

The only thing I can imagine that would make the orchestrator send a SIGTERM is the healthcheck probe configured in the container image failing and the orchestrator taking action upon it. But we would need to check, e.g., the Docker daemon logs to find out about such events.
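For reference, a compose-level healthcheck along those lines could look like the hypothetical snippet below (the real healthcheck ships inside the image; `ejabberd/ecs:24.07` and the `ejabberdctl status` probe here are illustrative assumptions). Note that plain Docker only marks a container `unhealthy` on probe failure; an actual restart requires something acting on that state, such as Swarm or a watcher container.

```yaml
# Hypothetical docker-compose.yml fragment, not the image's actual healthcheck.
services:
  ejabberd:
    image: ejabberd/ecs:24.07
    healthcheck:
      test: ["CMD", "ejabberdctl", "status"]  # non-zero exit if the node is down
      interval: 30s
      timeout: 10s
      retries: 3
```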

What orchestrator do you use? And would you care to share, e.g., your docker-compose.yml snippet, so we can try to reproduce?

Thank you 👍

sando38 avatar Sep 26 '24 17:09 sando38

I do not think that systemd is involved here, because you said you run it in a container

ah yes, the originally provided logs that mentioned systemd are host logs.

it looks to me like the docker container is started in the foreground from an ssh session, and then gets terminated when that ssh session disconnects, as would be expected

nosnilmot avatar Sep 26 '24 18:09 nosnilmot

Makes sense. I just checked and when I intentionally stop the container, the logs also show SIGTERM followed by a bunch of services being stopped and ending with the "cancelling watchdog timer" message.
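That graceful-shutdown pattern is generic: a foreground entrypoint traps SIGTERM and runs its cleanup before exiting, which is why `docker stop` and a session teardown produce the same log sequence. A minimal shell sketch of the same mechanism (nothing ejabberd-specific):

```shell
# Simulate a SIGTERM-aware foreground service, like a container entrypoint.
msg=$(
  # the "service": installs a TERM trap, then blocks on a background sleep
  sh -c 'trap "echo SIGTERM received - shutting down; exit 0" TERM
         sleep 60 & wait $!' &
  pid=$!
  sleep 1             # give the child time to install its trap
  kill -TERM "$pid"   # what "docker stop" / session cleanup delivers
  wait "$pid"         # returns once the trap has run and the child exited
)
echo "$msg"
```

The `sleep 60 & wait $!` idiom matters: `wait` is interruptible by a trapped signal, so the trap fires immediately instead of after the sleep finishes.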

@sando38 I don't mind sharing the compose file, but I'll hold off until this happens again. After happening back to back a ton of times, the container is now stable and has been up for over 30 hours for the first time. I just don't want to take up your time trying to reproduce it if it's no longer happening even on my end.

etranger7 avatar Sep 27 '24 22:09 etranger7

After further investigation, we concluded that this is caused by using docker compose up instead of docker compose up -d. We're running Docker in non-root mode ( "userns-remap": "default" ), which might be contributing to this as well.

It looks like once the SSH session of the root user that ran compose up expires, the kernel kills the processes related to it, which then translates into the SIGTERM.

I was using compose up and compose up -d interchangeably for a long time without issues, but since finding this connection I no longer use compose up without -d, and things are working fine.
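For anyone hitting the same thing, the difference boils down to whether the compose process stays attached to the terminal session (illustrative commands, not specific to this setup):

```shell
# Attached: compose runs in the foreground of the SSH session, and when
# the session is cleaned up the containers receive SIGTERM.
docker compose up

# Detached: containers keep running independently of the session.
docker compose up -d

# Logs can still be followed afterwards without tying the container
# lifecycle back to the terminal:
docker compose logs -f
```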

I don't think this is solely related to ejabberd, so I'm closing this issue. Thank you for your help in narrowing down the possibilities and figuring this out.

etranger7 avatar Oct 31 '24 19:10 etranger7