ejabberd crashes after a few minutes
This seems to be roughly identical to https://github.com/processone/ejabberd/issues/3457 - but that's 2 years old, so re-opening it seemed bad form.
After the recent upgrade to 23.01, my ejabberd crashes, seemingly in due to the memsup system crashing.
Environment
- ejabberd version: 23.01
- Erlang version:
Erlang (SMP,ASYNC_THREADS) (BEAM) emulator version 13.1.5 - OS: Linux (Debian)
- Installed from: distro package
Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml
loglevel: 2
log_rotate_size: infinity
hosts:
- "ipom.com"
certfiles:
- "/etc/letsencrypt/live/www.ipom.com/fullchain.pem"
- "/etc/letsencrypt/live/www.ipom.com/privkey.pem"
acme:
auto: false
define_macro:
'TLS_CIPHERS': "HIGH:!aNULL:!eNULL:!3DES:@STRENGTH"
'TLS_OPTIONS':
- "no_sslv3"
- "no_tlsv1"
- "no_tlsv1_1"
- "cipher_server_preference"
- "no_compression"
c2s_ciphers: 'TLS_CIPHERS'
s2s_ciphers: 'TLS_CIPHERS'
c2s_protocol_options: 'TLS_OPTIONS'
s2s_protocol_options: 'TLS_OPTIONS'
listen:
-
port: 5222
ip: "::"
module: ejabberd_c2s
max_stanza_size: 262144
shaper: c2s_shaper
access: c2s
starttls: true
starttls_required: true
protocol_options: 'TLS_OPTIONS'
-
port: 5269
ip: "::"
module: ejabberd_s2s_in
disable_sasl_mechanisms:
- "digest-md5"
- "X-OAUTH2"
s2s_use_starttls: required
auth_method: internal
resource_conflict: closeold
auth_password_format: scram
shaper:
normal:
rate: 3000
burst_size: 20000
fast: 200000
shaper_rules:
max_user_sessions: 10
max_user_offline_messages:
5000: admin
100: all
c2s_shaper:
none: admin
normal: all
s2s_shaper: fast
max_fsm_queue: 1000
acl:
local:
user_regexp: ""
loopback:
ip:
- "127.0.0.0/8"
access_rules:
local:
allow: local
c2s:
deny: blocked
allow: all
announce:
allow: admin
configure:
allow: admin
muc_create:
allow: local
pubsub_createnode:
allow: local
trusted_network:
allow: loopback
api_permissions:
"console commands":
from:
- ejabberd_ctl
who: all
what: "*"
"admin access":
who:
access:
allow:
- acl: loopback
- acl: admin
oauth:
scope: "ejabberd:admin"
access:
allow:
- acl: loopback
- acl: admin
what:
- "*"
- "!stop"
- "!start"
"public commands":
who:
ip: 127.0.0.1/8
what:
- status
- connected_users_number
language: "en"
modules:
mod_adhoc: {}
mod_admin_extra: {}
mod_announce: # recommends mod_adhoc
access: announce
mod_blocking: {} # requires mod_privacy
mod_bosh: {}
mod_caps: {}
mod_carboncopy: {}
mod_configure: {} # requires mod_adhoc
mod_disco: {}
mod_fail2ban: {}
mod_http_api: {}
mod_last: {}
mod_mqtt: {}
mod_muc:
access:
- allow
access_admin:
- allow: admin
access_create: muc_create
access_persistent: muc_create
access_mam:
- allow
default_room_options:
mam: true
mod_muc_admin: {}
mod_offline:
access_max_user_messages: max_user_offline_messages
mod_ping: {}
mod_privacy: {}
mod_private: {}
mod_pubsub:
access_createnode: pubsub_createnode
ignore_pep_from_offline: true
last_item_cache: false
plugins:
- "flat"
- "pep" # pep requires mod_caps
force_node_config:
"eu.siacs.conversations.axolotl.*":
access_model: open
storage:bookmarks:
access_model: whitelist
mod_push: {}
mod_push_keepalive: {}
mod_register:
welcome_message:
subject: "Welcome!"
body: |-
Hi.
Welcome to this XMPP server.
ip_access: trusted_network
access: register
mod_roster:
versioning: true
mod_s2s_dialback: {}
mod_shared_roster: {}
mod_sic: {}
mod_stream_mgmt:
resend_on_timeout: if_offline
mod_stun_disco: {}
mod_stats: {}
mod_time: {}
mod_vcard:
search: false
mod_vcard_xupdate: {}
mod_version: {}
Errors from error.log/crash.log
2023-06-28 21:22:00.228760+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2122316002.1753022465.257150>],dropped}
2023-06-28 22:39:48.010084+00:00 [error] <0.160.0> External eimp process (pid=379215) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.010359+00:00 [error] <0.162.0> External eimp process (pid=379216) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.010574+00:00 [error] <0.156.0> External eimp process (pid=379213) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.013799+00:00 [error] <0.158.0> External eimp process (pid=379214) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:56.499998+00:00 [error] <0.122.0> application_master: shutdown_error
ejabberd_app: {prep_stop,[[]]}
error_info: {timeout,
{gen_server,call,
[ejabberd_hooks,
{delete,config_reloaded,global,ejabberd_listener,
config_reloaded,50}]}}
2023-06-28 22:40:21.604890+00:00 [error] <0.503.0>@gen_server:error_info/8:1402 ** Generic server memsup terminating
** Last message in was {'EXIT',<0.504.0>,{port_died,normal}}
** When Server state == [{data,[{"Timeout",60000}]},
{items,{"Memory Usage",
[{"Allocated",7658790912},
{"Total",8311390208}]}},
{items,{"Worst Memory User",
[{"Pid",<0.181.0>},{"Memory",3101864}]}}]
** Reason for termination ==
** {port_died,normal}
2023-06-28 22:40:21.606676+00:00 [error] <0.503.0>@proc_lib:crash_report/4:539 CRASH REPORT:
crasher:
initial call: memsup:init/1
pid: <0.503.0>
registered_name: memsup
exception exit: {port_died,normal}
in function gen_server:handle_common_reply/8 (gen_server.erl, line 1241)
ancestors: [os_mon_sup,<0.501.0>]
message_queue_len: 1
messages: [time_to_collect]
links: [<0.502.0>]
dictionary: [{system_memory_high_watermark,set}]
trap_exit: true
status: running
heap_size: 6772
stack_size: 28
reductions: 26790
neighbours:
2023-06-28 22:40:21.611440+00:00 [error] <0.502.0>@supervisor:do_restart/3:736 SUPERVISOR REPORT:
supervisor: {local,os_mon_sup}
errorContext: child_terminated
reason: {port_died,normal}
offender: [{pid,<0.503.0>},
{id,memsup},
{mfargs,{memsup,start_link,[]}},
{restart_type,permanent},
{significant,false},
{shutdown,2000},
{child_type,worker}]
2023-06-28 22:40:23.634790+00:00 [error] <0.88.0>@supervisor:do_terminate/2:905 SUPERVISOR REPORT:
supervisor: {local,sasl_safe_sup}
errorContext: shutdown_error
reason: killed
offender: [{pid,<0.89.0>},
{id,alarm_handler},
{mfargs,{alarm_handler,start_link,[]}},
{restart_type,permanent},
{significant,false},
{shutdown,2000},
{child_type,worker}]
crashes after a few minutes
I guess this error reproduces quite easily: you start ejabberd, and it soon crashes itself. Does it wait always the same time to crash (5 minutes, 15 minutes, ...?), or it seems random?
When this happens, is the CPU or RAM usage too high, or are they reasonably low?
Those questions hopefully will indicate if the crash is provoked by some script/configuration, or if it's an event triggered by the server circumstances.
Sometimes it's 30 seconds. Sometimes it's a minute or two. Sometimes 5 or more minutes. However, it's USUALLY 1 minute or so. Here's some data:
2023-06-30T16:07:11.554359+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:26:47.646615+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:28:09.117252+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:29:15.369275+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:30:36.581464+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:31:27.860877+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:32:49.032843+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:33:55.414993+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:35:01.470237+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:36:12.992789+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:42:25.841659+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:43:31.990521+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:45:03.705825+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:46:13.981039+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:47:20.277436+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:47:56.758296+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:54:06.612181+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:54:57.758462+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:56:04.118489+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:56:55.469798+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:58:32.802496+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:59:39.079769+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:01:00.454025+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:01:51.818705+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:02:28.562413+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:03:41.236860+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:04:47.763861+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:05:53.794003+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
You can see there 19 minutes between the first two, then just over a minute, then roughly every minute for a while, then 7 minutes...
I can't find any indications of running out of RAM or CPU. Load is between 0.03 and 0.08. There's 3G of ram free. There's no OOM killer events.
Hello I see very same behaviour my site
Environment 2core / 8gb ram Ubuntu 22.04.2 LTS
2:135 Cannot ping watchdog: {badmatch,{error,eacces}}
2023-08-01 06:17:52.706699+00:00 [error] <0.347.0>@ejabberd_systemd:handle_info/2:135 Cannot ping watchdog: {badmatch,{error,eacces}}
2023-08-01 06:18:04.010466+00:00 [notice] <0.66.0> SIGTERM received - shutting down
2023-08-01 06:18:04.010748+00:00 [error] <0.155.0> External eimp process (pid=32894) has terminated unexpectedly, restarting in a few seconds
2023-08-01 06:18:04.010871+00:00 [error] <0.153.0> External eimp process (pid=32893) has terminated unexpectedly, restarting in a few seconds
2023-08-01 06:18:04.021968+00:00 [error] <0.347.0>@ejabberd_systemd:handle_cast/2:122 Cannot send STOPPING=1 notification: {badmatch,{error,eacces}}
2023-08-01 06:18:04.022624+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at 10.254.208.140:5222 for ejabberd_c2s
2023-08-01 06:18:04.023243+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5223 for ejabberd_c2s
2023-08-01 06:18:04.023854+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:5269 for ejabberd_s2s_in
2023-08-01 06:18:04.024326+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:5270 for ejabberd_s2s_in
2023-08-01 06:18:04.024799+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5443 for ejabberd_http
2023-08-01 06:18:04.025289+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5280 for ejabberd_http
2023-08-01 06:18:04.021529+00:00 [error] <0.490.0>@gen_server:error_info/7:949 ** Generic server memsup terminating
** Last message in was {'EXIT',<0.491.0>,{port_died,normal}}
** When Server state == [{data,[{"Timeout",60000}]},
{items,{"Memory Usage",
[{"Allocated",2591793152},
{"Total",8323547136}]}},
{items,{"Worst Memory User",
[{"Pid",<0.184.0>},{"Memory",55128}]}}]
** Reason for termination ==
** {port_died,normal}
2023-08-01 06:18:04.025770+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting UDP connections at [::]:3479 for ejabberd_stun
2023-08-01 06:18:04.026101+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:1883 for mod_mqtt
2023-08-01 06:18:04.028122+00:00 [error] <0.490.0>@proc_lib:crash_report/4:525 CRASH REPORT:
crasher:
initial call: memsup:init/1
pid: <0.490.0>
registered_name: memsup
exception exit: {port_died,normal}
in function gen_server:handle_common_reply/8 (gen_server.erl, line 811)
ancestors: [os_mon_sup,<0.488.0>]
message_queue_len: 0
messages: []
links: [<0.489.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 2586
stack_size: 29
reductions: 78671
neighbours:
2023-08-01 06:18:04.041527+00:00 [error] <0.489.0>@supervisor:do_restart/3:736 SUPERVISOR REPORT:
supervisor: {local,os_mon_sup}
errorContext: child_terminated
reason: {port_died,normal}
offender: [{pid,<0.490.0>},
{id,memsup},
{mfargs,{memsup,start_link,[]}},
{restart_type,permanent},
{significant,false},
{shutdown,2000},
{child_type,worker}]
2023-08-01 06:18:04.078386+00:00 [info] <0.120.0>@ejabberd_app:stop/1:111 ejabberd 21.12-1 is stopped in the node ejabberd@localhost
Cannot ping watchdog: {badmatch,{error,eacces}}
Your log starts with that line, that previous cases didn't have. In the erlang documentation it says:
eacces: Missing search or write permissions for the parent directories of Dir.
@badlop - any thoughts on the original issue? The recurring crashes are really frustrating.
Disable watchdog in ejabberd.service, it's probably systemd killing it because ejabberd not being able to communicate with watchdog.
I tried to play around ... install, remove, different version. Once was okay, error not appeared again. When I tried to reproduce again (installed 23.04) meaning setup all environment after install again just failed and getting same error.
when started via ejabberdctl start as root all is okay - no error.
When starting as service or systemctl
systemctl start ejabberd
getting
2023-08-03 16:03:25.269923+00:00 [error] <0.437.0>@ejabberd_systemd:handle_info/2:135 Cannot ping watchdog: {badmatch,{error,eacces}}
some permission issue, trying to identify but cannot see ...
Also see when use systemctl start ejabberd it just hangs, not release prompt
when do ctrl-c will get prompt
root@ub01-ds4:/etc/systemd/system# ps -ef | grep ejabberd root 56685 8641 0 16:00 pts/3 00:00:00 tail -f ejabberd.log ejabberd 56738 1 0 16:02 ? 00:00:00 /bin/sh /opt/ejabberd-23.04/bin/ejabberdctl foreground ejabberd 56749 56738 12 16:02 ? 00:00:18 /opt/ejabberd-23.04/erts-13.2/bin/beam.smp -K true -P 250000 -- -root /opt/ejabberd-23.04 -bindir /opt/ejabberd-23.04/erts-13.2/bin -progname erl -- -home /opt/ejabberd -- -sname ejabberd@localhost -boot ../releases/23.4.0/start_clean -boot_var RELEASE_LIB ../lib -mnesia dir "/opt/ejabberd/database/ejabberd@localhost" -s ejabberd -noshell -noinput ejabberd 56762 1 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/epmd -daemon ejabberd 56766 56749 0 16:02 ? 00:00:00 erl_child_setup 65536 ejabberd 56785 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/eimp-1.0.22/priv/bin/eimp ejabberd 56786 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/eimp-1.0.22/priv/bin/eimp ejabberd 56787 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/epam-1.0.12/priv/bin/epam ejabberd 56788 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/inet_gethost 4 ejabberd 56789 56788 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/inet_gethost 4 ejabberd 56790 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/os_mon-2.8.1/priv/bin/memsup postgres 56792 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41238) idle postgres 56793 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41242) idle postgres 56794 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41250) idle postgres 56795 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41262) idle postgres 56796 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41268) idle postgres 56797 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41280) idle postgres 56798 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41294) idle postgres 56799 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41310) idle postgres 56800 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41328) idle postgres 56801 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41326) idle root 56849 9138 0 16:04 pts/7 00:00:00 grep --color=auto ejabberd
To be clear here, the ORIGINAL issue is, I think, unrelated to the watchdog. I no indication the watchdog is unreachable in my logs.
@weiss that Connor had to do with the watchdog, and while @csrado output was the watchdog, mine (the original bug report) doesn't mention the watchdog. Do you think this commit fixes my/the original bug?
@jaymzh ah you're right, that's most probably unrelated (and @csrado's issue sounds like yet another thing, related to the watchdog but not to my fix).
I'll reopen your issue, sorry.
It looks like this is also probably also from being killed by systemd (all those ports dying looks like external apps run as services getting killed, and because it looks like that those are all the drivers that we start, i would guess that systemd killing process spawned by ejabberd.service are most likely what did that).
@jaymzh, if you'd like to test that fix, you could make a backup of the file
/usr/lib/x86_64-linux-gnu/ejabberd-23.01-1/ebin/ejabberd_systemd.beam
and then replace it with this one. Restart ejabberd and see whether you can still reproduce the problem. (It's built for Erlang/OTP 25.x, so it won't necessarily work for others.)
OK, did that. I'll let you know how it goes...
It's still crashing.
BUT... I just realized, error.log has nothing of use in it - the logs I copied and pasted above, while from that date, don't seem to be the actual reason it's crashing, as that day was the only day it was logged. error.log only has this after that:
2023-06-30 18:11:51.850714+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2524952854.1901395970.103039>],dropped}
2023-07-01 22:37:47.048831+00:00 [error] <0.365.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.4180380941.2434859010.238508>],dropped}
2023-07-04 21:11:23.334292+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2021114025.138215425.185961>],dropped}
2023-07-05 20:04:17.233415+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.1227170254.3035430913.28121>],dropped}
2023-07-05 21:43:06.903055+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.3898408711.634716163.122272>],dropped}
2023-07-14 15:36:32.890722+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.1932254270.3659857923.247972>],dropped}
2023-07-27 22:00:42.434301+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2250007605.1785790465.240268>],dropped}
2023-08-08 20:51:28.751275+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2106956634.3650682881.157422>],dropped}
But that doesn't seem to be the issue since it fails every few minutes and that's only every few days.
In ejabberd.log, things appear to be progressing normally... logs about local routes then suddenly it's starting again:
2023-08-09 17:17:46.540490+00:00 [debug] <0.539.0>@ejabberd_s2s:route/1:324 Local route:
...
2023-08-09 17:17:46.540539+00:00 [debug] <0.539.0>@ejabberd_s2s:start_connection/3:362 Finding connection for {<<"ipom.com">>,<<"gmail.com">>}
2023-08-09 17:17:46.540567+00:00 [debug] <0.539.0>@ejabberd_s2s:choose_pid/2:412 Using ejabberd_s2s_out <0.542.0>
2023-08-09 17:17:46.540592+00:00 [debug] <0.539.0>@ejabberd_s2s:route/1:329 Sending to process <0.542.0>
2023-08-09 17:18:19.529779+00:00 [info] <0.122.0>@ejabberd_config:load/1:82 Loading configuration from /etc/ejabberd/ejabberd.yml
2023-08-09 17:18:19.539887+00:00 [notice] <0.122.0>@ejabberd_logger:set/1:374 Changing loglevel from 'info' to 'debug'
2023-08-09 17:18:19.718257+00:00 [warning] <0.122.0>@gen_mod:warn_soft_dep_fail/2:582 Module mod_mam is recommended for module mod_muc but is not found in the config
2023-08-09 17:18:19.724383+00:00 [warning] <0.122.0>@gen_mod:warn_soft_dep_fail/2:582 Module mod_client_state is recommended for module mod_push_keepalive but is not found in the config
2023-08-09 17:18:19.730664+00:00 [info] <0.122.0>@ejabberd_config:load/1:89 Configuration loaded successfully
2023-08-09 17:18:19.763154+00:00 [debug] <0.73.0>@supervisor:report_progress/2:1557 PROGRESS REPORT:
...
Could you also check syslog/kernlog to see if there is no mentions on beam process?
And the journalctl -u ejabberd output should include a line such as the following if it's related to the watchdog:
ejabberd.service: Watchdog timeout (limit 30s)!
Oh yeah, in syslog:
2023-08-09T17:26:36.040824+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:27:42.290679+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:29:05.541971+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:29:56.790673+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:31:18.042185+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:32:09.290852+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:33:30.540562+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:57:30.041086+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
That isn't in the journal though, fwiw. I think that's just because I have debug enabled so the journal has cycled around.
Do you get output from grep 'ejabberd_systemd:maybe_start_timer' ejabberd.log? (That's just to double-check the patch is in effect.)
Yup:
# grep 'ejabberd_systemd:maybe_start_timer' ejabberd.log | tail -n 10
2023-08-09 17:25:20.903814+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:26:42.195441+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:27:48.360714+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:29:11.645013+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:30:02.955898+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:31:24.152766+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:32:15.387876+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:33:36.599136+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:57:36.100613+00:00 [info] <0.365.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 18:11:00.430281+00:00 [info] <0.365.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
Thanks, could you grep for ejabberd_systemd (with loglevel set to debug) and show all output between two crashes?
in ejabberd.log?
2023-08-09 18:42:50.883885+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
{id,ejabberd_systemd},
{mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:42:51.200821+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:43:05.884344+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:43:20.885429+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:43:57.141885+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:43:57.143461+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:43:57.143534+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
{id,ejabberd_systemd},
{mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:43:57.469138+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:44:12.144167+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:44:48.312132+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:44:48.313653+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:44:48.313731+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
{id,ejabberd_systemd},
{mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:44:48.600797+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:45:03.314257+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:45:18.315235+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:45:33.316238+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:09.602756+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:46:09.604392+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:46:09.604471+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
{id,ejabberd_systemd},
{mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:46:09.898492+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:46:24.605393+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:39.606416+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:54.607429+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:47:14.288452+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
Okay, thank you. So at some point ejabberd simply stops pinging the watchdog indeed.
As @prefiks said above, you can just disable the systemd watchdog for the moment to make the issue go away.
It seems you can reproduce the issue more easily than others, so I might come back with more debugging wishes, but for the next two weeks I'm on vacation.
Ah missed that. Disabled the watchdog with an override file. Thanks!
Did you run an older version of ejabberd with a similar load before, without running into that issue?
My 'load' is 1 client ;)
And yeah, it was fine for the last 10 years belly the upgrade.
My 'load' is 1 client ;)
And the OS is bored as well?
Either way, seems you can reproduce the issue better than others, who run into it only very occasionally. Do you happen to be up to some debugging, such as me giving you another patched ejabberd_systemd.beam file for testing, and/or running some calls within an ejabberdctl debug shell while reproducing the issue? If so, you could ping me by email (or XMPP): [email protected] (and we could then figure out some way to chat directly).
Yeah happy to - dropped in an updated ejabberd_systemd.beam from you above already, so happy to do more. I'll reach out.