ejabberd icon indicating copy to clipboard operation
ejabberd copied to clipboard

ejabberd crashes after a few minutes

Open jaymzh opened this issue 2 years ago • 58 comments

This seems to be roughly identical to https://github.com/processone/ejabberd/issues/3457 - but that's 2 years old, so re-opening it seemed bad form.

After the recent upgrade to 23.01, my ejabberd crashes, seemingly in due to the memsup system crashing.

Environment

  • ejabberd version: 23.01
  • Erlang version: Erlang (SMP,ASYNC_THREADS) (BEAM) emulator version 13.1.5
  • OS: Linux (Debian)
  • Installed from: distro package

Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml

loglevel: 2
log_rotate_size: infinity
hosts:
  - "ipom.com"
certfiles:
  - "/etc/letsencrypt/live/www.ipom.com/fullchain.pem"
  - "/etc/letsencrypt/live/www.ipom.com/privkey.pem"
acme:
  auto: false
define_macro:
  'TLS_CIPHERS': "HIGH:!aNULL:!eNULL:!3DES:@STRENGTH"
  'TLS_OPTIONS':
    - "no_sslv3"
    - "no_tlsv1"
    - "no_tlsv1_1"
    - "cipher_server_preference"
    - "no_compression"
c2s_ciphers: 'TLS_CIPHERS'
s2s_ciphers: 'TLS_CIPHERS'
c2s_protocol_options: 'TLS_OPTIONS'
s2s_protocol_options: 'TLS_OPTIONS'
listen: 
  - 
    port: 5222
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    starttls: true
    starttls_required: true
    protocol_options: 'TLS_OPTIONS'
  - 
    port: 5269
    ip: "::"
    module: ejabberd_s2s_in
disable_sasl_mechanisms:
  - "digest-md5"
  - "X-OAUTH2"
s2s_use_starttls: required
auth_method: internal
resource_conflict: closeold
auth_password_format: scram
shaper:
  normal:
    rate: 3000
    burst_size: 20000
  fast: 200000
shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: admin
    100: all
  c2s_shaper:
    none: admin
    normal: all
  s2s_shaper: fast
max_fsm_queue: 1000
acl:
  local: 
    user_regexp: ""
  loopback:
    ip:
      - "127.0.0.0/8"
access_rules:
  local: 
    allow: local
  c2s: 
    deny: blocked
    allow: all
  announce: 
    allow: admin
  configure: 
    allow: admin
  muc_create: 
    allow: local
  pubsub_createnode: 
    allow: local
  trusted_network: 
    allow: loopback
api_permissions:
  "console commands":
    from:
      - ejabberd_ctl
    who: all
    what: "*"
  "admin access":
    who:
      access:
        allow:
          - acl: loopback
          - acl: admin
      oauth:
        scope: "ejabberd:admin"
        access:
          allow:
            - acl: loopback
            - acl: admin
    what:
      - "*"
      - "!stop"
      - "!start"
  "public commands":
    who:
      ip: 127.0.0.1/8
    what:
      - status
      - connected_users_number
  
language: "en"
modules: 
  mod_adhoc: {}
  mod_admin_extra: {}
  mod_announce: # recommends mod_adhoc
    access: announce
  mod_blocking: {} # requires mod_privacy
  mod_bosh: {}
  mod_caps: {}
  mod_carboncopy: {}
  mod_configure: {} # requires mod_adhoc
  mod_disco: {}
  mod_fail2ban: {}
  mod_http_api: {}
  mod_last: {}
  mod_mqtt: {}
  mod_muc: 
    access:
      - allow
    access_admin:
      - allow: admin
    access_create: muc_create
    access_persistent: muc_create
    access_mam:
      - allow
    default_room_options:
      mam: true
  mod_muc_admin: {}
  mod_offline: 
    access_max_user_messages: max_user_offline_messages
  mod_ping: {}
  mod_privacy: {}
  mod_private: {}
  mod_pubsub: 
    access_createnode: pubsub_createnode
    ignore_pep_from_offline: true
    last_item_cache: false
    plugins: 
      - "flat"
      - "pep" # pep requires mod_caps
    force_node_config:
      "eu.siacs.conversations.axolotl.*":
        access_model: open
      storage:bookmarks:
        access_model: whitelist
  mod_push: {}
  mod_push_keepalive: {}
  mod_register: 
    welcome_message: 
      subject: "Welcome!"
      body: |-
        Hi.
        Welcome to this XMPP server.
    ip_access: trusted_network
    access: register
  mod_roster:
    versioning: true
  mod_s2s_dialback: {}
  mod_shared_roster: {}
  mod_sic: {}
  mod_stream_mgmt:
    resend_on_timeout: if_offline
  mod_stun_disco: {}
  mod_stats: {}
  mod_time: {}
  mod_vcard:
    search: false
  mod_vcard_xupdate: {}
  mod_version: {}

Errors from error.log/crash.log

2023-06-28 21:22:00.228760+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2122316002.1753022465.257150>],dropped}
2023-06-28 22:39:48.010084+00:00 [error] <0.160.0> External eimp process (pid=379215) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.010359+00:00 [error] <0.162.0> External eimp process (pid=379216) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.010574+00:00 [error] <0.156.0> External eimp process (pid=379213) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.013799+00:00 [error] <0.158.0> External eimp process (pid=379214) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:56.499998+00:00 [error] <0.122.0>     application_master: shutdown_error
    ejabberd_app: {prep_stop,[[]]}
    error_info: {timeout,
                    {gen_server,call,
                        [ejabberd_hooks,
                         {delete,config_reloaded,global,ejabberd_listener,
                             config_reloaded,50}]}}
2023-06-28 22:40:21.604890+00:00 [error] <0.503.0>@gen_server:error_info/8:1402 ** Generic server memsup terminating 
** Last message in was {'EXIT',<0.504.0>,{port_died,normal}}
** When Server state == [{data,[{"Timeout",60000}]},
                         {items,{"Memory Usage",
                                 [{"Allocated",7658790912},
                                  {"Total",8311390208}]}},
                         {items,{"Worst Memory User",
                                 [{"Pid",<0.181.0>},{"Memory",3101864}]}}]
** Reason for termination ==
** {port_died,normal}

2023-06-28 22:40:21.606676+00:00 [error] <0.503.0>@proc_lib:crash_report/4:539 CRASH REPORT:
  crasher:
    initial call: memsup:init/1
    pid: <0.503.0>
    registered_name: memsup
    exception exit: {port_died,normal}
      in function  gen_server:handle_common_reply/8 (gen_server.erl, line 1241)
    ancestors: [os_mon_sup,<0.501.0>]
    message_queue_len: 1
    messages: [time_to_collect]
    links: [<0.502.0>]
    dictionary: [{system_memory_high_watermark,set}]
    trap_exit: true
    status: running
    heap_size: 6772
    stack_size: 28
    reductions: 26790
  neighbours:

2023-06-28 22:40:21.611440+00:00 [error] <0.502.0>@supervisor:do_restart/3:736 SUPERVISOR REPORT:
    supervisor: {local,os_mon_sup}
    errorContext: child_terminated
    reason: {port_died,normal}
    offender: [{pid,<0.503.0>},
               {id,memsup},
               {mfargs,{memsup,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,2000},
               {child_type,worker}]

2023-06-28 22:40:23.634790+00:00 [error] <0.88.0>@supervisor:do_terminate/2:905 SUPERVISOR REPORT:
    supervisor: {local,sasl_safe_sup}
    errorContext: shutdown_error
    reason: killed
    offender: [{pid,<0.89.0>},
               {id,alarm_handler},
               {mfargs,{alarm_handler,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,2000},
               {child_type,worker}]

jaymzh avatar Jun 28 '23 23:06 jaymzh

crashes after a few minutes

I guess this error reproduces quite easily: you start ejabberd, and it soon crashes itself. Does it wait always the same time to crash (5 minutes, 15 minutes, ...?), or it seems random?

When this happens, is the CPU or RAM usage too high, or are they reasonably low?

Those questions hopefully will indicate if the crash is provoked by some script/configuration, or if it's an event triggered by the server circumstances.

badlop avatar Jun 30 '23 16:06 badlop

Sometimes it's 30 seconds. Sometimes it's a minute or two. Sometimes 5 or more minutes. However, it's USUALLY 1 minute or so. Here's some data:

2023-06-30T16:07:11.554359+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:26:47.646615+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:28:09.117252+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:29:15.369275+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:30:36.581464+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:31:27.860877+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:32:49.032843+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:33:55.414993+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:35:01.470237+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:36:12.992789+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:42:25.841659+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:43:31.990521+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:45:03.705825+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:46:13.981039+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:47:20.277436+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:47:56.758296+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:54:06.612181+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:54:57.758462+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:56:04.118489+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:56:55.469798+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:58:32.802496+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:59:39.079769+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:01:00.454025+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:01:51.818705+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:02:28.562413+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:03:41.236860+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:04:47.763861+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:05:53.794003+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).

You can see there 19 minutes between the first two, then just over a minute, then roughly every minute for a while, then 7 minutes...

I can't find any indications of running out of RAM or CPU. Load is between 0.03 and 0.08. There's 3G of ram free. There's no OOM killer events.

jaymzh avatar Jun 30 '23 17:06 jaymzh

Hello I see very same behaviour my site

Environment 2core / 8gb ram Ubuntu 22.04.2 LTS

2:135 Cannot ping watchdog: {badmatch,{error,eacces}}

2023-08-01 06:17:52.706699+00:00 [error] <0.347.0>@ejabberd_systemd:handle_info/2:135 Cannot ping watchdog: {badmatch,{error,eacces}}
2023-08-01 06:18:04.010466+00:00 [notice] <0.66.0> SIGTERM received - shutting down

2023-08-01 06:18:04.010748+00:00 [error] <0.155.0> External eimp process (pid=32894) has terminated unexpectedly, restarting in a few seconds
2023-08-01 06:18:04.010871+00:00 [error] <0.153.0> External eimp process (pid=32893) has terminated unexpectedly, restarting in a few seconds
2023-08-01 06:18:04.021968+00:00 [error] <0.347.0>@ejabberd_systemd:handle_cast/2:122 Cannot send STOPPING=1 notification: {badmatch,{error,eacces}}
2023-08-01 06:18:04.022624+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at 10.254.208.140:5222 for ejabberd_c2s
2023-08-01 06:18:04.023243+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5223 for ejabberd_c2s
2023-08-01 06:18:04.023854+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:5269 for ejabberd_s2s_in
2023-08-01 06:18:04.024326+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:5270 for ejabberd_s2s_in
2023-08-01 06:18:04.024799+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5443 for ejabberd_http
2023-08-01 06:18:04.025289+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5280 for ejabberd_http
2023-08-01 06:18:04.021529+00:00 [error] <0.490.0>@gen_server:error_info/7:949 ** Generic server memsup terminating 
** Last message in was {'EXIT',<0.491.0>,{port_died,normal}}
** When Server state == [{data,[{"Timeout",60000}]},
                         {items,{"Memory Usage",
                                 [{"Allocated",2591793152},
                                  {"Total",8323547136}]}},
                         {items,{"Worst Memory User",
                                 [{"Pid",<0.184.0>},{"Memory",55128}]}}]
** Reason for termination ==
** {port_died,normal}

2023-08-01 06:18:04.025770+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting UDP connections at [::]:3479 for ejabberd_stun
2023-08-01 06:18:04.026101+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:1883 for mod_mqtt
2023-08-01 06:18:04.028122+00:00 [error] <0.490.0>@proc_lib:crash_report/4:525 CRASH REPORT:
  crasher:
    initial call: memsup:init/1
    pid: <0.490.0>
    registered_name: memsup
    exception exit: {port_died,normal}
      in function  gen_server:handle_common_reply/8 (gen_server.erl, line 811)
    ancestors: [os_mon_sup,<0.488.0>]
    message_queue_len: 0
    messages: []
    links: [<0.489.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 2586
    stack_size: 29
    reductions: 78671
  neighbours:

2023-08-01 06:18:04.041527+00:00 [error] <0.489.0>@supervisor:do_restart/3:736 SUPERVISOR REPORT:
    supervisor: {local,os_mon_sup}
    errorContext: child_terminated
    reason: {port_died,normal}
    offender: [{pid,<0.490.0>},
               {id,memsup},
               {mfargs,{memsup,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,2000},
               {child_type,worker}]

2023-08-01 06:18:04.078386+00:00 [info] <0.120.0>@ejabberd_app:stop/1:111 ejabberd 21.12-1 is stopped in the node ejabberd@localhost

csrado avatar Aug 01 '23 06:08 csrado

Cannot ping watchdog: {badmatch,{error,eacces}}

Your log starts with that line, that previous cases didn't have. In the erlang documentation it says:

eacces: Missing search or write permissions for the parent directories of Dir.

badlop avatar Aug 02 '23 08:08 badlop

@badlop - any thoughts on the original issue? The recurring crashes are really frustrating.

jaymzh avatar Aug 02 '23 22:08 jaymzh

Disable watchdog in ejabberd.service, it's probably systemd killing it because ejabberd not being able to communicate with watchdog.

prefiks avatar Aug 03 '23 06:08 prefiks

I tried to play around ... install, remove, different version. Once was okay, error not appeared again. When I tried to reproduce again (installed 23.04) meaning setup all environment after install again just failed and getting same error.

when started via ejabberdctl start as root all is okay - no error.

When starting as service or systemctl

systemctl start ejabberd

getting

2023-08-03 16:03:25.269923+00:00 [error] <0.437.0>@ejabberd_systemd:handle_info/2:135 Cannot ping watchdog: {badmatch,{error,eacces}}

some permission issue, trying to identify but cannot see ...

csrado avatar Aug 03 '23 16:08 csrado

Also see when use systemctl start ejabberd it just hangs, not release prompt

when do ctrl-c will get prompt

root@ub01-ds4:/etc/systemd/system# ps -ef | grep ejabberd root 56685 8641 0 16:00 pts/3 00:00:00 tail -f ejabberd.log ejabberd 56738 1 0 16:02 ? 00:00:00 /bin/sh /opt/ejabberd-23.04/bin/ejabberdctl foreground ejabberd 56749 56738 12 16:02 ? 00:00:18 /opt/ejabberd-23.04/erts-13.2/bin/beam.smp -K true -P 250000 -- -root /opt/ejabberd-23.04 -bindir /opt/ejabberd-23.04/erts-13.2/bin -progname erl -- -home /opt/ejabberd -- -sname ejabberd@localhost -boot ../releases/23.4.0/start_clean -boot_var RELEASE_LIB ../lib -mnesia dir "/opt/ejabberd/database/ejabberd@localhost" -s ejabberd -noshell -noinput ejabberd 56762 1 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/epmd -daemon ejabberd 56766 56749 0 16:02 ? 00:00:00 erl_child_setup 65536 ejabberd 56785 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/eimp-1.0.22/priv/bin/eimp ejabberd 56786 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/eimp-1.0.22/priv/bin/eimp ejabberd 56787 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/epam-1.0.12/priv/bin/epam ejabberd 56788 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/inet_gethost 4 ejabberd 56789 56788 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/inet_gethost 4 ejabberd 56790 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/os_mon-2.8.1/priv/bin/memsup postgres 56792 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41238) idle postgres 56793 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41242) idle postgres 56794 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41250) idle postgres 56795 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41262) idle postgres 56796 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41268) idle postgres 56797 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41280) idle postgres 56798 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41294) idle postgres 56799 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41310) idle postgres 56800 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41328) idle postgres 56801 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41326) idle root 56849 9138 0 16:04 pts/7 00:00:00 grep --color=auto ejabberd

csrado avatar Aug 03 '23 16:08 csrado

To be clear here, the ORIGINAL issue is, I think, unrelated to the watchdog. I no indication the watchdog is unreachable in my logs.

jaymzh avatar Aug 03 '23 20:08 jaymzh

@weiss that Connor had to do with the watchdog, and while @csrado output was the watchdog, mine (the original bug report) doesn't mention the watchdog. Do you think this commit fixes my/the original bug?

jaymzh avatar Aug 09 '23 15:08 jaymzh

@jaymzh ah you're right, that's most probably unrelated (and @csrado's issue sounds like yet another thing, related to the watchdog but not to my fix).

I'll reopen your issue, sorry.

weiss avatar Aug 09 '23 15:08 weiss

It looks like this is also probably also from being killed by systemd (all those ports dying looks like external apps run as services getting killed, and because it looks like that those are all the drivers that we start, i would guess that systemd killing process spawned by ejabberd.service are most likely what did that).

prefiks avatar Aug 09 '23 16:08 prefiks

@jaymzh, if you'd like to test that fix, you could make a backup of the file

/usr/lib/x86_64-linux-gnu/ejabberd-23.01-1/ebin/ejabberd_systemd.beam

and then replace it with this one. Restart ejabberd and see whether you can still reproduce the problem. (It's built for Erlang/OTP 25.x, so it won't necessarily work for others.)

weiss avatar Aug 09 '23 16:08 weiss

OK, did that. I'll let you know how it goes...

jaymzh avatar Aug 09 '23 17:08 jaymzh

It's still crashing.

BUT... I just realized, error.log has nothing of use in it - the logs I copied and pasted above, while from that date, don't seem to be the actual reason it's crashing, as that day was the only day it was logged. error.log only has this after that:

2023-06-30 18:11:51.850714+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2524952854.1901395970.103039>],dropped}
2023-07-01 22:37:47.048831+00:00 [error] <0.365.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.4180380941.2434859010.238508>],dropped}
2023-07-04 21:11:23.334292+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2021114025.138215425.185961>],dropped}
2023-07-05 20:04:17.233415+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.1227170254.3035430913.28121>],dropped}
2023-07-05 21:43:06.903055+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.3898408711.634716163.122272>],dropped}
2023-07-14 15:36:32.890722+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.1932254270.3659857923.247972>],dropped}
2023-07-27 22:00:42.434301+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2250007605.1785790465.240268>],dropped}
2023-08-08 20:51:28.751275+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2106956634.3650682881.157422>],dropped}

But that doesn't seem to be the issue since it fails every few minutes and that's only every few days.

In ejabberd.log, things appear to be progressing normally... logs about local routes then suddenly it's starting again:

2023-08-09 17:17:46.540490+00:00 [debug] <0.539.0>@ejabberd_s2s:route/1:324 Local route:
...
2023-08-09 17:17:46.540539+00:00 [debug] <0.539.0>@ejabberd_s2s:start_connection/3:362 Finding connection for {<<"ipom.com">>,<<"gmail.com">>}
2023-08-09 17:17:46.540567+00:00 [debug] <0.539.0>@ejabberd_s2s:choose_pid/2:412 Using ejabberd_s2s_out <0.542.0>
2023-08-09 17:17:46.540592+00:00 [debug] <0.539.0>@ejabberd_s2s:route/1:329 Sending to process <0.542.0>
2023-08-09 17:18:19.529779+00:00 [info] <0.122.0>@ejabberd_config:load/1:82 Loading configuration from /etc/ejabberd/ejabberd.yml
2023-08-09 17:18:19.539887+00:00 [notice] <0.122.0>@ejabberd_logger:set/1:374 Changing loglevel from 'info' to 'debug'
2023-08-09 17:18:19.718257+00:00 [warning] <0.122.0>@gen_mod:warn_soft_dep_fail/2:582 Module mod_mam is recommended for module mod_muc but is not found in the config
2023-08-09 17:18:19.724383+00:00 [warning] <0.122.0>@gen_mod:warn_soft_dep_fail/2:582 Module mod_client_state is recommended for module mod_push_keepalive but is not found in the config
2023-08-09 17:18:19.730664+00:00 [info] <0.122.0>@ejabberd_config:load/1:89 Configuration loaded successfully
2023-08-09 17:18:19.763154+00:00 [debug] <0.73.0>@supervisor:report_progress/2:1557 PROGRESS REPORT:
...

jaymzh avatar Aug 09 '23 17:08 jaymzh

Could you also check syslog/kernlog to see if there is no mentions on beam process?

prefiks avatar Aug 09 '23 17:08 prefiks

And the journalctl -u ejabberd output should include a line such as the following if it's related to the watchdog:

ejabberd.service: Watchdog timeout (limit 30s)!

weiss avatar Aug 09 '23 17:08 weiss

Oh yeah, in syslog:

2023-08-09T17:26:36.040824+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:27:42.290679+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:29:05.541971+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:29:56.790673+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:31:18.042185+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:32:09.290852+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:33:30.540562+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:57:30.041086+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!

That isn't in the journal though, fwiw. I think that's just because I have debug enabled so the journal has cycled around.

jaymzh avatar Aug 09 '23 17:08 jaymzh

Do you get output from grep 'ejabberd_systemd:maybe_start_timer' ejabberd.log? (That's just to double-check the patch is in effect.)

weiss avatar Aug 09 '23 18:08 weiss

Yup:

# grep 'ejabberd_systemd:maybe_start_timer' ejabberd.log | tail -n 10
2023-08-09 17:25:20.903814+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:26:42.195441+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:27:48.360714+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:29:11.645013+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:30:02.955898+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:31:24.152766+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:32:15.387876+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:33:36.599136+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:57:36.100613+00:00 [info] <0.365.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 18:11:00.430281+00:00 [info] <0.365.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled

jaymzh avatar Aug 09 '23 18:08 jaymzh

Thanks, could you grep for ejabberd_systemd (with loglevel set to debug) and show all output between two crashes?

weiss avatar Aug 09 '23 18:08 weiss

in ejabberd.log?

2023-08-09 18:42:50.883885+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:42:51.200821+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:43:05.884344+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:43:20.885429+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:43:57.141885+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:43:57.143461+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:43:57.143534+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:43:57.469138+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:44:12.144167+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:44:48.312132+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:44:48.313653+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:44:48.313731+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:44:48.600797+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:45:03.314257+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:45:18.315235+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:45:33.316238+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:09.602756+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:46:09.604392+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:46:09.604471+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:46:09.898492+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:46:24.605393+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:39.606416+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:54.607429+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:47:14.288452+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1

jaymzh avatar Aug 09 '23 18:08 jaymzh

Okay, thank you. So at some point ejabberd simply stops pinging the watchdog indeed.

As @prefiks said above, you can just disable the systemd watchdog for the moment to make the issue go away.

It seems you can reproduce the issue more easily than others, so I might come back with more debugging wishes, but for the next two weeks I'm on vacation.

weiss avatar Aug 09 '23 19:08 weiss

Ah missed that. Disabled the watchdog with an override file. Thanks!

jaymzh avatar Aug 09 '23 19:08 jaymzh

Did you run an older version of ejabberd with a similar load before, without running into that issue?

weiss avatar Aug 10 '23 11:08 weiss

My 'load' is 1 client ;)

And yeah, it was fine for the last 10 years belly the upgrade.

jaymzh avatar Aug 10 '23 12:08 jaymzh

My 'load' is 1 client ;)

And the OS is bored as well?

Either way, seems you can reproduce the issue better than others, who run into it only very occasionally. Do you happen to be up to some debugging, such as me giving you another patched ejabberd_systemd.beam file for testing, and/or running some calls within an ejabberdctl debug shell while reproducing the issue? If so, you could ping me by email (or XMPP): [email protected] (and we could then figure out some way to chat directly).

weiss avatar Sep 01 '23 11:09 weiss

Yeah happy to - dropped in an updated ejabberd_systemd.beam from you above already, so happy to do more. I'll reach out.

jaymzh avatar Sep 04 '23 19:09 jaymzh