ejabberdctl restart sometimes fails
Happy New Year!
Environment
- ejabberd version: 23.10.0
- Erlang version: 14.1.5
- OS: Devuan GNU/Linux 5 (daedalus)
- Installed from: official dpkg repository from ProcessOne
Errors from error.log/crash.log
No errors
ejabberd.log
2023-12-30 03:23:04.611133+01:00 [info] <0.2280.0>@ejabberd_config:load/1:82 Loading configuration from /opt/ejabberd/conf/ejabberd.yml
2023-12-30 03:23:05.926536+01:00 [info] <0.2280.0>@ejabberd_config:load/1:89 Configuration loaded successfully
2023-12-30 03:23:06.333227+01:00 [info] <0.2537.0>@ejabberd_systemd:init/1:94 Got no NOTIFY_SOCKET, notifications disabled
2023-12-30 03:23:06.664343+01:00 [info] <0.2536.0>@gen_mod:start_modules/0:141 Loading modules for zauber.dynv6.net
2023-12-30 03:23:06.887456+01:00 [info] <0.2631.0>@mod_stun_disco:parse_listener/1:616 Going to offer STUN/TURN service: a.b.c.d:3478 (udp)
2023-12-30 03:23:07.338042+01:00 [info] <0.2689.0>@mod_mqtt:init_topic_cache/2:644 Building MQTT cache for zauber.dynv6.net, this may take a while
Bug description
After running sudo ejabberdctl restart, ejabberd wrote the log shown above and then crashed. The server was unavailable that day. There were no error messages, and I could not find any information about what went wrong.
Later, after a manual sudo service ejabberd start, ejabberd started normally and wrote the following lines to the log:
[...]
2024-01-01 07:52:48.997734+01:00 [info] <0.591.0>@mod_mqtt:init_topic_cache/2:644 Building MQTT cache for zauber.dynv6.net, this may take a while
2024-01-01 07:52:49.288450+01:00 [info] <0.179.0>@ejabberd_cluster_mnesia:wait_for_sync/1:123 Waiting for Mnesia synchronization to complete
2024-01-01 07:52:49.351047+01:00 [warning] <0.472.0>@ejabberd_pkix:log_warnings/1:393 Invalid certificate in /etc/letsencrypt/live/zauber.dynv6.net/fullchain.pem: at line 58: certificate is signed by unknown CA
2024-01-01 07:52:49.460733+01:00 [info] <0.179.0>@ejabberd_app:start/2:63 ejabberd 23.10.0 is started in the node ejabberd@localhost in 3.41s
Why would sudo ejabberdctl restart bring the server down? What could make ejabberd get stuck or crash while "Building MQTT cache"?
Is this with loglevel: info?
Can you try this fix https://github.com/processone/ejabberd/issues/4109#issuecomment-1779209127 ?
Yes, I had loglevel: info. I have now changed it to loglevel: debug.
I am closing this as a duplicate for now. If the issue still persists after the new commits, I will reopen it.
I set EJABBERD_OPTS="" in ejabberdctl.cfg but ejabberd still crashed on startup.
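For reference, that setting is a single line in ejabberdctl.cfg, which is read as a shell fragment; the path below is an assumption based on the installer layout (the same /opt/ejabberd/conf directory the log shows for ejabberd.yml):

    # /opt/ejabberd/conf/ejabberdctl.cfg   (path assumed from the binary-installer layout)
    # The setting tried above, left empty:
    EJABBERD_OPTS=""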
ejabberd.yml
include_config_file:
  /var/www/html/ipa.yml:
    # 'SELF_IPV4A'
    # 'SELF_IPV6A'
    allow_only:
      - "define_macro"
  /opt/ejabberd/conf/ejabberd-secret-credentials.yml:
    # 'SQL_PASSWORD'
    allow_only:
      - "define_macro"
hosts:
  - zauber.dynv6.net
hide_sensitive_log_data: true
loglevel: debug
sql_type: mysql
sql_server: "localhost"
sql_database: "ejabberd"
sql_username: "ejabberd"
sql_password: 'SQL_PASSWORD'
sql_prepared_statements: true
auth_method: [sql]
default_db: sql
# update_sql_schema: true
certfiles:
  - /etc/letsencrypt/live/zauber.dynv6.net/fullchain.pem
  - /etc/letsencrypt/live/zauber.dynv6.net/privkey.pem
listen:
  -
    port: 5222
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    starttls_required: true
  -
    port: 5223
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    tls: true
  -
    port: 5269
    ip: "::"
    module: ejabberd_s2s_in
    max_stanza_size: 524288
    shaper: s2s_shaper
  -
    port: 5443
    ip: "::"
    module: ejabberd_http
    tls: true
    request_handlers:
      /admin: ejabberd_web_admin
      /api: mod_http_api
      /bosh: mod_bosh
      /captcha: ejabberd_captcha
      /upload: mod_http_upload
      /ws: ejabberd_http_ws
  -
    port: 5280
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /admin: ejabberd_web_admin
      # /.well-known/acme-challenge: ejabberd_acme
  -
    port: 3478
    ip: "::"
    transport: udp
    module: ejabberd_stun
    use_turn: true
    ## The server's public IPv4 address:
    turn_ipv4_address: 'SELF_IPV4A'
    ## The server's public IPv6 address:
    # turn_ipv6_address: "2001:db8::3"
  -
    port: 1883
    ip: "::"
    module: mod_mqtt
    backlog: 1000
acme:
  auto: false
s2s_use_starttls: optional
captcha_cmd: /usr/local/bin/captcha.sh
captcha_url: auto
acl:
  admin:
    user: [email protected]
  local:
    user_regexp: ""
  loopback:
    ip:
      - 127.0.0.0/8
      - ::1/128
access_rules:
  local:
    allow: local
  c2s:
    deny: blocked
    allow: all
  announce:
    allow: admin
  configure:
    allow: admin
  muc_create:
    allow: local
  pubsub_createnode:
    allow: local
  trusted_network:
    allow: loopback
api_permissions:
  "console commands":
    from:
      - ejabberd_ctl
    who: all
    what: "*"
  "admin access":
    who:
      access:
        allow:
          - acl: loopback
          - acl: admin
      oauth:
        scope: "ejabberd:admin"
        access:
          allow:
            - acl: loopback
            - acl: admin
    what:
      - "*"
      - "!stop"
      - "!start"
  "public commands":
    who:
      ip: 127.0.0.1/8
    what:
      - status
      - connected_users_number
shaper:
  normal:
    rate: 3000
    burst_size: 20000
  fast: 100000
shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: admin
    100: all
  c2s_shaper:
    none: admin
    normal: all
  s2s_shaper: fast
modules:
  mod_adhoc: {}
  mod_admin_extra: {}
  mod_announce:
    access: announce
  mod_avatar: {}
  mod_blocking: {}
  mod_bosh: {}
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state: {}
  mod_configure: {}
  mod_disco: {}
  mod_fail2ban: {}
  mod_http_api: {}
  mod_http_upload:
    put_url: https://@HOST@:5443/upload
    custom_headers:
      "Access-Control-Allow-Origin": "https://@HOST@"
      "Access-Control-Allow-Methods": "GET,HEAD,PUT,OPTIONS"
      "Access-Control-Allow-Headers": "Content-Type"
  mod_last: {}
  mod_mam:
    db_type: sql
    assume_mam_usage: true
    default: always
  mod_mqtt: {}
  mod_muc:
    access:
      - allow
    access_admin:
      - allow: admin
    access_create: muc_create
    access_persistent: muc_create
    access_mam:
      - allow
    default_room_options:
      mam: true
  mod_muc_admin: {}
  mod_offline:
    access_max_user_messages: max_user_offline_messages
  mod_ping: {}
  mod_privacy: {}
  mod_private: {}
  mod_proxy65:
    access: local
    max_connections: 5
  mod_pubsub:
    access_createnode: pubsub_createnode
    plugins:
      - flat
      - pep
    force_node_config:
      ## Avoid buggy clients to make their bookmarks public
      storage:bookmarks:
        access_model: whitelist
  mod_push: {}
  mod_push_keepalive: {}
  mod_register:
    captcha_protected: true
  mod_roster:
    versioning: true
  mod_s2s_dialback: {}
  mod_shared_roster: {}
  mod_stream_mgmt:
    resend_on_timeout: if_offline
  mod_stun_disco: {}
  mod_vcard: {}
  mod_vcard_xupdate: {}
  mod_version:
    show_os: false
ejabberd.log ends with
2024-01-04 03:17:15.185020+01:00 [info] <0.1870.0>@mod_mqtt:init_topic_cache/2:644 Building MQTT cache for zauber.dynv6.net, this may take a while
2024-01-04 03:17:15.186906+01:00 [debug] <0.1790.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
supervisor: {local,ejabberd_gen_mod_sup}
started: [{pid,<0.1870.0>},
{id,'mod_mqtt_zauber.dynv6.net'},
{mfargs,{gen_server,start_link,
[{local,'mod_mqtt_zauber.dynv6.net'},
mod_mqtt,
[<<"zauber.dynv6.net">>,
#{cache_size => 1000,max_queue => 5000,
cache_missed => true,db_type => sql,
queue_type => ram,
cache_life_time => 3600000,
use_cache => true,ram_db_type => mnesia,
max_topic_depth => 8,
match_retained_limit => 1000,
session_expiry => 300000,
access_publish => [],
access_subscribe => [],
max_topic_aliases => 100}],
[]]}},
{restart_type,transient},
{significant,false},
{shutdown,60000},
{child_type,worker}]
2024-01-04 03:17:15.187762+01:00 [debug] <0.1717.0>@gen_mod:start_module/4:164 Loading mod_proxy65 at zauber.dynv6.net
2024-01-04 03:17:15.208536+01:00 [debug] <0.1743.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
supervisor: {local,ejabberd_listener}
started: [{pid,<0.1874.0>},
{id,{7777,{192,168,178,55},tcp}},
{mfargs,
{ejabberd_listener,start,
[{7777,{192,168,178,55},tcp},
mod_proxy65_stream,
#{access => local,name => <<"SOCKS5 Bytestreams">>,
port => 7777,recbuf => 65536,send_timeout => 15000,
sndbuf => 65536,hosts => [],
host => <<"proxy.zauber.dynv6.net">>,
ip => {192,168,178,55},
supervisor => true,hostname => undefined,
backlog => 128,unix_socket => #{},shaper => none,
transport => tcp,max_connections => 5,
ram_db_type => mnesia,accept_interval => 0,
use_proxy_protocol => false,
server_host => <<"zauber.dynv6.net">>,
vcard => undefined,auth_type => anonymous}]}},
{restart_type,transient},
{significant,false},
{shutdown,brutal_kill},
{child_type,worker}]
2024-01-04 03:17:15.209713+01:00 [debug] <0.1758.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
supervisor: {local,ejabberd_backend_sup}
started: [{pid,<0.1875.0>},
{id,mod_proxy65_mnesia},
{mfargs,{mod_proxy65_mnesia,start_link,[]}},
{restart_type,transient},
{significant,false},
{shutdown,5000},
{child_type,worker}]
2024-01-04 03:17:15.210396+01:00 [debug] <0.1877.0>@ejabberd_router:register_route/4:174 Route registered: proxy.zauber.dynv6.net
2024-01-04 03:17:15.210786+01:00 [debug] <0.1876.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
supervisor: {local,'ejabberd_mod_proxy65_zauber.dynv6.net'}
started: [{pid,<0.1877.0>},
{id,mod_proxy65_service},
{mfargs,
{mod_proxy65_service,start_link,[<<"zauber.dynv6.net">>]}},
{restart_type,transient},
{significant,false},
{shutdown,5000},
{child_type,worker}]
2024-01-04 03:17:15.211253+01:00 [debug] <0.1790.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
supervisor: {local,ejabberd_gen_mod_sup}
started: [{pid,<0.1876.0>},
{id,'ejabberd_mod_proxy65_zauber.dynv6.net'},
{mfargs,{mod_proxy65,start_link,[<<"zauber.dynv6.net">>]}},
{restart_type,transient},
{significant,false},
{shutdown,infinity},
{child_type,supervisor}]
2024-01-04 03:17:15.211649+01:00 [debug] <0.1717.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
supervisor: {local,ejabberd_sup}
started: [{pid,<0.1790.0>},
{id,ejabberd_gen_mod_sup},
{mfargs,{gen_mod,start_link,[]}},
{restart_type,permanent},
{significant,false},
{shutdown,infinity},
{child_type,supervisor}]
I noticed this only happens sometimes, and only with ejabberdctl restart. ejabberd starts fine with service ejabberd restart, which internally runs ejabberdctl stop followed by ejabberdctl start (see the sketch just below).
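In the meantime, that same stop/start sequence can be run by hand as a workaround; a minimal sketch, assuming a standard ejabberdctl (stopped/started are its wait-until-done subcommands):

    sudo ejabberdctl stop
    sudo ejabberdctl stopped    # wait until the node has actually shut down
    sudo ejabberdctl start
    sudo ejabberdctl started    # wait until the node is up and answering
    sudo ejabberdctl status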
I can reproduce the problem (or at least I think it's the same problem) about once every two or three restarts.
The problem appears with any binary installer: from the master branch, and from 23.10, 23.04 and 22.05.
When running ejabberdctl restart, the functions called are:
- ejabberd_app:start
- ejabberd_hooks:run(ejabberd_started, [])
- ejabberd_pkix:ejabberd_started()
- fast_tls:clear_cache()
- fast_tls:clear_cache_nif()
- and this goes into the fast_tls C code, which is where it crashes
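To exercise that last call in isolation, one can attach a shell to a running node and invoke it by hand; this is only a debugging sketch, assuming a standard ejabberdctl:

    # Attach an interactive Erlang shell to the running node:
    sudo ejabberdctl debug
    # In the attached shell, run the call from the end of the chain above:
    #   fast_tls:clear_cache().
    # If the NIF is at fault, the node should go down the same way it does
    # during "ejabberdctl restart".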
It is very easy to reproduce:
- Install ejabberd using binary installers (master branch or any recent release)
- Start that ejabberd. If you use "ejabberdctl live", you will be able to see its behaviour in realtime
- Call ejabberdctl restart
- If it restarts correctly, call it again. After 2 or 3 restarts, the problem should appear (a scripted version of this loop is sketched below).
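The scripted version, just as a rough sketch (the sleep and the number of attempts are arbitrary):

    for i in 1 2 3 4 5; do
        echo "restart attempt $i"
        sudo ejabberdctl restart
        sleep 15
        sudo ejabberdctl status || { echo "node is down after attempt $i"; break; }
    done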
The problem does not appear when using the container, or when ejabberd is compiled from source with Erlang 26.1 (the version used in the installers). In those cases, ejabberdctl restart successfully restarts ejabberd every time.
Is this seen with OpenSSL 3, and when no dhfile is configured for the listeners in ejabberd.yml? If so, this may be fixed by https://github.com/processone/fast_tls/pull/63
The problem is reproducible when using the binary installers (23.10, older releases, and the master branch). Those installers are built by tools/make-binaries, which downloads and compiles OpenSSL 1.1.1w:
https://github.com/processone/ejabberd/blob/66d701e788af7d1f86c253481ab2d8411bc92a3c/tools/make-binaries#L73
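A quick way to confirm which OpenSSL an installed build actually carries is to grep the fast_tls shared object for its version string; the paths below are assumptions about the installer layout, so locate the file first:

    # Locate the NIF shipped by the installer (path layout is an assumption):
    find /opt/ejabberd* -name 'fast_tls*.so' 2>/dev/null
    # If OpenSSL is linked in statically, its version string is embedded:
    strings /opt/ejabberd-*/lib/fast_tls-*/priv/lib/fast_tls.so | grep -i 'openssl [0-9]'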
As you mentioned, fast_tls got some fixes in git. They are now tagged and used by ejabberd, but installers built with the latest fast_tls still show the same problem.
I guess it's worth trying to update binary installers to OpenSSL 3, but unfortunately I get compilation problems in that case.
@badlop: the OpenSSL 1.1.1 branch is now EOL, so yes, please look at OpenSSL 3.2.x:
- https://www.openssl.org/blog/blog/2023/09/11/eol-111/
- https://www.openssl.org/blog/blog/2023/11/23/OpenSSL32/
- https://www.openssl.org/news/openssl-3.2-notes.html
The newest ejabberd installers use OpenSSL 3.2.1 (since 1962fc88d62ede7676bf7f91f72094ef3f105839) and fast_tls 1.1.18, which includes the fix https://github.com/processone/fast_tls/commit/da16622da621eb4d74f047e8ac376d701782007b. Those installers can be downloaded from https://github.com/processone/ejabberd/actions/runs/7844499079
However, the problem mentioned in this issue is still present.