mnesia crashes when the number of rooms increases
Environment
- ejabberd version: 23.01
- Erlang version: Erlang/OTP 24
- OS: docker: ejabberd/ecs
- Installed from: dockerImage: ejabberd/ecs
Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml
loglevel: debug
sql_type: mysql
new_sql_schema: false
default_db: sql
default_ram_db: mnesia
auth_method: sql
cache_size: 20000
max_fsm_queue: 30000
mod_muc:
  access:
    - allow
  access_admin:
    - allow: admin
  access_create:
    - allow: admin
  access_persistent: muc_create
  access_mam:
    - allow
  default_room_options:
    allow_user_invites: false
    allow_subscription: true
    allow_change_subj: false
    allow_query_users: true
    allow_private_messages: true
    mam: true
    members_by_default: true
    members_only: true
    logging: true
    persistent: true
    anonymous: false
    public: false
    presence_broadcast:
      - visitor
  history_size: 0
  max_users: 5000
  max_user_conferences: 5000
mod_muc_admin: {}
...
Errors from ejabberd.log
%% A lot of lines like the following:
2024-11-17 06:35:32.258564+00:00 [debug] SQL: "select opts from muc_room where name='room_name1' and host='conference.myhost.com'"
2024-11-17 06:35:32.259911+00:00 [debug] SQL: "select jid, nick, nodes from muc_room_subscribers where room='room_name1' and host='conference.myhost.com'"
2024-11-17 06:35:32.260771+00:00 [debug] Restore room: room_name1
...
2024-11-17 06:35:59.080396+00:00 [error] ** Generic server ejabberd_s2s terminating
** Last message in was {mnesia_system_event,
{mnesia_down,'ejabberd@ejabberd-0'}}
** When Server state == {state}
** Reason for termination ==
** {{aborted,{node_not_running,'ejabberd@ejabberd-0'}},
[{mnesia,abort,1,[{file,"mnesia.erl"},{line,362}]},
{mnesia_tm,prepare_items,5,[{file,"mnesia_tm.erl"},{line,1225}]},
{mnesia_tm,dirty,2,[{file,"mnesia_tm.erl"},{line,1067}]},
{lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
{ejabberd_commands,unregister_commands,1,
[{file,"src/ejabberd_commands.erl"},{line,160}]},
{ejabberd_s2s,terminate,2,[{file,"src/ejabberd_s2s.erl"},{line,274}]},
{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,733}]},
{gen_server,terminate,10,[{file,"gen_server.erl"},{line,918}]}]}
2024-11-17 06:35:59.081972+00:00 [notice] application: mnesia
exited: stopped
type: permanent
2024-11-17 06:35:59.080025+00:00 [error] ** Generic server 'mod_muc_mnesia_myhost.com' terminating
** Last message in was {mnesia_system_event,
{mnesia_down,'ejabberd@ejabberd-0'}}
** When Server state == {state}
** Reason for termination ==
** {{aborted,{no_exists,[muc_online_room,
[{{muc_online_room,'_','$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}},
[{mnesia_tm,non_transaction,5,[{file,"mnesia_tm.erl"},{line,753}]},
{mod_muc_mnesia,handle_info,2,
[{file,"src/mod_muc_mnesia.erl"},{line,355}]},
{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,695}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,771}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2024-11-17 06:35:59.080069+00:00 [error] ** Generic server ejabberd_sm_mnesia terminating
** Last message in was {mnesia_system_event,
{mnesia_down,'ejabberd@ejabberd-0'}}
** When Server state == {state}
** Reason for termination ==
** {badarg,[{ets,select,
[session,
[{{session,{'_','$1'},'_','_','_','_'},
[{'==',{node,'$1'},{const,'ejabberd@ejabberd-0'}}],
['$_']}]],
[{error_info,#{cause => id,module => erl_stdlib_errors}}]},
{ejabberd_sm_mnesia,handle_info,2,
[{file,"src/ejabberd_sm_mnesia.erl"},
{line,116}]},
{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,695}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,771}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2024-11-17 06:35:59.083746+00:00 [critical] Internal error of module mod_muc has occurred during start:
** Options: #{db_type => sql,preload_rooms => true,
access_create => [{allow,[{acl,admin}]}],
vcard => undefined,max_room_id => infinity,
ram_db_type => mnesia,queue_type => ram,
hibernation_timeout => infinity,min_presence_interval => 0,
user_presence_shaper => none,
host => <<"conference.myhost.com">>,
max_rooms_discoitems => 100,hosts => [],
user_message_shaper => none,
cleanup_affiliations_on_start => false,
max_room_name => infinity,max_users_admin_threshold => 5,
max_user_conferences => 5000,max_users => 5000,
access => [{allow,[{acl,all}]}],
access_persistent => muc_create,name => <<"Chatrooms">>,
max_room_desc => infinity,room_shaper => none,
regexp_room_id => <<>>,history_size => 0,
max_password => infinity,min_message_interval => 0,
max_users_presence => 1000,
access_admin => [{allow,[{acl,admin}]}],
access_register => all,
default_room_options =>
[{allow_user_invites,false},
{allow_subscription,true},
{allow_change_subj,false},
{allow_query_users,true},
{allow_private_messages,true},
{mam,true},
{members_by_default,true},
{members_only,true},
{logging,true},
{persistent,true},
{anonymous,false},
{public,false},
{presence_broadcast,[visitor]}],
access_mam => [{allow,[{acl,all}]}],
max_captcha_whitelist => infinity}
** exception exit: {aborted,{no_exists,[muc_online_room,
{<<"mucroom_1126">>,
<<"conference.myhost.com">>}]}}
in function mnesia:abort/1 (mnesia.erl, line 362)
in call from mod_muc_mnesia:find_online_room/2 (src/mod_muc_mnesia.erl, line 180)
in call from mod_muc:unhibernate_room/4 (src/mod_muc.erl, line 588)
in call from lists:foreach/2 (lists.erl, line 1342)
in call from gen_mod:start_module/4 (src/gen_mod.erl, line 155)
in call from lists:foreach/2 (lists.erl, line 1342)
in call from gen_mod:start_link/0 (src/gen_mod.erl, line 82)
in call from supervisor:do_start_child_i/3 (supervisor.erl, line 414)
2024-11-17 06:35:59.084366+00:00 [critical] ejabberd initialization was aborted because a module start failed.
2024-11-17 06:35:59.088199+00:00 [error] CRASH REPORT:
crasher:
initial call: mod_muc_mnesia:init/1
pid: <0.555.0>
registered_name: 'mod_muc_mnesia_myhost.com'
exception exit: {aborted,
{no_exists,
[muc_online_room,
[{{muc_online_room,'_','$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}}
in function mnesia_tm:non_transaction/5 (mnesia_tm.erl, line 753)
in call from mod_muc_mnesia:handle_info/2 (src/mod_muc_mnesia.erl, line 355)
in call from gen_server:try_dispatch/4 (gen_server.erl, line 695)
in call from gen_server:handle_msg/6 (gen_server.erl, line 771)
ancestors: [ejabberd_backend_sup,ejabberd_sup,<0.189.0>]
message_queue_len: 0
messages: []
links: [<0.498.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 10958
stack_size: 28
reductions: 20062
neighbours:
2024-11-17 06:35:59.087724+00:00 [error] CRASH REPORT:
crasher:
initial call: ejabberd_s2s:init/1
pid: <0.525.0>
registered_name: ejabberd_s2s
exception exit: {aborted,{node_not_running,'ejabberd@ejabberd-0'}}
in function mnesia:abort/1 (mnesia.erl, line 362)
in call from mnesia_tm:prepare_items/5 (mnesia_tm.erl, line 1225)
in call from mnesia_tm:dirty/2 (mnesia_tm.erl, line 1067)
in call from lists:foreach/2 (lists.erl, line 1342)
in call from ejabberd_commands:unregister_commands/1 (src/ejabberd_commands.erl, line 160)
in call from ejabberd_s2s:terminate/2 (src/ejabberd_s2s.erl, line 274)
in call from gen_server:try_terminate/3 (gen_server.erl, line 733)
in call from gen_server:terminate/10 (gen_server.erl, line 918)
ancestors: [ejabberd_sup,<0.189.0>]
message_queue_len: 0
messages: []
links: [<0.452.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6772
stack_size: 28
reductions: 21783
neighbours:
2024-11-17 06:35:59.090564+00:00 [error] SUPERVISOR REPORT:
supervisor: {local,ejabberd_backend_sup}
errorContext: child_terminated
reason: {aborted,{no_exists,[muc_online_room,
[{{muc_online_room,'_','$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}}
offender: [{pid,<0.555.0>},
{id,mod_muc_mnesia},
{mfargs,
{mod_muc_mnesia,start_link,
[<<"myhost.com">>,
#{db_type => sql,preload_rooms => true,
access_create => [{allow,[{acl,admin}]}],
vcard => undefined,max_room_id => infinity,
ram_db_type => mnesia,queue_type => ram,
hibernation_timeout => infinity,
min_presence_interval => 0,
user_presence_shaper => none,
host => <<"conference.myhost.com">>,
max_rooms_discoitems => 100,
hosts => [<<"conference.myhost.com">>],
user_message_shaper => none,
cleanup_affiliations_on_start => false,
max_room_name => infinity,
max_users_admin_threshold => 5,
max_user_conferences => 5000,max_users => 5000,
access => [{allow,[{acl,all}]}],
access_persistent => muc_create,
name => <<"Chatrooms">>,max_room_desc => infinity,
room_shaper => none,regexp_room_id => <<>>,
history_size => 0,max_password => infinity,
min_message_interval => 0,
max_users_presence => 1000,
access_admin => [{allow,[{acl,admin}]}],
access_register => all,
default_room_options =>
[{allow_user_invites,false},
{allow_subscription,true},
{allow_change_subj,false},
{allow_query_users,true},
{allow_private_messages,true},
{mam,true},
{members_by_default,true},
{members_only,true},
{logging,true},
{persistent,true},
{anonymous,false},
{public,false},
{presence_broadcast,[visitor]}],
access_mam => [{allow,[{acl,all}]}],
max_captcha_whitelist => infinity}]}},
{restart_type,transient},
{significant,false},
{shutdown,5000},
{child_type,worker}]
2024-11-17 06:35:59.092607+00:00 [error] SUPERVISOR REPORT:
supervisor: {local,ejabberd_backend_sup}
errorContext: start_error
reason: {aborted,
{no_exists,
[muc_online_room,
[{{muc_online_room,
{'_',<<"conference.myhost.com">>},
'$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}}
offender: [{pid,<0.555.0>},
{id,mod_muc_mnesia},
{mfargs,
{mod_muc_mnesia,start_link,
[<<"myhost.com">>,
#{db_type => sql,preload_rooms => true,
access_create => [{allow,[{acl,admin}]}],
vcard => undefined,max_room_id => infinity,
ram_db_type => mnesia,queue_type => ram,
hibernation_timeout => infinity,
min_presence_interval => 0,
user_presence_shaper => none,
host => <<"conference.myhost.com">>,
max_rooms_discoitems => 100,
hosts => [<<"conference.myhost.com">>],
user_message_shaper => none,
cleanup_affiliations_on_start => false,
max_room_name => infinity,
max_users_admin_threshold => 5,
max_user_conferences => 5000,max_users => 5000,
access => [{allow,[{acl,all}]}],
access_persistent => muc_create,
name => <<"Chatrooms">>,max_room_desc => infinity,
room_shaper => none,regexp_room_id => <<>>,
history_size => 0,max_password => infinity,
min_message_interval => 0,
max_users_presence => 1000,
access_admin => [{allow,[{acl,admin}]}],
access_register => all,
default_room_options =>
[{allow_user_invites,false},
{allow_subscription,true},
{allow_change_subj,false},
{allow_query_users,true},
{allow_private_messages,true},
{mam,true},
{members_by_default,true},
{members_only,true},
{logging,true},
{persistent,true},
{anonymous,false},
{public,false},
{presence_broadcast,[visitor]}],
access_mam => [{allow,[{acl,all}]}],
max_captcha_whitelist => infinity}]}},
{restart_type,transient},
{significant,false},
{shutdown,5000},
{child_type,worker}]
2024-11-17 06:35:59.097723+00:00 [error] SUPERVISOR REPORT:
supervisor: {local,ejabberd_backend_sup}
errorContext: start_error
reason: {aborted,
{no_exists,
[muc_online_room,
[{{muc_online_room,
{'_',<<"conference.myhost.com">>},
'$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}}
offender: [{pid,{restarting,<0.555.0>}},
{id,mod_muc_mnesia},
{mfargs,
{mod_muc_mnesia,start_link,
[<<"myhost.com">>,
#{db_type => sql,preload_rooms => true,
access_create => [{allow,[{acl,admin}]}],
vcard => undefined,max_room_id => infinity,
ram_db_type => mnesia,queue_type => ram,
hibernation_timeout => infinity,
min_presence_interval => 0,
user_presence_shaper => none,
host => <<"conference.myhost.com">>,
max_rooms_discoitems => 100,
hosts => [<<"conference.myhost.com">>],
user_message_shaper => none,
cleanup_affiliations_on_start => false,
max_room_name => infinity,
max_users_admin_threshold => 5,
max_user_conferences => 5000,max_users => 5000,
access => [{allow,[{acl,all}]}],
access_persistent => muc_create,
name => <<"Chatrooms">>,max_room_desc => infinity,
room_shaper => none,regexp_room_id => <<>>,
history_size => 0,max_password => infinity,
min_message_interval => 0,
max_users_presence => 1000,
access_admin => [{allow,[{acl,admin}]}],
access_register => all,
default_room_options =>
[{allow_user_invites,false},
{allow_subscription,true},
{allow_change_subj,false},
{allow_query_users,true},
{allow_private_messages,true},
{mam,true},
{members_by_default,true},
{members_only,true},
{logging,true},
{persistent,true},
{anonymous,false},
{public,false},
{presence_broadcast,[visitor]}],
access_mam => [{allow,[{acl,all}]}],
max_captcha_whitelist => infinity}]}},
{restart_type,transient},
{significant,false},
{shutdown,5000},
{child_type,worker}]
2024-11-17 06:35:59.092770+00:00 [error] CRASH REPORT:
crasher:
initial call: mod_muc_mnesia:init/1
pid: <0.6903.0>
registered_name: []
exception exit: {aborted,
{no_exists,
[muc_online_room,
[{{muc_online_room,
{'_',<<"conference.myhost.com">>},
'$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}}
in function mnesia_tm:non_transaction/5 (mnesia_tm.erl, line 753)
in call from lists:foreach/2 (lists.erl, line 1342)
in call from mod_muc_mnesia:init/1 (src/mod_muc_mnesia.erl, line 336)
in call from gen_server:init_it/2 (gen_server.erl, line 423)
in call from gen_server:init_it/6 (gen_server.erl, line 390)
ancestors: [ejabberd_backend_sup,ejabberd_sup,<0.189.0>]
message_queue_len: 0
messages: []
links: [<0.498.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 987
stack_size: 28
reductions: 288
neighbours:
2024-11-17 06:35:59.099375+00:00 [error] CRASH REPORT:
crasher:
initial call: mod_muc_mnesia:init/1
pid: <0.6904.0>
registered_name: []
exception exit: {aborted,
{no_exists,
[muc_online_room,
[{{muc_online_room,
{'_',<<"conference.myhost.com">>},
'$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}}
in function mnesia_tm:non_transaction/5 (mnesia_tm.erl, line 753)
in call from lists:foreach/2 (lists.erl, line 1342)
in call from mod_muc_mnesia:init/1 (src/mod_muc_mnesia.erl, line 336)
in call from gen_server:init_it/2 (gen_server.erl, line 423)
in call from gen_server:init_it/6 (gen_server.erl, line 390)
ancestors: [ejabberd_backend_sup,ejabberd_sup,<0.189.0>]
message_queue_len: 0
messages: []
links: [<0.498.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 610
stack_size: 28
reductions: 225
neighbours:
2024-11-17 06:35:59.099891+00:00 [error] SUPERVISOR REPORT:
supervisor: {local,ejabberd_backend_sup}
errorContext: start_error
reason: {aborted,
{no_exists,
[muc_online_room,
[{{muc_online_room,
{'_',<<"conference.myhost.com">>},
'$1'},
[{'==',{node,'$1'},'ejabberd@ejabberd-0'}],
['$_']}]]}}
offender: [{pid,{restarting,<0.555.0>}},
{id,mod_muc_mnesia},
{mfargs,
{mod_muc_mnesia,start_link,
[<<"myhost.com">>,
#{db_type => sql,preload_rooms => true,
access_create => [{allow,[{acl,admin}]}],
vcard => undefined,max_room_id => infinity,
ram_db_type => mnesia,queue_type => ram,
hibernation_timeout => infinity,
min_presence_interval => 0,
user_presence_shaper => none,
host => <<"conference.myhost.com">>,
max_rooms_discoitems => 100,
hosts => [<<"conference.myhost.com">>],
user_message_shaper => none,
cleanup_affiliations_on_start => false,
max_room_name => infinity,
max_users_admin_threshold => 5,
max_user_conferences => 5000,max_users => 5000,
access => [{allow,[{acl,all}]}],
access_persistent => muc_create,
name => <<"Chatrooms">>,max_room_desc => infinity,
room_shaper => none,regexp_room_id => <<>>,
history_size => 0,max_password => infinity,
min_message_interval => 0,
max_users_presence => 1000,
access_admin => [{allow,[{acl,admin}]}],
access_register => all,
default_room_options =>
[{allow_user_invites,false},
{allow_subscription,true},
{allow_change_subj,false},
{allow_query_users,true},
{allow_private_messages,true},
{mam,true},
{members_by_default,true},
{members_only,true},
{logging,true},
{persistent,true},
{anonymous,false},
{public,false},
{presence_broadcast,[visitor]}],
access_mam => [{allow,[{acl,all}]}],
max_captcha_whitelist => infinity}]}},
{restart_type,transient},
{significant,false},
{shutdown,5000},
{child_type,worker}]
Bug description
We have an ejabberd server running in a Kubernetes cluster with two pods (4 CPUs, 8 GB RAM each). The server crashes when processing a large number of rooms, either with long subjects or with no subjects. These rooms were created by duplicating existing rooms under different names directly in the MySQL database.
Steps to Reproduce:
- Deploy ejabberd on 2 K8s pods (4 CPUs, 8GB RAM each).
- Copy existing rooms in MySQL, changing only the room names. In case 1 the rooms have a long subject text; in case 2 they have no subject.
- Case 1 has ~2,000 rooms; case 2 has ~5,000 rooms.
- The crash occurs after loading ~500 rooms, with no consistent threshold.
- The issue persists with both long-subject rooms and rooms with no subject.
- The pod crashes after a varying number of rooms, suggesting an issue with room handling.
- Logs indicate a potential memory or room-processing issue.
Does anyone have an idea what this bug is related to, or how to solve it?
The latest version is 24.10, FYI.
@licaon-kter I know but we have not updated it yet (we have plugins that should be compatible with the update). Is this problem related to the version?
Deploy ejabberd on 2 K8s pods (4 CPUs, 8GB RAM each).
Do you mean you have a cluster of two erlang nodes? What happens if you use just 1 erlang node?
Copy existing rooms in MySQL, changing only room names. In case 1, rooms have a long subject text, in case 2 no subject.
What do you mean by "copy"? How do you "copy" rooms to the database? Can you provide an example *.sql file that we can import and reproduce the problem?
@badlop yes, we have a cluster of 2 pods. With 1 pod it works better, but it also crashes once the number of MUC rooms increases.
The SQL query we tried is:
insert into muc_room(
    select * from (
        select concat('grp', right(md5(uuid()), 6), '_567') as name,
               host, opts, '1980-01-02 00:00:00'
        from muc_room
        where name like 'grp%'
        limit 200
    ) as names2
);
The reason we did this is that we found that in some cases, once we had a lot of rooms (~4,000), the pod crashed and couldn't recover.
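For reference, a quick sanity check after running an insert like the one above (a sketch; it assumes the same 'grp' naming pattern) shows how many test rooms actually landed in the table:

```sql
-- Count the duplicated test rooms (sketch; assumes the 'grp…' naming pattern above)
SELECT COUNT(*) AS test_rooms
FROM muc_room
WHERE name LIKE 'grp%';
```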
Unfortunately, looking at the error messages, I am not able to determine what exactly the problem is in your case. For that reason I wanted to reproduce your crash in a controlled test environment.
I start a container running ecs:23.01, set up a MySQL database, create 9,000 rooms (default options, persistent: true), restart the container, and ejabberd starts correctly. Your other configured options should not have any effect on the problem.
Whatever problem you have, you have not yet provided enough information to reproduce it. Maybe the problem is not in the raw number of rooms, but in the number of rooms combined with something that each room does.
If you isolate the ejabberd container from the network, so no client can connect to it, does it still crash at startup?
The sql query we tried is:
This does not create the rooms, right? This only modifies your existing rooms.
I can only test with very simple and small rooms that I create myself. Maybe the problem is triggered by rooms with more content stored, like the ones you have?
max_users: 5000
max_user_conferences: 5000
Max 5000 users in a room, and a user in 5000 different rooms?
You are not using MUC for humans chatting, right? You are using MUC for clients interchanging information, right? I wonder what unexpected behaviours those clients may perform in MUC.
I saw that for MUC rooms with only a few affiliated members, I could add about 26,000 rooms to the DB.
It seems that the more data the opts column contains, the faster mnesia crashes.
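To put numbers behind that observation, a query along these lines (a sketch against the stock muc_room schema; adjust names if your schema differs) profiles how much data the opts column carries per MUC host:

```sql
-- Size profile of stored room options (sketch; stock muc_room schema assumed)
SELECT host,
       COUNT(*)          AS rooms,
       AVG(LENGTH(opts)) AS avg_opts_bytes,
       MAX(LENGTH(opts)) AS max_opts_bytes
FROM muc_room
GROUP BY host;
```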
Regarding the following paragraph:
This does not create the rooms, right? This only modifies your existing rooms.
When I add a new row to the muc_room table, the ejabberd server loads it into memory as a new room during server initialization (with the default parameter preload_rooms set to true). When I change this parameter to false, the server starts properly, but I am not sure this is the correct choice: our clients usually use MucSub instead of MUC presence, which might leave the rooms hibernated and unavailable for editing through the REST API.
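For reference, the trade-off described above comes down to a single mod_muc option; a minimal sketch of the change, using the same layout as the config quoted earlier:

```yaml
mod_muc:
  # false = rooms are loaded lazily on first access instead of all at startup;
  # MucSub-only rooms may then stay hibernated until something touches them
  preload_rooms: false
```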
Regarding the 5,000 users in rooms: most of our rooms have 100-200 members, but there are a few special rooms that host thousands of members.
@badlop I hope that the explanation was clear enough to understand the situation.
I couldn't find any useful solution. Did you somehow solve the problem, or at least find a workaround?