ejabberd
MUC rooms in error state
Environment
- ejabberd version: 22.05
- Erlang version: 12.3.1
- OS: Linux (Debian)
- Installed from: source
Errors from error.log/crash.log
2022-08-05 13:46:15.144600+05:30 [info] <0.14919.7002>@mod_http_api:log/3:511 API call get_room_options
  [{<<"name">>,<<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
   {<<"service">>,<<"conference.xxx.xxx.xxx">>}] from ::ffff:127.0.0.1:61944
2022-08-05 13:46:15.145137+05:30 [error] <0.14919.7002>@mod_http_api:handle/4:259 REST API Error:
  get_room_options([{<<"name">>,<<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
                    {<<"service">>,<<"conference.xxx.xxx.xxx">>}]) -> error:{badmatch,{error,notfound}}
  [{mod_muc_admin,get_room_config,1,[{file,"src/mod_muc_admin.erl"},{line,926}]},
   {mod_muc_admin,get_room_options,1,[{file,"src/mod_muc_admin.erl"},{line,1237}]},
   {mod_http_api,handle2,4,[{file,"src/mod_http_api.erl"},{line,269}]},
   {mod_http_api,handle,4,[{file,"src/mod_http_api.erl"},{line,230}]},
   {mod_http_api,perform_call,4,[{file,"src/mod_http_api.erl"},{line,190}]},
   {mod_http_api,process,2,[{file,"src/mod_http_api.erl"},{line,143}]},
   {ejabberd_http,process,2,[{file,"src/ejabberd_http.erl"},{line,373}]},
   {ejabberd_http,process_request,1,[{file,"src/ejabberd_http.erl"},{line,496}]}]
Bug description
I am unable to get the options of an already existing room using the get_room_options API. It returns error code 400, "Conference room does not exist", although the room's records are present in the muc_room table in the PostgreSQL database.
When I then try to delete the room using the destroy_room API, I get a 0 (success) response. But when I try to create the room again with the create_room_with_opts API, I get the error "Room already exists".
It looks like the rooms are in an error state in Mnesia. I have this issue with 30 to 40 rooms. I recently enabled message subscriptions (MucSub) for rooms. How can I debug/resolve the issue?
If you try with ejabberdctl ..., does it work?
Do you have the hibernation option set in mod_muc (and to some low value)? It may be that the room is hibernated between the time we locate where it lives and the time we send a query to it.
@licaon-kter I tried with ejabberdctl. Still stuck in this loop:
ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}
ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}
@prefiks How can I check the hibernation status of a room? I tried the following:
ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)
@logicwonder can you get the exit code of the destroy command? (echo $?)
Hm, it looks like the table that keeps the mapping from room names to handler processes has some bad entries. Is that a cluster or a single-node instance? Do you have any entries in error.log that mention mod_muc?
You should be able to clean those records with something like this:
lists:foreach(fun({R, H, P}) -> case node(P) == node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).
in the debug console (just replace the host at the end).
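For readability, the same one-liner can be written out like this (functionally equivalent; it assumes the ejabberd debug shell on one cluster node, and the service host is a placeholder):

```erlang
%% Readable equivalent of the cleanup one-liner above.
%% Run in the ejabberd debug shell; replace the service host.
Service = <<"conference.myserver.com">>,
lists:foreach(
  fun({Room, Host, Pid}) ->
      %% Only touch entries that claim to live on this node but whose
      %% room process is no longer running.
      case node(Pid) == node() andalso not is_process_alive(Pid) of
          true ->
              %% Drop the stale entry from the online-rooms table.
              mod_muc:room_destroyed(Host, Room, Pid,
                                     ejabberd_router:host_of_route(Host));
          false ->
              ok
      end
  end,
  mod_muc:get_online_rooms(Service)).
```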
@licaon-kter The exit code was 0
@prefiks - This is a cluster with 3 nodes. I tried the script in the debug shell, but the issue is not fixed after execution.
Erlang/OTP 24 [erts-12.3.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]
Eshell V12.3.1 (abort with ^G)
([email protected])1> lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.xxxx.xxxx.xxx">>)).
ok
([email protected])2>
BREAK: (a)bort (A)bort with dump (c)ontinue (p)roc info (i)nfo
(l)oaded (v)ersion (k)ill (D)b-tables (d)istribution
^C
ejabberd@node1:~$ ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
in function mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
in call from erpc:execute_call/4 (erpc.erl, line 392)
ejabberd@node1:~$ echo $?
1
ejabberd@node1:~$ ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx xxxxx.xxxx.xxx
{error,"Room already exists"}
ejabberd@node1:~$ ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx
ejabberd@node1:~$ echo $?
0
Error log with mod_muc:
2022-08-04 19:43:19.668717+05:30 [error] <0.16826.6542>@ejabberd_sql:check_error/2:1236 SQL query 'Q106720625' at {mod_muc_sql,{105,9}} failed: <<"timed out">>
2022-08-04 19:43:33.855798+05:30 [error] <0.21662.6298>@ejabberd_sql:check_error/2:1236 SQL query 'Q113641781' at {mod_muc_sql,{120,23}} failed: <<"timed out">>
2022-08-04 21:14:34.091632+05:30 [error] <0.6030.6811>@ejabberd_sql:check_error/2:1236 SQL query 'Q5148888' at {mod_muc_sql,{166,23}} failed: <<"timed out">>
This command only deletes rooms local to the node where it was executed; if those rooms are supposed to live on other nodes, it will not fix them. It may be a problem with syncing that data between nodes (maybe there was a connection problem when the room was deleted on another node?). Could you check whether calling get_room_options on a different node works or returns a different result?
The issue was fixed after executing the script on all nodes: get_room_options started working for these rooms on every node. Thanks @prefiks and @licaon-kter for your quick replies and the script to fix the issue.
> Do you have the hibernation option set in mod_muc (and to some low value)? It may be that the room is hibernated between the time we locate where it lives and the time we send a query to it.
I am getting the following error for a room that got hibernated. I have set a hibernation_timeout of 24 hours.
> ejabberdctl get_room_options f09b2d05eb73a582 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
in function mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
in call from erpc:execute_call/4 (erpc.erl, line 392)
I tried executing the command on all three nodes in the ejabberd cluster; the same response was received.
Is this the expected behaviour of hibernated rooms? How can I bring the rooms out of this state?
It should be unhibernated. Probably there is an entry in the online-rooms table for that room that is no longer valid (it should be removed when the room hibernates), so we don't attempt to restore it, thinking it's still alive. The command I posted above will probably fix that, but it would be nice to know why those room entries still appear in that table...
The script worked. I had made a mistake in the service name.
My understanding was that the room process for a hibernating room would be brought up whenever there is administration/activity on the room. Does hibernation apply to rooms where users are subscribed to messages (MucSub) but do not join the room for sending/receiving messages? Can you explain how room hibernation works? I couldn't find many resources.
A room will be unhibernated if someone sends a message/iq/presence to it, or when any action is done on it from ejabberdctl (at least I think I updated all commands to do that). We don't hibernate rooms that have active participants (i.e. any room where someone has sent <presence/> to it). We will, however, hibernate rooms that have only MucSub subscribers; those are unhibernated on any action. In general it should be transparent whether a room is hibernated or not, with one exception: sending a disco items query to the MUC service will not list hibernated rooms; only active rooms will be shown. Ideally I would like not to have this limitation, but we don't store the information that would be required for it in an easily accessible manner.
Thanks @prefiks for clarifying the concept. Could it be a bug with hibernation that the room is not waking up when I try the subscribe API call or the get_room_options command? I use only MucSub-based room subscription without room joining; could that be related to the issue?
I started facing a similar issue when trying to delete a room.
- ejabberd version: 22.05
- Erlang version: 12.3.1
- OS: Linux (Debian)
- Installed from: source
- 3-node cluster
Using PostgreSQL as SQL DB
Steps done:
- Deleted a room using the admin command ejabberdctl destroy_room. There were no errors, and the room record was deleted from the SQL table muc_room
- Tried to create that room again with the admin command create_room; the following message is shown:
{error,"Room already exists"}
- Tried running get_room_options for the room and got the following error:
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
in function mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
in call from erpc:execute_call/4 (erpc.erl, line 392)
I also tried running the following snippet suggested by @prefiks on all three nodes in the cluster, but the room is still in some error state and not completely deleted. I am not finding anything in error.log:
> Hm, looks like the table used for keeping the mapping from room names to handler processes has some bad entries. Is that a cluster or a single-node instance? Do you have any entries in error.log that mention mod_muc?
> You should be able to clean those records with something like this:
> lists:foreach(fun({R, H, P}) -> case node(P) == node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).
> in the debug console (just replace the host at the end).
Please help me resolve the issue.
Could you try executing those commands for your room and see what they return?
mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>).
rpc:pinfo(mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>)).
I got errors executing these commands:
([email protected])7> mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>).
** exception error: undefined function mod_muc_admin:get_room_pid/2
([email protected])6> rpc:pinfo(mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>)).
** exception error: undefined function mod_muc_admin:get_room_pid/2
Ah, right, it's only available in 22.10, so let's try this:
(fun(R, S) -> H = ejabberd_router:host_of_route(S), case mod_muc:unhibernate_room(H, S, R) of error -> no_room; {ok, P} -> {P, rpc:pinfo(P)} end end)(<<"room">>, <<"conference.server">>).
Thanks for the response. I restarted one of the nodes and was then able to delete the room. I will try the script the next time a similar issue occurs.
Thanks for your support.
I got the following response for the above command:
{<0.17927.3300>,
 [{current_function,{p1_fsm,collect_messages,3}},
  {initial_call,{proc_lib,init_p,5}},
  {status,waiting},
  {message_queue_len,0},
  {links,[<0.11423.0>]},
  {dictionary,[{'$ancestors',['mod_muc_room_sup_xxx.xxx.xx',
                              'mod_muc_sup_xxx.xxx.xx',ejabberd_gen_mod_sup,
                              ejabberd_sup,<0.11264.0>]},
               {'$internal_queue_len',0},
               {rand_seed,{#{bits => 58,jump => #Fun<rand.3.92093067>,
                             next => #Fun<rand.0.92093067>,type => exsss,
                             uniform => #Fun<rand.1.92093067>,
                             uniform_n => #Fun<rand.2.92093067>},
                           [256314510256950933|203838542858434267]}},
               {'$initial_call',{mod_muc_room,init,1}}]},
  {trap_exit,true},
  {error_handler,error_handler},
  {priority,normal},
  {group_leader,<0.11263.0>},
  {total_heap_size,10971},
  {heap_size,4185},
  {stack_size,18},
  {reductions,757200},
  {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                       {min_bin_vheap_size,46422},
                       {min_heap_size,233},
                       {fullsweep_after,65535},
                       {minor_gcs,36}]},
  {suspending,[]}]}
Hi, is there anything left to do regarding this issue, or can it be closed?
I have faced the same issue when trying to run mod_muc_admin:get_room_options. I also have 3 nodes in a cluster and no hibernation option defined in the .yml file (as far as I understand, that means no hibernation in any case).
@prefiks can you please explain what that script does?
lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).
I also need to understand why this situation occurs, and how I can catch it before end users face it on the client side.
Most likely, the tables keeping info about active rooms got desynced between cluster nodes (probably there was a network problem between the nodes when information about room changes was propagated, and stale info about a room is left somewhere). The script tries to sync the rooms table state with which rooms are actually running.
@prefiks thanks for the response. Is there any way I can become aware of such a situation before the client experiences it?
Also, regarding the network problems you mentioned: we use AWS EKS, which (we think) is not likely to have such network problems. Do you know of such problems in EKS?
Hi, I have this problem every 2 days in my 3-node cluster. Right now there is a MUC room that returns an error when I run mod_muc_admin:get_room_options on each node.
Any idea how to solve it? And moreover, how can I prevent this situation?
@prefiks do you maybe have any idea?
Also, when I run this:
(fun(R, S) -> H = ejabberd_router:host_of_route(S), case mod_muc:unhibernate_room(H, S, R) of error -> no_room; {ok, P} -> {P, rpc:pinfo(P)} end end)(<<"roomid">>, <<"conference.myserver.com">>).
I get this: {<31537.5036.1>,undefined}
You could try running this on each node; it might help:
lists:foreach(fun({R, H, P}) -> case node(P) /= node() andalso rpc:pinfo(P, status) == undefined of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.myserver.com">>)).
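Expanded with comments for readability (functionally equivalent to the one-liner above; this variant detects stale entries that point at other nodes by asking those nodes via rpc:pinfo/2):

```erlang
%% Readable equivalent of the cross-node cleanup one-liner above.
%% Run in the ejabberd debug shell on each node; replace the service host.
Service = <<"conference.myserver.com">>,
lists:foreach(
  fun({Room, Host, Pid}) ->
      %% Look at entries whose process lives on ANOTHER node and ask that
      %% node whether the process still exists (undefined = it does not).
      case node(Pid) /= node() andalso rpc:pinfo(Pid, status) == undefined of
          true ->
              %% Stale remote entry: remove it from the online-rooms table.
              mod_muc:room_destroyed(Host, Room, Pid,
                                     ejabberd_router:host_of_route(Host));
          false ->
              ok
      end
  end,
  mod_muc:get_online_rooms(Service)).
```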
Thanks @prefiks
But what causes this situation, and how can we avoid it, or at least intercept it immediately when it occurs?
@prefiks In my experience, inconsistency between cluster nodes is bound to happen, whether because of the network or for other reasons. Could you please suggest a quick check (an Erlang script) that can detect such inconsistencies in a cluster? This would be of great help and could support similar issues raised in the future.
This is most likely the result of network issues between cluster nodes: probably at some point a node got dropped from the cluster when the other nodes couldn't reach it, and operations executed during that time weren't properly propagated when the connection was restored. It's hard to tell when a desync like that happens; detecting it would probably require comparing the table contents of all nodes, but that gets tricky since those tables can change underneath you. Maybe just checking whether the table sizes match on all nodes works in most cases?
ejabberd_cluster:multicall(mnesia, table_info, [muc_online_room, size]).
This should return {[SizeA, SizeB, SizeC, ...], [NotReachableNodeA, ...]}; if the sizes differ, the tables are likely desynced.
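As a sketch of that idea, the size comparison can be wrapped into something you could run periodically from the debug shell (an illustrative heuristic, not an ejabberd API; note that equal sizes do not guarantee identical table contents):

```erlang
%% Heuristic desync check: compare muc_online_room table sizes across
%% the cluster using the multicall shown above.
CheckMucSync = fun() ->
    {Sizes, BadNodes} =
        ejabberd_cluster:multicall(mnesia, table_info, [muc_online_room, size]),
    case {lists:usort(Sizes), BadNodes} of
        {[_], []} -> ok;                      %% every reachable node agrees
        {_,   []} -> {desynced, Sizes};       %% sizes differ: likely stale entries
        _         -> {unreachable, BadNodes}  %% some nodes did not respond
    end
end,
CheckMucSync().
```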