
MUC rooms in error state

Open logicwonder opened this issue 3 years ago • 28 comments

Environment

  • ejabberd version: 22.05
  • Erlang version: 12.3.1
  • OS: Linux (Debian)
  • Installed from: source

Errors from error.log/crash.log

2022-08-05 13:46:15.144600+05:30 [info] <0.14919.7002>@mod_http_api:log/3:511 API call get_room_options [{<<"name">>,
                            <<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
                           {<<"service">>,<<"conference.xxx.xxx.xxx">>}] from ::ffff:127.0.0.1:61944
2022-08-05 13:46:15.145137+05:30 [error] <0.14919.7002>@mod_http_api:handle/4:259 REST API Error: get_room_options([{<<"name">>,
                                   <<"717e5145-e8be-46b0-8e42-9cc7974fb1f7">>},
                                  {<<"service">>,
                                   <<"conference.xxx.xxx.xxx">>}]) -> error:{badmatch,
                                                                                  {error,
                                                                                   notfound}} [{mod_muc_admin,
                                                                                                get_room_config,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/mod_muc_admin.erl"},
                                                                                                 {line,
                                                                                                  926}]},
                                                                                               {mod_muc_admin,
                                                                                                get_room_options,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/mod_muc_admin.erl"},
                                                                                                 {line,
                                                                                                  1237}]},
                                                                                               {mod_http_api,
                                                                                                handle2,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  269}]},
                                                                                               {mod_http_api,
                                                                                                handle,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  230}]},
                                                                                               {mod_http_api,
                                                                                                perform_call,
                                                                                                4,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  190}]},
                                                                                               {mod_http_api,
                                                                                                process,
                                                                                                2,
                                                                                                [{file,
                                                                                                  "src/mod_http_api.erl"},
                                                                                                 {line,
                                                                                                  143}]},
                                                                                               {ejabberd_http,
                                                                                                process,
                                                                                                2,
                                                                                                [{file,
                                                                                                  "src/ejabberd_http.erl"},
                                                                                                 {line,
                                                                                                  373}]},
                                                                                               {ejabberd_http,
                                                                                                process_request,
                                                                                                1,
                                                                                                [{file,
                                                                                                  "src/ejabberd_http.erl"},
                                                                                                 {line,
                                                                                                  496}]}]

Bug description

I am unable to get the options of an already existing room using the get_room_options API. It returns error code 400 with "Conference room does not exist" as the response. The room records are present in the muc_room table in the PostgreSQL DB.

Now when I try to delete the room using the destroy_room API, I get a 0 (success) response. I then tried creating the room again using the create_room_with_opts API, and now I get the error "Room already exists".

It looks like the room is in an error state in Mnesia. I have this issue with 30 to 40 rooms. I recently enabled message subscriptions (MucSub) for rooms. How can I debug/resolve the issue?

logicwonder avatar Aug 05 '22 10:08 logicwonder

If you try with ejabberdctl ... does it work?

licaon-kter avatar Aug 05 '22 10:08 licaon-kter

Do you have the hibernate option set in mod_muc (and to some low value)? It may be that the room gets hibernated in the time between when we locate where it lives and when we send a query to it.
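One way to see what the running node actually uses is to read the effective option value in the debug console. This is a sketch, assuming the `gen_mod:get_module_opt/3` export of recent ejabberd versions; replace the host with your own:

```erlang
%% Sketch for the ejabberd debug console: read the effective
%% hibernation_timeout of mod_muc on this node.
%% Assumes gen_mod:get_module_opt/3 as in recent ejabberd versions.
gen_mod:get_module_opt(<<"myserver.com">>, mod_muc, hibernation_timeout).
```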

prefiks avatar Aug 05 '22 10:08 prefiks

@licaon-kter I tried with ejabberdctl. I'm still stuck in this loop:

ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}
ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx xxxx.xxx.xxx
{error,"Room already exists"}

@prefiks How can I check the hibernation status of a room? I tried the following:

ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxx.xxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)

logicwonder avatar Aug 05 '22 14:08 logicwonder

@logicwonder can you get the exit code for the destroy command? echo $?

licaon-kter avatar Aug 05 '22 14:08 licaon-kter

Hm, it looks like the table that keeps the mapping from room names to handler processes has some bad entries. Is that a cluster or a single-node instance? Do you have any entries in error.log that mention mod_muc?

You should be able to clean those records with something like this:

lists:foreach(
    fun({R, H, P}) ->
        case node(P) == node() andalso not is_process_alive(P) of
            true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H));
            _ -> false
        end
    end,
    mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

in the debug console (just replace the host at the end).
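Before running anything destructive, a read-only variant of the same idea can list the suspect entries first. This is a sketch for the debug console, assuming the same `mod_muc:get_online_rooms/1` API as above; it only reports entries whose local process is dead:

```erlang
%% Read-only sketch for the ejabberd debug console (no destructive calls).
%% Lists online-room entries registered to this node whose process is dead.
%% Replace the host with your actual MUC service.
[{Room, Pid} || {Room, _Host, Pid}
                    <- mod_muc:get_online_rooms(<<"conference.myserver.com">>),
                node(Pid) =:= node(),
                not is_process_alive(Pid)].
```

An empty list means no stale local entries; any tuples returned are the rooms the cleanup snippet above would remove.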

prefiks avatar Aug 05 '22 14:08 prefiks

@licaon-kter The exit code was 0

@prefiks - This is a cluster with 3 nodes. I tried the script in the debug shell, but after execution the issue is not fixed.

Erlang/OTP 24 [erts-12.3.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Eshell V12.3.1  (abort with ^G)

([email protected])1> lists:foreach(fun({R, H, P}) ->case node(P)==node() andalso not is_process_alive(P) of true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H)); _ -> false end end, mod_muc:get_online_rooms(<<"conference.xxxx.xxxx.xxx">>)).
ok
([email protected])2> 
BREAK: (a)bort (A)bort with dump (c)ontinue (p)roc info (i)nfo
       (l)oaded (v)ersion (k)ill (D)b-tables (d)istribution
^C
ejabberd@node1:~$ ejabberdctl get_room_options 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)
ejabberd@node1:~$ echo $?
1
ejabberd@node1:~$ ejabberdctl create_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx xxxxx.xxxx.xxx
{error,"Room already exists"}
ejabberd@node1:~$ ejabberdctl destroy_room 717e5145-e8be-46b0-8e42-9cc7974fb1f7 conference.xxxxx.xxxx.xxx 
ejabberd@node1:~$ echo $?
0

Error log with mod_muc:

2022-08-04 19:43:19.668717+05:30 [error] <0.16826.6542>@ejabberd_sql:check_error/2:1236 SQL query 'Q106720625' at {mod_muc_sql,{105,9}} failed: <<"timed out">>
2022-08-04 19:43:33.855798+05:30 [error] <0.21662.6298>@ejabberd_sql:check_error/2:1236 SQL query 'Q113641781' at {mod_muc_sql,{120,23}} failed: <<"timed out">>
2022-08-04 21:14:34.091632+05:30 [error] <0.6030.6811>@ejabberd_sql:check_error/2:1236 SQL query 'Q5148888' at {mod_muc_sql,{166,23}} failed: <<"timed out">>

logicwonder avatar Aug 05 '22 16:08 logicwonder

This command deletes only rooms local to the node where it was executed; if those rooms are supposed to live on other nodes, it will not fix them. It may be a problem with syncing that data between nodes (maybe there was a connection problem when the room was deleted on another node?). Could you check whether calling get_room_options on a different node works or returns a different result?

prefiks avatar Aug 05 '22 16:08 prefiks

The issue was fixed after executing the script on all nodes. get_room_options started working for these rooms on all nodes. Thanks @prefiks and @licaon-kter for your instant replies and for the script that fixed the issue.

logicwonder avatar Aug 06 '22 05:08 logicwonder

Do you have the hibernate option set in mod_muc (and to some low value)? It may be that the room gets hibernated in the time between when we locate where it lives and when we send a query to it.

I am getting the following error for a room that got hibernated. I have set a hibernation_timeout of 24 hours.

> ejabberdctl get_room_options f09b2d05eb73a582 conference.xxxxx.xxxx.xxx
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392)

I tried executing the command on all three nodes in the ejabberd cluster and received the same response.

Is this the expected behaviour of hibernated rooms? How can I bring the rooms out of this state?

logicwonder avatar Aug 16 '22 12:08 logicwonder

It should be unhibernated. Probably there is an entry in the online rooms table for that room that is no longer valid (it should be removed when the room hibernates), so we don't attempt to restore it, thinking it's still alive. The command I posted above will probably fix that, but it would be nice to know why those room entries still appear in that table...
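A hedged way to check a single room's entry is to look up its registered pid and probe it. This sketch assumes the `mod_muc:find_online_room/2` export, which mod_muc_admin itself uses internally in this version:

```erlang
%% Sketch for the debug console, assuming mod_muc:find_online_room/2 is
%% exported in this version. Looks up the pid registered for a room and
%% probes whether that process still answers.
case mod_muc:find_online_room(<<"room">>, <<"conference.myserver.com">>) of
    {ok, Pid} -> {Pid, rpc:pinfo(Pid, status)};  %% undefined status => stale entry
    error     -> no_online_entry                 %% no entry: hibernated or gone
end.
```

A `{Pid, undefined}` result would match the "stale entry pointing at a dead process" hypothesis described above.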

prefiks avatar Aug 16 '22 12:08 prefiks

The script worked. I had made a mistake in the service name.

My understanding was that the room process for a hibernating room would be brought up whenever there is administration/activity on the room. Is hibernation applicable to rooms where users are subscribed to messages (MucSub) but do not join the room to send/receive messages? Can you explain how room hibernation works? I couldn't find many resources.

logicwonder avatar Aug 16 '22 13:08 logicwonder

A room will be unhibernated if someone sends a message/iq/presence to it, or when any action is performed on it from ejabberdctl (at least I think I updated all commands to do that). We don't hibernate rooms that have active participants (i.e. any room where someone has sent <presence/> to it). We will, however, hibernate rooms that have only MucSub subscribers, and we unhibernate them on any action. In general it should be transparent whether a room is hibernated or not, with one exception: sending a disco items query to the MUC service will not list hibernated rooms, only active ones. Ideally I would like not to have this limitation, but we don't store the information that would be required for it in an easily accessible manner.

prefiks avatar Aug 16 '22 13:08 prefiks

Thanks @prefiks for clarifying the concept. Could it be a hibernation bug that the room is not waking up when I try the subscribe API call or the get_room_options command? I only have MucSub-based room subscriptions without room joining; could that be related to the issue?

logicwonder avatar Aug 16 '22 14:08 logicwonder

I started facing a similar issue when trying to delete a room.

  • ejabberd version: 22.05
  • Erlang version: 12.3.1
  • OS: Linux (Debian)
  • Installed from: source
  • 3 node cluster

Using PostgreSQL as SQL DB

Steps done:

  1. Deleted a room using the admin command ejabberdctl destroy_room. There were no errors, and the room record was deleted from the SQL table muc_room.
  2. Tried to create that room again with the admin command create_room; the following message was shown:
{error,"Room already exists"}
  3. Tried running get_room_options for the room and got the following error:
Unhandled exception occurred executing the command:
** exception error: no match of right hand side value {error,notfound}
   in function  mod_muc_admin:get_room_config/1 (src/mod_muc_admin.erl, line 926)
   in call from mod_muc_admin:get_room_options/1 (src/mod_muc_admin.erl, line 1237)
   in call from ejabberd_ctl:call_command/4 (src/ejabberd_ctl.erl, line 332)
   in call from ejabberd_ctl:try_call_command/4 (src/ejabberd_ctl.erl, line 297)
   in call from ejabberd_ctl:process2/4 (src/ejabberd_ctl.erl, line 235)
   in call from ejabberd_ctl:process/2 (src/ejabberd_ctl.erl, line 213)
   in call from erpc:execute_call/4 (erpc.erl, line 392) 

I also tried running the following snippet suggested by @prefiks on all three nodes in the cluster, but the room is still in some error state and not completely deleted. I am not finding anything in error.log.

Hm, looks like table used for keeping mapping from room name to handler processes have some bad entries. Is that cluster or single node instance? Do you have any entry in error.log that talk about mod_muc?

You should be able to clean those records with something like that:

lists:foreach(
    fun({R, H, P}) ->
        case node(P) == node() andalso not is_process_alive(P) of
            true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H));
            _ -> false
        end
    end,
    mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

in debug console (just replace host at end).

Please help me resolve the issue.

logicwonder avatar Dec 14 '22 07:12 logicwonder

Could you try executing those commands for your room and see what they return?

mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>).
rpc:pinfo(mod_muc_admin:get_room_pid(<<"room">>, <<"conference.myserver.com">>)).

prefiks avatar Dec 15 '22 10:12 prefiks

I got error executing these commands

([email protected])7> mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>).
** exception error: undefined function mod_muc_admin:get_room_pid/2

([email protected])6> rpc:pinfo(mod_muc_admin:get_room_pid(<<"xxxxxx">>, <<"conference.xxx.xxx.xxx">>)). ** exception error: undefined function mod_muc_admin:get_room_pid/2

logicwonder avatar Dec 16 '22 05:12 logicwonder

Ah, right, it's only available in 22.10, so let's try this:

(fun(R, S) ->
     H = ejabberd_router:host_of_route(S),
     case mod_muc:unhibernate_room(H, S, R) of
         error -> no_room;
         {ok, P} -> {P, rpc:pinfo(P)}
     end
 end)(<<"room">>, <<"conference.server">>).

prefiks avatar Dec 16 '22 08:12 prefiks

Thanks for the response. I restarted one of the nodes and was then able to delete the room. I shall try the script the next time a similar issue occurs.

Thanks for your support.

I got the following response for the above command:

{<0.17927.3300>,
 [{current_function,{p1_fsm,collect_messages,3}},
  {initial_call,{proc_lib,init_p,5}},
  {status,waiting},
  {message_queue_len,0},
  {links,[<0.11423.0>]},
  {dictionary,[{'$ancestors',['mod_muc_room_sup_xxx.xxx.xx',
                              'mod_muc_sup_xxx.xxx.xx',ejabberd_gen_mod_sup,
                              ejabberd_sup,<0.11264.0>]},
               {'$internal_queue_len',0},
               {rand_seed,{#{bits => 58,jump => #Fun<rand.3.92093067>,
                             next => #Fun<rand.0.92093067>,type => exsss,
                             uniform => #Fun<rand.1.92093067>,
                             uniform_n => #Fun<rand.2.92093067>},
                           [256314510256950933|203838542858434267]}},
               {'$initial_call',{mod_muc_room,init,1}}]},
  {trap_exit,true},
  {error_handler,error_handler},
  {priority,normal},
  {group_leader,<0.11263.0>},
  {total_heap_size,10971},
  {heap_size,4185},
  {stack_size,18},
  {reductions,757200},
  {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                       {min_bin_vheap_size,46422},
                       {min_heap_size,233},
                       {fullsweep_after,65535},
                       {minor_gcs,36}]},
  {suspending,[]}]}

logicwonder avatar Dec 17 '22 10:12 logicwonder

Hi, is there any task to do regarding this issue, or can it get closed?

badlop avatar Jun 21 '23 12:06 badlop

I have faced the same issue when trying to run mod_muc_admin:get_room_options. I also have 3 nodes in the cluster and no hibernation option defined in the .yml file (as far as I understand, that means no hibernation in any case).

@prefiks can you please explain what that script does?

lists:foreach(
    fun({R, H, P}) ->
        case node(P) == node() andalso not is_process_alive(P) of
            true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H));
            _ -> false
        end
    end,
    mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

Also, I need to understand why this situation occurs and how I can catch it before end users face it on the client side.

relbraun avatar Jul 12 '23 08:07 relbraun

Most likely, the tables keeping info about active rooms got desynced between cluster nodes (probably there was a network problem between the nodes when the room changes happened, and there is stale info about the room somewhere). The script tries to sync the rooms table state with the rooms that are actually running.

prefiks avatar Jul 12 '23 10:07 prefiks

@prefiks thanks for the response. Is there any way I can become aware of such a situation before clients experience it?

Also, regarding the network problems you mentioned: we use AWS EKS, which (we think) is unlikely to have such network problems. Do you know of such problems in EKS?

relbraun avatar Jul 12 '23 12:07 relbraun

Hi, I have this problem every 2 days in my 3-node cluster. Right now there is a MUC room that returns an error when I run mod_muc_admin:get_room_options on each node. Any idea how to solve it? And moreover, how can I prevent this situation? @prefiks maybe you have an idea?

relbraun avatar Nov 29 '23 15:11 relbraun

And also, when I run this:

(fun(R, S) ->
     H = ejabberd_router:host_of_route(S),
     case mod_muc:unhibernate_room(H, S, R) of
         error -> no_room;
         {ok, P} -> {P, rpc:pinfo(P)}
     end
 end)(<<"roomid">>, <<"conference.myswerver.com">>).

I get this: {<31537.5036.1>,undefined}

relbraun avatar Nov 29 '23 18:11 relbraun

You could try running this on each node; it might help:

lists:foreach(
    fun({R, H, P}) ->
        case node(P) /= node() andalso rpc:pinfo(P, status) == undefined of
            true -> mod_muc:room_destroyed(H, R, P, ejabberd_router:host_of_route(H));
            _ -> false
        end
    end,
    mod_muc:get_online_rooms(<<"conference.myserver.com">>)).

prefiks avatar Nov 30 '23 09:11 prefiks

Thanks @prefiks

But what causes this situation, and how can we avoid it or at least intercept it immediately when it occurs?

relbraun avatar Nov 30 '23 10:11 relbraun

@prefiks In my experience, inconsistency between cluster nodes is bound to happen, whether because of the network or other reasons. Could you please suggest a cursory check (an Erlang script) that can detect such inconsistencies in a cluster? That could be of great help and may make it easier to support issues raised in the future.

logicwonder avatar Nov 30 '23 10:11 logicwonder

This is most likely the result of network issues between cluster nodes: probably at some point a node got dropped from the cluster when the other nodes couldn't reach it, and operations executed during that time weren't properly propagated when the connection was restored. It's hard to tell when a desync like that happens; detecting it would probably require comparing the table contents of all nodes, which gets tricky since those tables can change underneath. Maybe just checking whether the table sizes match on all nodes would work in most cases:

ejabberd_cluster:multicall(mnesia, table_info, [muc_online_room, size]).

This should return {[SizeA, SizeB, SizeC, ...], [NotReachableNodeA, ...]}; if the Size* values differ, the tables are likely desynced.
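Building on that idea, the size comparison can be wrapped into a yes/no check. This is a sketch under the same assumptions (the `ejabberd_cluster:multicall/3` call shown above, run from any node's debug console); note that matching sizes don't strictly prove the contents match:

```erlang
%% Sketch: flag a possible desync when the muc_online_room table sizes
%% differ across reachable cluster nodes, or when some nodes are unreachable.
CheckSync = fun() ->
    {Sizes, BadNodes} =
        ejabberd_cluster:multicall(mnesia, table_info, [muc_online_room, size]),
    case {lists:usort(Sizes), BadNodes} of
        {[_], []} -> tables_in_sync;            %% all nodes report the same size
        _         -> {possible_desync, Sizes, BadNodes}
    end
end,
CheckSync().
```

Such a check could be run periodically (e.g. from a cron job via `ejabberdctl` or an escript) to catch the problem before clients do.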

prefiks avatar Nov 30 '23 11:11 prefiks