RabbitMQ 3.10.7 - Windows - disk check crashing
Related disk monitor issues:
- #5726
- #5721
- #5681
- #4140
- #4328 (not implemented)
- #3895
- https://github.com/rabbitmq/rabbitmq-server/discussions/4753
Other disk monitor issues:
- #1959
- #2095
- erlang/otp#6156
- erlang/otp#6284
- erlang/otp#6283
Discussed in https://github.com/rabbitmq/rabbitmq-server/discussions/5767
Originally posted by duncanjread, September 13, 2022

We have an issue with the disk checker crashing/restarting RabbitMQ every day, exactly 24 hours after the service was restarted. We saw this in previous versions, with errors about the PowerShell script timing out, so we updated to 3.10.7 with Erlang 24.3.3 to try to stop it happening; I believe the disk checking was changed to use Handle. It still happens, though the error message is different.

It's running on a Windows Server 2016 Core VM with 12 GB RAM and 12 vCores. We have four servers in total, at two datacentres; each server is in a cluster of two. They all show the same behaviour.

Is there any way to disable the disk checking completely? We have a completely separate monitoring system, and these are dedicated RabbitMQ servers with plenty of disk space.
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** Generic server rabbit_disk_monitor terminating
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** Last message in was update
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** When Server state == {state,"c:/RabbitMQBase/db/rabbit@EDB-INT-RMQ-01-mnesia",
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> 50000000,17185275904,100,10000,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> #Ref<0.4175755648.1574174723.107504>,false,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> true,10,120000}
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** Reason for termination ==
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** {{timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> {rabbit_disk_monitor,get_disk_free,2,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> [{file,"rabbit_disk_monitor.erl"},{line,260}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> {rabbit_disk_monitor,internal_update,1,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> [{file,"rabbit_disk_monitor.erl"},{line,217}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> {rabbit_disk_monitor,handle_info,2,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> [{file,"rabbit_disk_monitor.erl"},{line,166}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,695}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,771}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> crasher:
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> initial call: rabbit_disk_monitor:init/1
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> pid: <0.430.0>
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> registered_name: rabbit_disk_monitor
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> exception exit: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> in function gen_server:call/2 (gen_server.erl, line 239)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> in call from rabbit_disk_monitor:get_disk_free/2 (rabbit_disk_monitor.erl, line 260)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> in call from rabbit_disk_monitor:internal_update/1 (rabbit_disk_monitor.erl, line 217)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> in call from rabbit_disk_monitor:handle_info/2 (rabbit_disk_monitor.erl, line 166)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> in call from gen_server:try_dispatch/4 (gen_server.erl, line 695)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> in call from gen_server:handle_msg/6 (gen_server.erl, line 771)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ancestors: [rabbit_disk_monitor_sup,rabbit_sup,<0.218.0>]
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> message_queue_len: 0
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> messages: []
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> links: [<0.429.0>]
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> dictionary: []
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> trap_exit: false
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> status: running
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> heap_size: 6772
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> stack_size: 29
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> reductions: 1087159
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> neighbours:
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> supervisor: {local,rabbit_disk_monitor_sup}
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> errorContext: child_terminated
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> offender: [{pid,<0.430.0>},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> {id,rabbit_disk_monitor},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> {mfargs,{rabbit_disk_monitor,start_link,[50000000]}},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> {restart_type,{transient,1}},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> {shutdown,300000},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0> {child_type,worker}]
2022-09-13 10:47:30.888000+01:00 [info] <0.27972.2> Enabling free disk space monitoring
2022-09-13 10:47:30.888000+01:00 [info] <0.27972.2> Disk free limit set to 50MB
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** Generic server rabbit_disk_monitor terminating
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** Last message in was update
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** When Server state == {state,"c:/RabbitMQBase/db/rabbit@EDB-INT-RMQ-01-mnesia",
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> 50000000,17166462976,100,10000,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> #Ref<0.4175755648.1574174725.34453>,false,true,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> 10,120000}
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** Reason for termination ==
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** {{timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> {rabbit_disk_monitor,get_disk_free,2,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> [{file,"rabbit_disk_monitor.erl"},{line,260}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> {rabbit_disk_monitor,internal_update,1,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> [{file,"rabbit_disk_monitor.erl"},{line,217}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> {rabbit_disk_monitor,handle_info,2,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> [{file,"rabbit_disk_monitor.erl"},{line,166}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,695}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,771}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> crasher:
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> initial call: rabbit_disk_monitor:init/1
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> pid: <0.27972.2>
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> registered_name: rabbit_disk_monitor
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> exception exit: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> in function gen_server:call/2 (gen_server.erl, line 239)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> in call from rabbit_disk_monitor:get_disk_free/2 (rabbit_disk_monitor.erl, line 260)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> in call from rabbit_disk_monitor:internal_update/1 (rabbit_disk_monitor.erl, line 217)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> in call from rabbit_disk_monitor:handle_info/2 (rabbit_disk_monitor.erl, line 166)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> in call from gen_server:try_dispatch/4 (gen_server.erl, line 695)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> in call from gen_server:handle_msg/6 (gen_server.erl, line 771)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> ancestors: [rabbit_disk_monitor_sup,rabbit_sup,<0.218.0>]
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> message_queue_len: 0
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> messages: []
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> links: [<0.429.0>]
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> dictionary: []
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> trap_exit: false
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> status: running
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> heap_size: 6772
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> stack_size: 29
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> reductions: 20233
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> neighbours:
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> supervisor: {local,rabbit_disk_monitor_sup}
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> errorContext: child_terminated
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> offender: [{pid,<0.27972.2>},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> {id,rabbit_disk_monitor},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> {mfargs,{rabbit_disk_monitor,start_link,[50000000]}},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> {restart_type,{transient,1}},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> {shutdown,300000},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0> {child_type,worker}]
2022-09-13 10:47:51.033000+01:00 [info] <0.27994.2> Enabling free disk space monitoring
2022-09-13 10:47:51.033000+01:00 [info] <0.27994.2> Disk free limit set to 50MB
@duncanjread - thanks for your report. Even though these errors look serious, the disk monitor will restart. There is no way at this time to disable disk space checking.
This error will NOT crash RabbitMQ or cause it to restart, and I'm assuming that's not actually happening in your environment.
I changed the disk check to use a simple utility that ships with Erlang and uses the Win32 API to read the free disk space. I'm pretty surprised to see that retrieving the free disk space is timing out.
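For the curious: the helper in question is os_mon_sysinfo from Erlang's os_mon application. Here is a rough sketch of reading free space through it from an Erlang shell. This is illustrative only, not the actual rabbit_disk_monitor code, and the field layout is assumed from the sample output further down in this thread (drive root, drive type, then three byte counters):

%% Sketch, not the shipped code: read the first byte counter for a drive.
%% Assumes lines shaped like "C:\\ DRIVE_FIXED <avail> <total> <free>\n".
FreeBytes = fun(Drive) ->
    [Line | _] = os_mon_sysinfo:get_disk_info(Drive),
    [_Root, _Type, AvailStr | _] = string:lexemes(Line, " \n"),
    list_to_integer(AvailStr)
end,
FreeBytes("C:\\").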
What is the rest of your system load like when these free space checks time out? If it happens at exactly the same time in all systems, what else is happening to add load to the servers?
Finally, two-node clusters are never recommended for RabbitMQ. Three nodes is the minimum.
@lukebakken Hi, thanks for the reply. Yes, you're right, it's probably not crashing Rabbit per se, but we can't publish any messages onto the queues at those times. We get error messages about not being able to publish, e.g.:
Failed to publish SettlementAttempt (1/1) : 1015849/28243/13092022095225083 - System.OperationCanceledException: The operation was canceled. at GreenPipes.Internals.Extensions.TaskExtensions.<>c__DisplayClass4_0.<<OrCanceled>g__WaitAsync|0>d.MoveNext()
or
Failed to publish SettlementAttempt (1/1) : 798304/25703/12092022174639000 - RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Library, code=541, text='Unexpected Exception', classId=0, methodId=0, cause=System.Net.Sockets.SocketException (0x80004005): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
We've looked at load and logs pretty extensively, as this has been going on for a while; I only managed to update to 3.10.7 yesterday. We've also looked at AV etc. and ruled them all out. The server's not under any load when this happens, and as far as I can tell nothing else is happening on the server or network.
All I can see is that on each node, exactly 24 hours after the last service restart, the disk checker has problems. As I updated each node about 10 minutes apart, the disk errors are 10 minutes apart on each server; they don't all happen at the same time across the board. I'll grab the whole logs.
I can add another server to each cluster pretty easily and quickly.
Thanks!
I see in the logs that RabbitMQ was restarted several times, like here (from node 02):
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> Server startup complete; 6 plugins started.
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> * rabbitmq_shovel_management
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> * rabbitmq_shovel
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> * rabbitmq_delayed_message_exchange
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> * rabbitmq_management
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> * rabbitmq_web_dispatch
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> * rabbitmq_management_agent
2022-09-12 10:51:28.364000+01:00 [info] <0.93.0> RabbitMQ is asked to stop...
2022-09-12 10:51:28.848000+01:00 [info] <0.93.0> Stopping RabbitMQ applications and their dependencies in the following order:
Notice that the stop happened right after the start. This is a restart that happened after you had updated to 3.10.7. Is there some monitoring tool that looks for errors in the log file and then auto-restarts RabbitMQ?
The exception from your application shows that it could not connect to RabbitMQ (which of course prevents publishing). This could happen during one of the restarts shown in the logs.
At any rate, I can adjust the code to take these sorts of timeouts into account. Since you can reproduce this issue in your environment, it would be great if you could test out my fixes (I can provide an installer). Thanks.
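To give an idea of what I mean, here is a hedged sketch: catch the exit that gen_server:call/2 raises on timeout and fall back to the last known reading, rather than letting the monitor crash. Illustrative only, not the final patch:

%% Sketch only: tolerate a slow os_mon_sysinfo call instead of crashing.
%% gen_server:call/2 exits with {timeout, {gen_server, call, _}} when the
%% helper doesn't answer in time.
GetDiskFreeSafe = fun(Drive, LastKnown) ->
    try os_mon_sysinfo:get_disk_info(Drive) of
        Info -> {fresh, Info}
    catch
        exit:{timeout, {gen_server, call, _}} ->
            {stale, LastKnown}  %% keep the old value, retry on the next tick
    end
end.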
Hi, that might have been me; I rebooted the servers after I updated them. Although there is monitoring, there's nothing that will restart a service. I've been double-checking the AV exclusions, but they seem to be fine.
If you could, that would be great; I can update pretty quickly and test, as it's causing us a few issues at the moment. Thanks!
Aha, antivirus. You know I bet that's the cause. Can you completely disable it for a day or two to see if that fixes the issue?
Again, these disk monitor errors should not affect your applications. I'm not convinced this is what is causing your application errors. Check these log entries from node 0:
2022-09-12 10:50:20.710000+01:00 [info] <0.1241.0> accepting AMQP connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671)
2022-09-12 10:50:20.710000+01:00 [info] <0.1241.0> connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671) has a client-provided name: w3wp
2022-09-12 10:50:20.710000+01:00 [info] <0.1241.0> connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671 - w3wp): user 'settlementInternal' authenticated and granted access to vhost '/'
2022-09-13 03:30:04.215000+01:00 [warning] <0.1241.0> closing AMQP connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671 - w3wp, vhost: '/', user: 'settlementInternal'):
2022-09-13 03:30:04.215000+01:00 [warning] <0.1241.0> client unexpectedly closed TCP connection
"client unexpectedly closed TCP connection" means that your application or a network device closed the TCP socket, not RabbitMQ. At that point in the logs, there are no disk monitor errors.
Hi, I can try and turn the AV off on one of the servers and let it run without it, I'd need to get sign off to do that.
Those 03:00 and 03:30 events are when we recycle IIS on the client servers overnight; I think it just severs the connection.
Thanks. As you've seen, we've been dealing with this issue on Windows for a while now and I've never been able to reproduce it, which is why I'm guessing it may be AV-related.
A way to toggle it on or off would be ideal?
I can see what the rest of the team thinks. The problem is that as soon as we provide that as an option, we have to maintain it, and we also have to deal with users who disable it and then complain when they fill their disks and RabbitMQ stops working.
@duncanjread question - do these servers ONLY have fixed, local storage or do they also have SAN disks, network mapped drives, anything not local on them?
If you could run this command on one or more servers, that would be great:
rabbitmqctl.bat eval "os_mon_sysinfo:get_disk_info()."
The output will be something like this:
["C:\\ DRIVE_FIXED 587094192128 1013310287872 587094192128\n",
"Y:\\ DRIVE_REMOTE 0 0 0\n","Z:\\ DRIVE_REMOTE 0 0 0\n"]
@duncanjread once you're done collecting information for me (which I very much appreciate), see this issue:
https://github.com/rabbitmq/rabbitmq-server/pull/1959
You could disable the disk monitor by running the following:
rabbitmqctl.bat eval "rabbit_disk_monitor:set_enabled(false)."
BUT! Be sure that it has at least read the free space once:
rabbitmqctl.bat eval "rabbit_disk_monitor:get_disk_free()."
587086852096
...or a disk alarm may be set.
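Putting those two steps together, here is a sketch of a guard you could run via eval, using only the two functions shown above; it refuses to disable the monitor until a positive reading has been observed:

%% Sketch: disable only after at least one successful free-space reading,
%% otherwise a disk alarm may be left set (see the note above).
case rabbit_disk_monitor:get_disk_free() of
    Free when is_integer(Free), Free > 0 ->
        rabbit_disk_monitor:set_enabled(false);
    _ ->
        {error, no_successful_reading_yet}
end.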
Disabling free disk space monitoring has risks. A lot of people do appreciate these built-in RabbitMQ metrics because they don't monitor anything externally.
Hi, sorry I was off yesterday.
The servers only have fixed local storage.
I'll run through those commands as well and get the output back to you.
Our Security Team aren't keen on disabling the AV fully. I added some exclusions the day before yesterday (in fact some were already in place), but it happened again yesterday. I'll keep pushing for a full disable of the AV, at least to fully rule that out.
Thanks for your help with this, we really appreciate it!
Hi, I've run the commands on all four servers:
MST-INT-RMQ-01 ["C:\ DRIVE_FIXED 13530664960 63830994944 13530664960\n", "D:\ DRIVE_CDROM 0 0 0\n"]
MST-INT-RMQ-02 ["C:\ DRIVE_FIXED 11020615680 53092528128 11020615680\n", "D:\ DRIVE_CDROM 0 0 0\n"]
EDB-INT-RMQ-01 ["C:\ DRIVE_FIXED 17169444864 63830994944 17169444864\n", "D:\ DRIVE_CDROM 0 0 0\n"]
EDB-INT-RMQ-02 ["C:\ DRIVE_FIXED 10172817408 58462285824 10172817408\n", "D:\ DRIVE_CDROM 0 0 0\n"]
I've also run the disk space check commands, and will check tomorrow how those get on and report back.
but it happened yesterday again
Did it happen at the same time as the other events?
Only one of the servers had application errors in our logs and it was the same time as usual. However, there weren't any disk check errors in the logs at that time.
One thing to note is that we only get application errors if the apps are being used at the time the issue happens; if nothing's going through them, we wouldn't see anything going wrong.
Hi, just checking over the weekend and yesterday: we don't have any errors in the logs now, apart from when clients are disconnected during our 03:00 and 03:30 IIS resets. However, we are still getting application errors exactly 24 hours from the time the service was restarted. Is there anything in RabbitMQ that runs on a 24-hour timer, once a day? I've added AV exclusions for various folders, but haven't yet got permission to turn the whole AV off on one server.
There are no timers that use such long intervals in RabbitMQ.
Hey @duncanjread that's very interesting and suggests that AV is affecting the free disk space query.
As for your application errors, does RabbitMQ log anything at that time? What is the exact error your application reports? Is it the same as in this comment?
Hi, I've made a bit of a breakthrough with this. We also use a piece of SIEM software called Wazuh, which has a built-in module called Security Configuration Assessment (SCA). This runs every 12 hours and does some kind of check against the registry, then matches it up with some rules to produce a report of "bad" registry entries. It doesn't look that invasive and says it takes 3 seconds to run, but the times match up with the issues in our logs across all the servers. I think the 24-hour thing was a red herring, because when I rebooted the servers that would have reset the Wazuh timer as well as restarting RabbitMQ. I can try and reach out to Wazuh and see why the scan would be causing these problems.
I've disabled it now and will monitor across the next few days and let you know how it goes!
Thanks!
Hey @duncanjread - care to provide an update? I'm assuming you haven't seen this issue again.
Hi, it is still happening. I've shifted my attention to what the Wazuh checks are doing; I've opened a ticket with them but have yet to run through a few things they've suggested.
Thanks for the update. I'm assuming it's still happening at the same time as the "Wazuh" SCA check.
That's the strange thing: I've disabled the checks and it's still happening at the same times as before, when those SCA checks used to run. Still investigating at the moment.