rabbitmq-server icon indicating copy to clipboard operation
rabbitmq-server copied to clipboard

RabbitMQ 3.10.7 - Windows - disk check crashing

Open lukebakken opened this issue 2 years ago • 22 comments

Related disk monitor issues:

  • #5726
  • #5721
  • #5681
  • #4140
  • #4328 (not implemented)
  • #3895
  • https://github.com/rabbitmq/rabbitmq-server/discussions/4753

Other disk monitor issues:

  • #1959
  • #2095
  • erlang/otp#6156
  • erlang/otp#6284
  • erlang/otp#6283

Discussed in https://github.com/rabbitmq/rabbitmq-server/discussions/5767

Originally posted by duncanjread September 13, 2022 We have an issue with the disk checker crashing/restarting RabbitMQ every day, exactly 24 hours after the service was restarted. We saw this in previous versions with errors about the PowerShell script timing out so updated to 3.10.7 with Erlang 24.3.3 to try and stop this happening, I believe the disk checking was changed to use Handle. It still happens, though the error message is different. It's running on a Windows Server 2016 core VM with 12GB RAM and 12 vCores. We have four servers in total, at two datacentres, each server is in a cluster of two. They all have the same behaviour. Is there anyway to disable the disk checking completely, we have a completely separate monitoring system and these are dedicated RabbitMQ servers with plenty of disk space.

2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** Generic server rabbit_disk_monitor terminating
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** Last message in was update
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** When Server state == {state,"c:/RabbitMQBase/db/rabbit@EDB-INT-RMQ-01-mnesia",
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>                                50000000,17185275904,100,10000,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>                                #Ref<0.4175755648.1574174723.107504>,false,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>                                true,10,120000}
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** Reason for termination ==
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> ** {{timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>      {rabbit_disk_monitor,get_disk_free,2,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>                           [{file,"rabbit_disk_monitor.erl"},{line,260}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>      {rabbit_disk_monitor,internal_update,1,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>                           [{file,"rabbit_disk_monitor.erl"},{line,217}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>      {rabbit_disk_monitor,handle_info,2,
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>                           [{file,"rabbit_disk_monitor.erl"},{line,166}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>      {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,695}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,771}]},
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> 
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>   crasher:
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     initial call: rabbit_disk_monitor:init/1
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     pid: <0.430.0>
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     registered_name: rabbit_disk_monitor
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     exception exit: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>       in function  gen_server:call/2 (gen_server.erl, line 239)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>       in call from rabbit_disk_monitor:get_disk_free/2 (rabbit_disk_monitor.erl, line 260)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>       in call from rabbit_disk_monitor:internal_update/1 (rabbit_disk_monitor.erl, line 217)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>       in call from rabbit_disk_monitor:handle_info/2 (rabbit_disk_monitor.erl, line 166)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>       in call from gen_server:try_dispatch/4 (gen_server.erl, line 695)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>       in call from gen_server:handle_msg/6 (gen_server.erl, line 771)
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     ancestors: [rabbit_disk_monitor_sup,rabbit_sup,<0.218.0>]
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     message_queue_len: 0
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     messages: []
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     links: [<0.429.0>]
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     dictionary: []
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     trap_exit: false
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     status: running
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     heap_size: 6772
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     stack_size: 29
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>     reductions: 1087159
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0>   neighbours:
2022-09-13 10:47:30.869000+01:00 [error] <0.430.0> 
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>     supervisor: {local,rabbit_disk_monitor_sup}
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>     errorContext: child_terminated
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>     reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>     offender: [{pid,<0.430.0>},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>                {id,rabbit_disk_monitor},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>                {mfargs,{rabbit_disk_monitor,start_link,[50000000]}},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>                {restart_type,{transient,1}},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>                {shutdown,300000},
2022-09-13 10:47:30.884000+01:00 [error] <0.429.0>                {child_type,worker}]
2022-09-13 10:47:30.888000+01:00 [info] <0.27972.2> Enabling free disk space monitoring
2022-09-13 10:47:30.888000+01:00 [info] <0.27972.2> Disk free limit set to 50MB
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** Generic server rabbit_disk_monitor terminating
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** Last message in was update
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** When Server state == {state,"c:/RabbitMQBase/db/rabbit@EDB-INT-RMQ-01-mnesia",
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>                                50000000,17166462976,100,10000,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>                                #Ref<0.4175755648.1574174725.34453>,false,true,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>                                10,120000}
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** Reason for termination ==
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> ** {{timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>     [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>      {rabbit_disk_monitor,get_disk_free,2,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>                           [{file,"rabbit_disk_monitor.erl"},{line,260}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>      {rabbit_disk_monitor,internal_update,1,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>                           [{file,"rabbit_disk_monitor.erl"},{line,217}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>      {rabbit_disk_monitor,handle_info,2,
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>                           [{file,"rabbit_disk_monitor.erl"},{line,166}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>      {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,695}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,771}]},
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2022-09-13 10:47:51.025000+01:00 [error] <0.27972.2> 
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>   crasher:
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     initial call: rabbit_disk_monitor:init/1
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     pid: <0.27972.2>
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     registered_name: rabbit_disk_monitor
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     exception exit: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>       in function  gen_server:call/2 (gen_server.erl, line 239)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>       in call from rabbit_disk_monitor:get_disk_free/2 (rabbit_disk_monitor.erl, line 260)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>       in call from rabbit_disk_monitor:internal_update/1 (rabbit_disk_monitor.erl, line 217)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>       in call from rabbit_disk_monitor:handle_info/2 (rabbit_disk_monitor.erl, line 166)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>       in call from gen_server:try_dispatch/4 (gen_server.erl, line 695)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>       in call from gen_server:handle_msg/6 (gen_server.erl, line 771)
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     ancestors: [rabbit_disk_monitor_sup,rabbit_sup,<0.218.0>]
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     message_queue_len: 0
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     messages: []
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     links: [<0.429.0>]
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     dictionary: []
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     trap_exit: false
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     status: running
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     heap_size: 6772
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     stack_size: 29
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>     reductions: 20233
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2>   neighbours:
2022-09-13 10:47:51.026000+01:00 [error] <0.27972.2> 
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>     supervisor: {local,rabbit_disk_monitor_sup}
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>     errorContext: child_terminated
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>     reason: {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>     offender: [{pid,<0.27972.2>},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>                {id,rabbit_disk_monitor},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>                {mfargs,{rabbit_disk_monitor,start_link,[50000000]}},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>                {restart_type,{transient,1}},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>                {shutdown,300000},
2022-09-13 10:47:51.026000+01:00 [error] <0.429.0>                {child_type,worker}]
2022-09-13 10:47:51.033000+01:00 [info] <0.27994.2> Enabling free disk space monitoring
2022-09-13 10:47:51.033000+01:00 [info] <0.27994.2> Disk free limit set to 50MB

lukebakken avatar Sep 13 '22 13:09 lukebakken

@duncanjread - thanks for your report. Even though these errors look serious, the disk monitor will restart. There is no way at this time to disable disk space checking.

This error will NOT crash RabbitMQ or cause it to restart. I am assuming it is not.

I changed the disk check to use a simple utility that ships with Erlang that uses the Win32 API to read the free disk space. I'm pretty surprised to see that retrieving disk space is timing out.

What is the rest of your system load like when these free space checks time out? If it happens at exactly the same time in all systems, what else is happening to add load to the servers?

Finally, two-node clusters are never recommended for RabbitMQ. Three nodes is the minimum.

lukebakken avatar Sep 13 '22 13:09 lukebakken

@lukebakken Hi, thank's for the reply. Yes, you're right it's probably not crashing Rabbit per se, but we can't publish any messages onto the queues at those times. We get error messages about not being able to publish, eg:

Failed to publish SettlementAttempt (1/1) : 1015849/28243/13092022095225083 - System.OperationCanceledException: The operation was canceled. at GreenPipes.Internals.Extensions.TaskExtensions.<>c__DisplayClass4_0.<<OrCanceled>g__WaitAsync|0>d.MoveNext()

or

Failed to publish SettlementAttempt (1/1) : 798304/25703/12092022174639000 - RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Library, code=541, text='Unexpected Exception', classId=0, methodId=0, cause=System.Net.Sockets.SocketException (0x80004005): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)

We've looked at load and logs pretty extensively as this has been going on for a while, I only managed to update to 3.10.7 yesterday. Also AV etc and have ruled them all out, the server's not under any load when this happens and as far as I can tell nothing else is happening on the server or network.

All I can see is that on each node exactly 24 hours after the last service restart, the disk checker has problems. As I updated each node about 10 minutes apart, the disk errors are 10 minutes apart on each server, They don't all happen at the same time across the board. I'll grab the whole logs.

I can add another server to each cluster pretty easily and quickly.

Thanks!

duncanjread avatar Sep 13 '22 14:09 duncanjread

I see in the logs that RabbitMQ was restarted several times, like here (from node 02):

2022-09-12 10:51:00.341000+01:00 [info] <0.581.0> Server startup complete; 6 plugins started.
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0>  * rabbitmq_shovel_management
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0>  * rabbitmq_shovel
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0>  * rabbitmq_delayed_message_exchange
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0>  * rabbitmq_management
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0>  * rabbitmq_web_dispatch
2022-09-12 10:51:00.341000+01:00 [info] <0.581.0>  * rabbitmq_management_agent

2022-09-12 10:51:28.364000+01:00 [info] <0.93.0> RabbitMQ is asked to stop...
2022-09-12 10:51:28.848000+01:00 [info] <0.93.0> Stopping RabbitMQ applications and their dependencies in the following order:

Notice that the stop happened right after the start. This is a restart that happened after you had updated to 3.10.7. Is there some monitoring tool that looks for errors in the log file and then auto-restarts RabbitMQ?

The exception from your application shows that it could not connect to RabbitMQ (which of course prevents publishing). This could happen during one of the restarts that is shown in the logs.

At any rate, I can adjust the code to take these sorts of timeouts into account. Since you can reproduce this issue in your environment if you could test out my fixes that would be great (I can provide an installer). Thanks.

lukebakken avatar Sep 13 '22 14:09 lukebakken

Hi, that might have been me, I rebooted the servers after I updated it. Although there is monitoring, there's nothing which will restart a service. I've been double checking the AV exclusions, but seem to be fine.

If you could that would be great, I can update pretty quickly and test as it's causing us a few issues at the moment. thanks!

duncanjread avatar Sep 13 '22 14:09 duncanjread

Aha, antivirus. You know I bet that's the cause. Can you completely disable it for a day or two to see if that fixes the issue?

lukebakken avatar Sep 13 '22 14:09 lukebakken

Again, these disk monitor errors should not affect your applications. I'm not convinced this is what is causing your application errors. Check these log entries from node 0:

2022-09-12 10:50:20.710000+01:00 [info] <0.1241.0> accepting AMQP connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671)
2022-09-12 10:50:20.710000+01:00 [info] <0.1241.0> connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671) has a client-provided name: w3wp
2022-09-12 10:50:20.710000+01:00 [info] <0.1241.0> connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671 - w3wp): user 'settlementInternal' authenticated and granted access to vhost '/'

2022-09-13 03:30:04.215000+01:00 [warning] <0.1241.0> closing AMQP connection <0.1241.0> (10.12.1.1:11501 -> 10.12.1.201:5671 - w3wp, vhost: '/', user: 'settlementInternal'):
2022-09-13 03:30:04.215000+01:00 [warning] <0.1241.0> client unexpectedly closed TCP connection

client unexpectedly closed TCP connection means that your application or a network device closed the TCP socket, not RabbitMQ. At that point in time in the logs, there are no disk monitor errors.

lukebakken avatar Sep 13 '22 14:09 lukebakken

Hi, I can try and turn the AV off on one of the servers and let it run without it, I'd need to get sign off to do that.

Those 03:00 and 03:30 events are when we recycle IIS on the client servers overnight, I think it just severs the connection.

duncanjread avatar Sep 13 '22 14:09 duncanjread

Thanks. As you've seen we've been dealing with this issue on Windows for a while now and I've never been able to reproduce it, which is why I'm guessing it may be AV related.

lukebakken avatar Sep 13 '22 15:09 lukebakken

a way to toggle it on or off would be ideal?

duncanjread avatar Sep 13 '22 15:09 duncanjread

I can see what the rest of the team thinks. The problem is that as soon as we provide that as an option, we have to maintain it, and we also have to deal with users who disable it and then complain when they fill their disks and RabbitMQ stops working.

lukebakken avatar Sep 13 '22 16:09 lukebakken

@duncanjread question - do these servers ONLY have fixed, local storage or do they also have SAN disks, network mapped drives, anything not local on them?

lukebakken avatar Sep 13 '22 17:09 lukebakken

If you could run this command on one or more servers, that would be great:

rabbitmqctl.bat eval "os_mon_sysinfo:get_disk_info()."

The output will be something like this:

["C:\\ DRIVE_FIXED 587094192128 1013310287872 587094192128\n",
 "Y:\\ DRIVE_REMOTE 0 0 0\n","Z:\\ DRIVE_REMOTE 0 0 0\n"]

lukebakken avatar Sep 13 '22 18:09 lukebakken

@duncanjread once you're done collecting information for me (which I very much appreciate), see this issue:

https://github.com/rabbitmq/rabbitmq-server/pull/1959

You could disable the disk monitor by running the following:

rabbitmqctl.bat eval "rabbit_disk_monitor:set_enabled(false)."

BUT! Be sure that it has at least read the free space once:

rabbitmqctl.bat eval "rabbit_disk_monitor:get_disk_free()."
587086852096

...or a disk alarm may be set.

lukebakken avatar Sep 13 '22 18:09 lukebakken

Disabling free disk space monitoring has risks. A lot of people do appreciate these built-in RabbitMQ metrics because they don't monitor anything externally.

michaelklishin avatar Sep 13 '22 20:09 michaelklishin

Hi, sorry I was off yesterday.

The servers only have fixed local storage.

I'll run through those commands as well and get the output back to you.

Our Security Team aren't keen on disabling the AV fully. I added some exclusions in the day before yesterday (in fact some were already in place) but it happened yesterday again. I'll keep on for a full disable of the AV at least to fully full that out.

Thanks for your help with this, we really appreciate it!

duncanjread avatar Sep 15 '22 08:09 duncanjread

Hi, I've run the commands on all four servers:

MST-INT-RMQ-01 ["C:\ DRIVE_FIXED 13530664960 63830994944 13530664960\n", "D:\ DRIVE_CDROM 0 0 0\n"]

MST-INT-RMQ-02 ["C:\ DRIVE_FIXED 11020615680 53092528128 11020615680\n", "D:\ DRIVE_CDROM 0 0 0\n"]

EDB-INT-RMQ-01 ["C:\ DRIVE_FIXED 17169444864 63830994944 17169444864\n", "D:\ DRIVE_CDROM 0 0 0\n"]

EDB-INT-RMQ-02 ["C:\ DRIVE_FIXED 10172817408 58462285824 10172817408\n", "D:\ DRIVE_CDROM 0 0 0\n"]

I've also run the disk space check commands, and will check tomorrow how those get on and report back

duncanjread avatar Sep 15 '22 11:09 duncanjread

but it happened yesterday again

Did it happen at the same time as the other events?

lukebakken avatar Sep 15 '22 13:09 lukebakken

Only one of the servers had application errors in our logs and it was the same time as usual. However, there weren't any disk check errors in the logs at that time.

one thing to note is that we only get application errors if the apps being used at the time the issue happens, if nothings going through it we wouldn't see anything going wrong.

duncanjread avatar Sep 15 '22 14:09 duncanjread

Hi, just checking over the weekend and yesterday, we don't have any errors in the logs now, apart from when clients are disconnected during our 03:00 and 03:30 IIS resets. We are still getting application errors still exactly 24 hours from the time the service was reset. Is there anything in RabbitMQ that has a 24 hour timer that happens once a day? I've added exclusions in the AV to various folders, but haven't yet got permission to turn the whole AV off on one server

duncanjread avatar Sep 20 '22 08:09 duncanjread

There are no timers that use such long intervals in RabbitMQ.

michaelklishin avatar Sep 20 '22 13:09 michaelklishin

Hey @duncanjread that's very interesting and suggests that AV is affecting the free disk space query.

As for your application errors, does RabbitMQ log anything at that time? What is the exact error your application reports? Is it the same as in this comment?

lukebakken avatar Sep 20 '22 14:09 lukebakken

Hi, I've made a bit of a break through with this, We also use a piece of SIEM software called Wazuh, that has a built in module called - Security Configuration Assessment (SCA). This runs every 12 hours and does some kind of check against the registry and then matches it up with some rules to produce a report of "bad" registry entries. It doesn't look that invasive and says it takes 3 seconds to run, but the times match up with the issues in our logs across all the servers. I think the 24 hour thing was a red herring, because I rebooted the servers then which would have reset the Wazuh timer as well as restarting RabbitMQ I can try and reach out to Wazuh and see why the scan would be causing these problems.

I've disabled it now and will monitor across the next few days and let you know how it goes!

Thanks!

duncanjread avatar Sep 23 '22 07:09 duncanjread

Hey @duncanjread - care to provide an update? I'm assuming you haven't seen this issue again.

lukebakken avatar Sep 30 '22 15:09 lukebakken

Hi, it is still happening, I've shifted my attention onto what the Wazuh checks are doing. I've opened a ticket with them but have yet to run through a few things they've suggested.

duncanjread avatar Sep 30 '22 15:09 duncanjread

Thanks for the update. I'm assuming it's still happening at the same time as the "Wazuh" SCA check.

lukebakken avatar Oct 03 '22 16:10 lukebakken

thats the strange thing, I've disabled the checks and it's still happening at the same times as before, when those SCA checks used to run. Still investigating at the moment

duncanjread avatar Oct 05 '22 13:10 duncanjread