resource-agents
Cause for Filesystem monitor timeout
I am using Pacemaker to manage a Postgres cluster with 2 servers and a shared storage disk. The disks are mounted on the master/active node using the Filesystem resource agent.
The Filesystem monitor, with default settings (interval=20s timeout=40s), timed out and caused Postgres to fail over.
I checked disk usage, memory usage, network IO, disk IO, etc., and everything looked normal before the monitor timeout.
I have been scratching my head trying to find out what caused that timeout. The cluster logs only mention the timeout, not the reason for it. Is there a way to investigate this issue further? Could it have been an issue with the DELL storage?
If you're able to reproduce it, you can set trace_ra=1 for the resource and check the log files the next time it happens. It will spam the log files, so you probably don't want to run it like that for a long time.
The problem is that it is difficult to reproduce. It has happened only twice in the last 2 years. The only thing common to both occurrences is that they were preceded by heavy Postgres operations, such as DROP and COPY commands on very large tables. However, these operations finished a few minutes before the actual error (the gap is as much as 5 minutes), and the Postgres logs show no sign of Postgres stopping right after the operations.
I have tried reproducing it, but so far without success, and that worries me. I'm not sure whether I should ask DELL to check the storage to see if the disks became unresponsive at that time.
I guess you could do that if they do some additional logging in their enclosures or similar.
I heard there's an improvement coming to Pacemaker that will make these timeout issues much easier to debug (logs from the last X seconds before the timeout will be collected), but it will probably take 1-2 years before it's released.
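Until something like that is available, one way to collect evidence on whether the storage itself stalls is to run a small latency probe against the mount and log slow responses with timestamps, so they can be compared with the cluster logs after the next timeout. The following is only a rough sketch, not something taken from resource-agents or Pacemaker: the names probe_once and run, the thresholds, and the probe file prefix are all made up for illustration, and it assumes a Unix system where the user may create files on the mount. Note that if the storage hangs completely, the probe will block inside the system call and will only log once the call returns.

```python
import os
import tempfile
import time


def probe_once(mount_point):
    """Time a statvfs call and a small fsync'd write under mount_point.

    Returns (statvfs_seconds, write_seconds). Hypothetical helper.
    """
    t0 = time.monotonic()
    os.statvfs(mount_point)  # metadata query against the filesystem
    t1 = time.monotonic()

    # Create a small temporary file on the mount and force it to disk.
    fd, path = tempfile.mkstemp(prefix=".latency_probe_", dir=mount_point)
    try:
        os.write(fd, b"probe\n")
        os.fsync(fd)
    finally:
        os.close(fd)
        os.unlink(path)  # clean up the probe file
    t2 = time.monotonic()
    return t1 - t0, t2 - t1


def run(mount_point, interval=5.0, warn_after=1.0, iterations=None, log=print):
    """Probe repeatedly and log any probe slower than warn_after seconds.

    iterations=None means loop forever. Returns the number of probes done.
    """
    count = 0
    while iterations is None or count < iterations:
        stat_s, write_s = probe_once(mount_point)
        if max(stat_s, write_s) >= warn_after:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log(f"{stamp} SLOW statvfs={stat_s:.3f}s write={write_s:.3f}s")
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
    return count


# Example (the path is a placeholder for the actual mount point):
# run("/path/to/mount", interval=5.0, warn_after=1.0)
```

The probe interval is kept well below the 20s monitor interval so that a stall lasting long enough to hit the 40s timeout would likely show up as at least one slow entry. Running it on the active node adds a small amount of IO of its own, so it is probably best treated as a temporary diagnostic.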