resource-agents

Cause for FileSystem monitor timeout

Open bhola05 opened this issue 4 years ago • 3 comments

I am using Pacemaker to manage a Postgres cluster with 2 servers and a shared storage disk. The disks are mounted on the master/active node using the Filesystem resource agent.

The Filesystem monitor with default settings (interval=20s timeout=40s) timed out and caused Postgres to fail over.

I checked disk usage, memory usage, network I/O, disk I/O, etc., and everything looked normal before the monitor timeout.

I have been scratching my head trying to find out what caused that timeout. The cluster logs only mention the timeout, not its cause. Is there a way to investigate this further? Could it have been an issue with the DELL storage?

bhola05 Sep 28 '21 01:09

If you're able to reproduce it, you can set trace_ra=1 for the resource and check the log files the next time it happens. It will spam the log files, so you probably don't want to run it like that for a long time.
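For anyone wanting to try this, here is a rough sketch of what setting trace_ra could look like with pcs. It is not taken from this thread: the resource ID `pg_fs` is hypothetical, the helper function names are made up for illustration, and the trace directory is only the location resource-agents commonly uses, so it may differ on your distribution.

```shell
#!/bin/sh
# Sketch only. RSC is a hypothetical resource ID; replace it with the ID of
# your Filesystem resource as shown by `pcs status`.
RSC="pg_fs"

# Assumed location of the trace files; it can differ between distributions
# and resource-agents versions.
TRACE_DIR="/var/lib/heartbeat/trace_ra"

# Turn on tracing for the resource by setting trace_ra=1.
enable_trace() {
    pcs resource update "$RSC" trace_ra=1
}

# Turn tracing off again by clearing the attribute (an empty value removes it).
disable_trace() {
    pcs resource update "$RSC" trace_ra=
}

# List the most recent trace files for the Filesystem agent, newest first.
recent_traces() {
    ls -lt "$TRACE_DIR/Filesystem/" 2>/dev/null | head -n 20
}
```

After sourcing the file, you would call `enable_trace`, wait for the timeout to occur, inspect the output of `recent_traces`, and then call `disable_trace`. Changing a resource parameter may cause Pacemaker to restart the resource, so it is probably worth trying this on a test cluster or in a maintenance window first.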

oalbrigt Sep 29 '21 08:09

The problem is that it is difficult to reproduce. It has happened only twice in the last 2 years. The only common factor is that both times it was preceded by a heavy Postgres operation, such as DROP and COPY commands on super huge tables. However, these operations finished a few minutes before the actual error (the gap is as much as 5 minutes), and the Postgres logs show no sign of Postgres stopping right after the operations.

I have tried to reproduce it, so far without success, and that worries me. I am not sure whether I should ask DELL to check the storage to see if the disks became unresponsive at that time.

bhola05 Sep 29 '21 08:09

I guess you could do that if they do some additional logging in their enclosures or similar.

I heard there's an improvement coming to Pacemaker that should make these timeout issues much easier to debug in the future (logs from the last X seconds before the timeout will be collected), but it will probably take 1-2 years before it's released.

oalbrigt Oct 01 '21 08:10