nfsserver monitor reports healthy service, but clients are unable to mount

Open pvaldria opened this issue 3 years ago • 7 comments

The NFS client suddenly throws the errors below and can no longer mount the filesystem, even after repeated attempts (including rebooting the client node).

May 30 10:37:42 client-2 kernel: nfs: server 10.0.1.210 not responding, timed out
May 30 10:37:47 client-2 kernel: nfs: server 10.0.1.210 not responding, timed out
May 30 10:37:52 client-2 kernel: nfs: server 10.0.1.210 not responding, timed out

Mounting from the client node fails with two types of errors:

Error1: portmap query failed: RPC: Remote system error - Connection timed out

[opc@client-2 ~]$ sudo mount -v /mnt/nfs
mount.nfs: trying text-based options 'vers=3,bg,timeo=100,ac,actimeo=120,nocto,rsize=1048576,wsize=1048576,nolock,local_lock=none,proto=tcp,sec=sys,addr=10.0.1.210'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 10.0.1.210 prog 100003 vers 3 prot TCP port 2049
mount.nfs: portmap query failed: RPC: Remote system error - Connection timed out
mount.nfs: backgrounding "10.0.1.210:/mnt/nfsshare/exports"
mount.nfs: mount options: "rw,noatime,nodiratime,vers=3,bg,timeo=100,ac,actimeo=120,nocto,rsize=1048576,wsize=1048576,nolock,local_lock=none,proto=tcp,sec=sys,_netdev"

Error2: portmap query failed: RPC: Timed out

[opc@client-2 ~]$ sudo mount -v /mnt/nfs
mount.nfs: trying text-based options 'vers=3,bg,timeo=100,ac,actimeo=120,nocto,rsize=1048576,wsize=1048576,nolock,local_lock=none,proto=tcp,sec=sys,addr=10.0.1.210'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Timed out
mount.nfs: backgrounding "10.0.1.210:/mnt/nfsshare/exports"
mount.nfs: mount options: "rw,noatime,nodiratime,vers=3,bg,timeo=100,ac,actimeo=120,nocto,rsize=1048576,wsize=1048576,nolock,local_lock=none,proto=tcp,sec=sys,_netdev"

Troubleshooting: On the file server node, NFS service monitoring shows the service is fine; in fact, if we run an NFS client on the file server itself, it mounts correctly. So we investigated why the NFS service monitor says "all good" while the client cannot mount, and found that the check below (used by the NFS service monitor) works locally on the server but fails from the client node.

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/nfsserver#L726

[opc@storage-server-1 ~]$ rpcinfo -t 10.0.2.233 100024
 program 100024 version 1 ready and waiting
  
 [opc@client-2 ~]$ rpcinfo -t 10.0.1.210 100024
 rpcinfo: RPC: Port mapper failure - Timed out
 program 100024 is not available

It seems the monitoring needs additional checks to ensure NFS clients can connect and mount, and should report a monitor failure if they cannot.

Is there a workaround anyone can suggest so the I/O error doesn't happen?

pvaldria avatar May 31 '21 15:05 pvaldria

PCS reports a healthy status:

[root@storage-server-1 corosync]# pcs status
Cluster name: nfs_cluster
Stack: corosync
Current DC: storage-server-1 (version 1.1.23-1.0.1.el7-9acf116022) - partition with quorum
Last updated: Mon May 31 03:35:06 2021
Last change: Mon May 31 03:08:32 2021 by root via cibadmin on storage-server-1

2 nodes configured
6 resource instances configured

Online: [ storage-server-1 storage-server-2 ]

Full list of resources:

 sbd_fencing_storage-server-1	(stonith:fence_sbd):	Started storage-server-2
 sbd_fencing_storage-server-2	(stonith:fence_sbd):	Started storage-server-1
 Resource Group: nfsgroup
     disk	(ocf::heartbeat:LVM-activate):	Started storage-server-1
     nfsshare	(ocf::heartbeat:Filesystem):	Started storage-server-1
     nfs-daemon	(ocf::heartbeat:nfsserver):	Started storage-server-1
     nfs_VIP	(ocf::heartbeat:IPaddr2):	Started storage-server-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
  sbd: active/enabled
[root@storage-server-1 corosync]#

pvaldria avatar May 31 '21 15:05 pvaldria

Can you run pcs resource debug-monitor --full nfs-daemon and paste the output here?

oalbrigt avatar Jun 02 '21 08:06 oalbrigt

Attached is the output.

[root@storage-server-1 ~]# pcs resource debug-monitor --full nfs-daemon > pcs_resource_debug-monitor--full_nfs-daemon.output.txt

pvaldria avatar Jun 02 '21 09:06 pvaldria

Everything seems fine there, so the question is why it stops answering requests to the IP.

Is the server's IP brought down and up, or maybe moved as a VIP or something like that? Or maybe it is blocked by the firewall, if you have fail2ban or something similar running?

oalbrigt avatar Jun 02 '21 11:06 oalbrigt

We had a page allocation error in the kernel, and we have seen this behavior since then. The firewall is fine: it was working for several days and then stopped working, and no changes were made to the network or firewall (in fact, the firewall is stopped). The VIP is assigned to the server and working (tested using telnet to port 22). Also, locally on the file server, we mounted the NFS filesystem using the VIP.

As a workaround, should we add an external check on one client node, using some monitoring tool, to test whether the command below works

[opc@client-2 ~]$ rpcinfo -t 10.0.1.210 100024

and if not, trigger

pcs resource move nfsgroup

pvaldria avatar Jun 02 '21 12:06 pvaldria

That might be a way to do it. Run pcs resource clear nfsgroup as well, so it can move back, as I think the issue wouldn't be there if you move it back later.

oalbrigt avatar Jun 03 '21 12:06 oalbrigt
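
The workaround discussed above (check `rpcinfo -t` from outside the cluster, then move and clear the group) could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the VIP, program number, and group name are taken from the outputs above, while `CLUSTER_NODE`, the use of ssh to reach a cluster node, the timeout, and the delay before `pcs resource clear` are assumptions to adapt.

```shell
#!/bin/sh
# Hypothetical external watchdog, intended to run on a host outside the cluster.
# Values from the thread: VIP, RPC program 100024, group "nfsgroup".
# Everything else (CLUSTER_NODE, ssh access, timeouts) is an assumption.

VIP="${VIP:-10.0.1.210}"
RPC_PROG="${RPC_PROG:-100024}"
RESOURCE_GROUP="${RESOURCE_GROUP:-nfsgroup}"
CLUSTER_NODE="${CLUSTER_NODE:-storage-server-1}"
CHECK_TIMEOUT="${CHECK_TIMEOUT:-15}"   # seconds to wait for rpcinfo
CLEAR_DELAY="${CLEAR_DELAY:-60}"       # assumed pause before clearing the constraint

# Succeeds only if the RPC program answers over TCP through the VIP,
# i.e. the same check that failed from client-2 in this issue.
remote_rpc_ok() {
    timeout "$CHECK_TIMEOUT" rpcinfo -t "$VIP" "$RPC_PROG" >/dev/null 2>&1
}

# pcs has to run on a cluster node; ssh is one assumed way to get there.
run_on_cluster() {
    ssh "root@$CLUSTER_NODE" "$@"
}

check_and_failover() {
    if remote_rpc_ok; then
        echo "rpc check passed"
        return 0
    fi
    echo "rpc check failed, moving $RESOURCE_GROUP"
    run_on_cluster pcs resource move "$RESOURCE_GROUP" || return 1
    # Give the move time to finish, then drop the constraint it created
    # so the group is allowed back on the original node later.
    sleep "$CLEAR_DELAY"
    run_on_cluster pcs resource clear "$RESOURCE_GROUP"
}

# Only act when explicitly asked, e.g. from cron: watchdog.sh --run
if [ "${1:-}" = "--run" ]; then
    check_and_failover
fi
```

Run from cron or a monitoring tool on a client, this would fail the group over when the remote check fails. If the node named in `CLUSTER_NODE` is itself unreachable, the ssh step fails and the function returns non-zero, so a real setup would likely need to try both nodes.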