
`refresh_disks_list` may get stuck if statvfs hangs on some NFS device

Open CalvinNeo opened this issue 2 years ago • 1 comments

We found that the statfs syscall may hang on some NFS devices, with the following stack, when refresh_disks_list tries to get the list of disks.

[<ffffffffc0939f04>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
[<ffffffffc093b6f4>] __rpc_execute+0x154/0x420 [sunrpc]
[<ffffffffc093cf78>] rpc_execute+0x68/0xc0 [sunrpc]
[<ffffffffc092c786>] rpc_run_task+0xf6/0x150 [sunrpc]
[<ffffffffc09e56d3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4]
[<ffffffffc09e65c8>] _nfs4_proc_statfs+0xc8/0xf0 [nfsv4]
[<ffffffffc09f1d66>] nfs4_proc_statfs+0x66/0xa0 [nfsv4]
[<ffffffffc09b024e>] nfs_statfs+0x6e/0x190 [nfs]
[<ffffffff93a76de7>] statfs_by_dentry+0xa7/0x140
[<ffffffff93a76e9b>] vfs_statfs+0x1b/0xc0
[<ffffffff93a77175>] user_statfs+0x55/0xa0
[<ffffffff93a771e7>] SYSC_statfs+0x27/0x60
[<ffffffff93a773ee>] SyS_statfs+0xe/0x10
[<ffffffff93f74ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

These problems are also reported in https://github.com/prometheus/node_exporter/issues/868.

So maybe refresh_disks_list could avoid calling statfs on a device if reading /proc/mounts shows that the device is an NFS mount.
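
As a rough sketch of that idea (not the actual sysinfo code), the text of /proc/mounts could be scanned for NFS entries before any statfs call is made. The names `is_nfs_type` and `nfs_mount_points` are hypothetical, and treating only `nfs` and `nfs4` as NFS types is an assumption.

```rust
/// Filesystem types treated as NFS in this sketch (an assumption, not an
/// exhaustive list).
fn is_nfs_type(fs_type: &str) -> bool {
    matches!(fs_type, "nfs" | "nfs4")
}

/// Takes the text of /proc/mounts and returns the mount points whose
/// filesystem type is NFS. Each line has the whitespace-separated fields:
/// device, mount point, filesystem type, options, dump, pass.
/// Escapes such as `\040` (space) in mount points are left undecoded here.
fn nfs_mount_points(mounts: &str) -> Vec<String> {
    mounts
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let _device = fields.next()?;
            let mount_point = fields.next()?;
            let fs_type = fields.next()?;
            // Keep only the NFS entries; malformed lines are skipped by `?`.
            is_nfs_type(fs_type).then(|| mount_point.to_string())
        })
        .collect()
}
```

The input could come from `std::fs::read_to_string("/proc/mounts")`; the disk refresh would then skip statfs for the returned mount points.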

CalvinNeo avatar Sep 22 '22 07:09 CalvinNeo

Thanks for the report. Considering I don't have an NFS device, I'll need someone else to write the fix.

GuillaumeGomez avatar Sep 22 '22 10:09 GuillaumeGomez

I have an NFS server installed on Ubuntu and want to fix this issue. Does that mean that we should remove NFS mounts from the return values of get_all_disks_inner?

Running cargo run --example simple does list the following NFS mount. It hangs if the NFS service is stopped via systemctl.

Disk("/dev/vda1")[FS: ['e', 'x', 't', '4']][Type: HDD][removable: no] mounted on "/": 21229899776/52776349696 B
Disk("172.16.0.7:/home/ubuntu/nfsroot")[FS: ['n', 'f', 's', '4']][Type: Unknown(-1)][removable: no] mounted on "/opt/172-16-0-7": 46360690688/52722401280 B

kayoch1n avatar Nov 23 '22 08:11 kayoch1n

Fixed by #876.

GuillaumeGomez avatar Nov 23 '22 12:11 GuillaumeGomez

Am I right that this only happens when the NFS mount has an error? In general I find it useful to get stats from NFS mounts. How about adding a timeout function and bringing NFS back?

jb-alvarado avatar Nov 16 '23 20:11 jb-alvarado

No idea, I don't have an NFS to test. The timeout would be quite small if we don't want it to hang for too long, so not sure if it's a good idea...

GuillaumeGomez avatar Nov 16 '23 21:11 GuillaumeGomez

I ran some quick tests, but it would be nice if @CalvinNeo and @kayoch1n could tell us in which situations refresh_disks_list hung on an NFS mount.

First, I found an issue in Ansible where they had a similar problem. I also found a quick test on a Red Hat page.

To reproduce, I followed these steps on a current Fedora Linux:

  1. mkdir /opt/test /mnt/nfs
  2. echo "/opt/test *(rw,no_root_squash)" >> /etc/exports
  3. systemctl start nfs-server.service
  4. mount -t nfs -o hard,timeo=5 127.0.0.1:/opt/test /mnt/nfs
  5. create a minimal C program (stat.c):
#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>

int main()
{
    struct statvfs sv;
    int rc;

    rc = statvfs("/mnt/nfs", &sv);

    if (rc < 0)
    {
        perror("statvfs");
        exit(1);
    }
    printf("ok...\n");
    return 0;
}
  6. compile with gcc stat.c -o stat
  7. verify mount with ./stat
  8. systemctl stop nfs-server.service
  9. run ./stat again -> it will hang

If I mount the nfs share with:

mount -t nfs -o soft,timeo=5 127.0.0.1:/opt/test /mnt/nfs

and stop the NFS service, ./stat only reports statvfs: Input/output error and exits without hanging.

For sysinfo I can think of two options, or a combination of both:

  1. create a timeout function that is used only for NFS devices
  2. create a feature that disables NFS by default but gives the user the option to enable it. A warning in the documentation can point out the risk.
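
Option 1 could be sketched roughly as follows, using only the standard library. `run_with_timeout` is a hypothetical helper and not part of sysinfo; the closure would wrap the blocking statvfs call. One limitation to keep in mind: the helper thread is not cancelled, so on a hard mount it may stay blocked in the background after the caller has given up.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `op` on a helper thread and waits at most `timeout` for its result.
/// Returns `None` if the result did not arrive in time (or if `op` panicked).
/// The helper thread is left running; it is not interrupted.
fn run_with_timeout<T, F>(op: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the receiver is gone and the send
        // fails; that error is deliberately ignored.
        let _ = tx.send(op());
    });
    rx.recv_timeout(timeout).ok()
}
```

A caller could treat `None` as "stats unavailable for this mount" and carry on with the remaining disks.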

Maybe a combination of both could be useful too. I would not be surprised if this issue also occurs with other protocols, such as CIFS.

In my case, option 2 would be enough. If it hangs only when the mount is faulty, I have no problem with that, because a faulty mount affects the whole system anyway. I use sysinfo in VMs where NFS mounts are critically needed, and when a mount is faulty, all my programs in the VM have a problem.

Edit:

I can also confirm that sysinfo's simple example hangs only when the NFS share is mounted with the hard option, not with the soft option.

Edit 2:

The problem I have described, when the NFS server is no longer accessible, also occurs with a CIFS mount.

The only difference is that the default mount option for CIFS is soft, and for NFS it is hard.
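
Since the hang depends on the hard option, one could in principle check the mount options before calling statvfs. The sketch below assumes that the kernel reports `hard` or `soft` in the options field of /proc/mounts for these filesystems; `has_hard_option` and `hard_network_mounts` are hypothetical names, and the list of network filesystem types is illustrative only.

```rust
/// Returns true if a comma-separated mount option string contains `hard`
/// as a complete option (not merely as a substring of another option).
fn has_hard_option(options: &str) -> bool {
    options.split(',').any(|opt| opt == "hard")
}

/// Takes the text of /proc/mounts and returns the mount points of network
/// filesystems that are mounted with the `hard` option.
fn hard_network_mounts(mounts: &str) -> Vec<String> {
    mounts
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let _device = fields.next()?;
            let mount_point = fields.next()?;
            let fs_type = fields.next()?;
            let options = fields.next()?;
            // Illustrative list of network filesystem types (assumption).
            let is_network = matches!(fs_type, "nfs" | "nfs4" | "cifs");
            (is_network && has_hard_option(options)).then(|| mount_point.to_string())
        })
        .collect()
}
```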

I would therefore recommend adding a feature switch, e.g. netdevs, and including or excluding protocols such as CIFS and NFS depending on whether it is enabled.

I will create a pull request to show you what I have in mind with the feature flag.
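
A minimal sketch of how such a switch might look, assuming a Cargo feature literally named `netdevs` (the name is only the example from this comment, and the feature does not exist in sysinfo as far as this thread shows). `is_network_fs`, `should_list` and `should_list_default` are hypothetical helpers.

```rust
/// True when the hypothetical `netdevs` Cargo feature is enabled at compile
/// time; false otherwise, so network mounts are skipped by default.
const INCLUDE_NETDEVS: bool = cfg!(feature = "netdevs");

/// Filesystem types this sketch treats as network mounts (illustrative).
fn is_network_fs(fs_type: &str) -> bool {
    matches!(fs_type, "nfs" | "nfs4" | "cifs")
}

/// Decides whether a mount of the given type should be listed.
/// Local filesystems are always listed; network ones only when allowed.
fn should_list(fs_type: &str, include_netdevs: bool) -> bool {
    include_netdevs || !is_network_fs(fs_type)
}

/// Same decision, driven by the compile-time feature switch.
fn should_list_default(fs_type: &str) -> bool {
    should_list(fs_type, INCLUDE_NETDEVS)
}
```

Keeping the decision in a plain function with a boolean parameter makes it testable without rebuilding with different features.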

jb-alvarado avatar Nov 17 '23 07:11 jb-alvarado