sysinfo
`refresh_disks_list` may get stuck if statvfs hangs on some NFS devices
We found that the statfs syscall may hang on some NFS devices with the following kernel stack when we try to get the list of disks via refresh_disks_list:
[<ffffffffc0939f04>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
[<ffffffffc093b6f4>] __rpc_execute+0x154/0x420 [sunrpc]
[<ffffffffc093cf78>] rpc_execute+0x68/0xc0 [sunrpc]
[<ffffffffc092c786>] rpc_run_task+0xf6/0x150 [sunrpc]
[<ffffffffc09e56d3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4]
[<ffffffffc09e65c8>] _nfs4_proc_statfs+0xc8/0xf0 [nfsv4]
[<ffffffffc09f1d66>] nfs4_proc_statfs+0x66/0xa0 [nfsv4]
[<ffffffffc09b024e>] nfs_statfs+0x6e/0x190 [nfs]
[<ffffffff93a76de7>] statfs_by_dentry+0xa7/0x140
[<ffffffff93a76e9b>] vfs_statfs+0x1b/0xc0
[<ffffffff93a77175>] user_statfs+0x55/0xa0
[<ffffffff93a771e7>] SYSC_statfs+0x27/0x60
[<ffffffff93a773ee>] SyS_statfs+0xe/0x10
[<ffffffff93f74ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
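For context on the sysinfo side, the hang is hit by a plain disk refresh. A minimal sketch of the call site, assuming the SystemExt-based API that refresh_disks_list belonged to in the versions discussed here:

use sysinfo::{System, SystemExt};

fn main() {
    let mut sys = System::new();
    // refresh_disks_list() walks the mounted filesystems and ends up calling
    // statfs()/statvfs() on each mount point; on an unresponsive hard-mounted
    // NFS share that syscall never returns, so this call never returns either.
    sys.refresh_disks_list();
    println!("disk list refreshed");
}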
These problems are also reported in https://github.com/prometheus/node_exporter/issues/868.
So maybe in refresh_disks_list we can avoid calling statfs if we find that the device is an NFS mount by reading /proc/mounts.
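A rough sketch of that idea (my assumption about how the filtering could look, not the actual sysinfo internals): parse /proc/mounts and only keep the mount points whose filesystem type is not NFS before calling statfs on them.

use std::fs;

// Hypothetical helper: mount points that are safe to query with statfs,
// i.e. everything whose filesystem type is not NFS.
fn mount_points_without_nfs() -> Vec<String> {
    let mounts = fs::read_to_string("/proc/mounts").unwrap_or_default();
    mounts
        .lines()
        .filter_map(|line| {
            // /proc/mounts format: device mount_point fs_type options dump pass
            let mut fields = line.split_whitespace();
            let _device = fields.next()?;
            let mount_point = fields.next()?;
            let fs_type = fields.next()?;
            if fs_type == "nfs" || fs_type == "nfs4" {
                None // skip: statfs on a dead NFS mount can hang forever
            } else {
                Some(mount_point.to_string())
            }
        })
        .collect()
}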
Thanks for the report. Considering I don't have an NFS device, I'll need someone else to write the fix.
I have an NFS server installed on Ubuntu and want to fix this issue. Does that mean that we should remove NFS mounts from the return values of get_all_disks_inner?
Running cargo run --example simple does actually list the following NFS mount. It hangs if the NFS service is stopped via systemctl:
Disk("/dev/vda1")[FS: ['e', 'x', 't', '4']][Type: HDD][removable: no] mounted on "/": 21229899776/52776349696 B
Disk("172.16.0.7:/home/ubuntu/nfsroot")[FS: ['n', 'f', 's', '4']][Type: Unknown(-1)][removable: no] mounted on "/opt/172-16-0-7": 46360690688/52722401280 B
Fixed by #876.
Am I right that this only happened when the NFS mount had an error? In general I find it useful to get stats from NFS mounts. How about adding a timeout function and bringing NFS back?
No idea, I don't have an NFS setup to test with. The timeout would be quite small if we don't want it to hang for too long, so I'm not sure it's a good idea...
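To make the trade-off concrete, such a timeout would have to look roughly like this (just a sketch, not an existing sysinfo API): run the blocking call on a worker thread and stop waiting after a deadline. Note that the worker thread itself stays stuck inside the syscall; the timeout only prevents the caller from hanging.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical helper: run a blocking operation, but only wait `timeout`
// for its result. If the operation hangs (e.g. statvfs on a dead hard
// NFS mount), the worker thread is leaked and None is returned.
fn with_timeout<T, F>(timeout: Duration, f: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(f());
    });
    rx.recv_timeout(timeout).ok()
}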
I made some quick tests, but it would be nice if @CalvinNeo and @kayoch1n could tell in which situations they got the error where refresh_disks_list hangs on an NFS mount.
First I found this issue on Ansible, where they had a similar problem, and I also found a quick test solution on a Red Hat page.
To reproduce, I follow these steps on a current Fedora Linux:
- mkdir /opt/test /mnt/nfs
- echo "/opt/test *(rw,no_root_squash)" >> /etc/exports
- systemctl start nfs-server.service
- mount -t nfs -o hard,timeo=5 127.0.0.1:/opt/test /mnt/nfs
- create a minimal C program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>

int main()
{
    struct statvfs sv;
    int rc;

    /* blocks indefinitely when the hard-mounted NFS server is unreachable */
    rc = statvfs("/mnt/nfs", &sv);
    if (rc < 0)
    {
        perror("statvfs");
        exit(1);
    }
    printf("ok...\n");
    return 0;
}
- compile with gcc stat.c -o stat
- verify the mount with ./stat
- systemctl stop nfs-server.service
- run ./stat again -> it will hang
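For anyone who prefers to reproduce this from Rust instead of C, roughly the same check can be done through the libc crate (assuming libc as a dependency); it blocks in exactly the same way:

use std::ffi::CString;
use std::mem::MaybeUninit;

fn main() {
    let path = CString::new("/mnt/nfs").unwrap();
    let mut sv = MaybeUninit::<libc::statvfs>::uninit();
    // Blocks indefinitely when the hard-mounted NFS server is down,
    // exactly like the C program above.
    let rc = unsafe { libc::statvfs(path.as_ptr(), sv.as_mut_ptr()) };
    if rc < 0 {
        eprintln!("statvfs failed: {}", std::io::Error::last_os_error());
        std::process::exit(1);
    }
    println!("ok...");
}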
If I mount the NFS share with:
mount -t nfs -o soft,timeo=5 127.0.0.1:/opt/test /mnt/nfs
and stop the NFS service, ./stat will only produce a "statvfs: Input/output error" message and exit by itself without hanging.
For sysinfo I can think of two options, or a combination of both:
- create a timeout function only for NFS devices
- add a feature flag: disable NFS by default, but give the user the option to enable it. A warning in the documentation can give a hint.
Maybe a combination of both could be useful too. I would not wonder if this issue can happen with other protocols too, like CIFS.
In my personal case option 2 would be enough. If it only hangs when the mount is faulty, I have no problem with it, because that affects the whole system anyway. I use sysinfo in VMs where NFS mounts are critically needed, and when a mount is faulty all my programs in the VM have a problem.
Edit:
In sysinfo I can also confirm that the simple example hangs only when NFS is mounted with the hard mount option, not with the soft option.
Edit 2:
The problem I have described, when the NFS server is no longer accessible, also occurs with a CIFS mount.
The only difference is that the default mount option for CIFS is soft, and for NFS it is hard.
I would therefore recommend adding a feature switch, e.g. netdevs, and depending on whether it is enabled, excluding or including protocols such as CIFS and NFS.
I will create a pull request to show you what I have in mind with the feature flag.
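In the meantime, here is a rough sketch of the shape I have in mind (the netdevs name and these helper functions are only my suggestion, not existing sysinfo code):

// In Cargo.toml the feature would be declared as:
//   [features]
//   netdevs = []
// and the disk-listing code could skip network filesystems unless the
// user opts in:

fn is_network_fs(fs_type: &str) -> bool {
    matches!(fs_type, "nfs" | "nfs4" | "cifs" | "smb3")
}

fn should_include_disk(fs_type: &str) -> bool {
    // With the netdevs feature enabled, network filesystems are listed as
    // before; without it they are skipped to avoid hangs on dead mounts.
    cfg!(feature = "netdevs") || !is_network_fs(fs_type)
}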