Use NFS hard mount instead of soft mount to avoid RO VMs (or offer option)?
See the proposal and testimony from a user on the forum: https://xcp-ng.org/forum/post/21940
We may also consider changing the default timeout options.
I think it might be interesting to ask the Citrix storage guys. We should create an XSO to get their opinion and maybe the reasoning behind their current choice.
Perhaps I can suggest always using a unique fsid= export option for each exported path on the NFS server. This ought to be documented in the docs and wiki :)
The thing is that if NFS is served by a cluster (for example Pacemaker), a failover event will work flawlessly if NFS is mounted with the 'hard' option on the XenServer. Otherwise, VMs will experience a (short) disk loss and the Linux ones will, by default, end up with a read-only filesystem. The simple workaround is to edit /opt/xensource/sm/nfs.py and change the line:
options = "soft,proto=%s,vers=%s" % (
to:
options = "hard,proto=%s,vers=%s" % (
This is an ugly workaround, but it allows VMs to live, which is more important than the beauty of the hack.
I believe it is possible to add custom NFS mount options when adding a new SR through XOA. Have you tested this?
Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.
Yes, that's why it would require a XAPI modification for this. That's doable :)
I think we should keep the default behavior, but allow an override: this will let the people who want to test it do so.
In theory, we should:
- add an extra parameter to the NFS SR create
- add an extra variable in the NFS driver code (keeping soft by default when no extra param is given, hard when it is)
That should be it (a rough sketch follows below). @ezaton do you want to contribute?
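For illustration only, a rough sketch of what the driver-side change could look like. The parameter name (nfs-transport), the dconf dictionary and the exact place where the option string is built are assumptions to check against the real sm code, not the actual implementation:

# Hypothetical sketch: pick the NFS mount mode from an SR-create parameter,
# falling back to today's default ("soft") when nothing is passed.
DEFAULT_TRANSPORT = 'soft'

def get_nfs_transport(dconf):
    # dconf is assumed to be the device-config dictionary given at SR create
    mode = dconf.get('nfs-transport', DEFAULT_TRANSPORT)
    if mode not in ('soft', 'hard'):
        raise ValueError("nfs-transport must be 'soft' or 'hard', got %r" % mode)
    return mode

# ...then, where the mount options are assembled (based on the line quoted above):
# options = "%s,proto=%s,vers=%s" % (get_nfs_transport(dconf), proto, vers)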
I am not sure I have the Python know-how, but I will make an effort during the next few days. This is a major thing I have been carrying with me since XS version 6.1 or so - those were my early NFS cluster days. Nowadays I have so many NFS clusters in so many locations. So - yeah, I want to contribute. I will see if I can actually do it.
Thanks!
Okay so IIRC, you might indeed check how the NFS version is passed down to the driver (from XAPI to the NFS Python file). It's a good start to understand how it works, and then do the same for the hard/soft mount thing :)
edit: @Wescoeur knows a lot about SMAPIv1, so he might assist you on this (if you have questions).
Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.
I thought subsequent mount options override previous mount options. This is how we can add nfsvers=4.1 for example, isn't it? I haven't tried, but it might be worth trying.
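For example, assuming mount's usual last-option-wins handling of duplicate options: if the driver's default soft and a user-supplied hard both end up in the option string, the effective options are soft,hard and the mount behaves as hard; overriding the NFS version with nfsvers=4.1 would rely on the same mechanism.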
This is a quote from 'man 5 nfs':
soft / hard: Determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.
NB: A so-called "soft" timeout can cause silent data corruption in certain cases. As such, use the soft option only when client responsiveness is more important than data integrity. Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.
Look at the comment. I believe that hard should be the default - at least for regular SR. ISO-SR is another thing. I have just forked the code. I will see if I can modify it without exceeding my talent :-)
Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.
Maybe increasing that value could be a less intrusive option and could be supplied without being ignored?
These are meant to mitigate (some of) the problems caused by soft mount, instead of just mounting 'hard'. Look - when it's your virtual machine there, you do not want a momentary network disruption to kill your VMs. The safety of your virtual machines is the key requirement. Soft mount just doesn't provide it.
I have edited nfs.py and NFSSR.py and created a pull request here: https://github.com/xapi-project/sm/pull/485
Thanks. I think you need to add context and explain why hard would be better than soft and what tests you did to have a chance of getting it merged.
I will add all these details in the pull request.
Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.
I just tried in XOA to create a new SR with the "hard" mount option. Seems to stick when looking at the output from mount:
# mount
example.com:/media/nfs_ssd/3ec42c2f-552c-222f-3d46-4f98613fe2e1 on /run/sr-mount/3ec42c2f-552c-222f-3d46-4f98613fe2e1 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.1.10,local_lock=none,addr=192.168.1.2)
@Gatak if that's the case, it's even easier :D
Can you double-check that it's the correct hard behavior?
This is a change of behaviour from what I remember; however, I have just tested it, and it is true. It is consistent across reboots and across detach/reattach, so my patch is (partially) redundant. Still, I believe that 'hard' should be the default for VM NFS SRs.
I believe that 'hard' should be the default for VM NFS SRs.
Yes, based on the documentation provided it does seem the safest option.
Yes, but you can't decide to do this change for everyone without a consensus. We'll talk more with Citrix team to understand their original choice.
What we can do in XO: expose a menu that selects "hard" by default. This will encourage hard by default without changing it in the platform directly.
Does this sound reasonable to you?
Yes, but you can't decide to do this change for everyone without a consensus. We'll talk more with Citrix team to understand their original choice.
Sounds good. Many use soft because you could not abort/unmount a hard-mounted NFS share. But that may be an old truth.
What we can do in XO: expose a menu that selects "hard" by default. This will encourage hard by default without changing it in the platform directly.
I think it is important to mention that the NFS export should use the fsid option* to create a stable export filesystem ID. Otherwise the ID might change on reboot, which will prevent the share from being re-connected.
* https://linux.die.net/man/5/exports
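For illustration (the server path, network and fsid value here are made up), a stable export entry in /etc/exports could look like:
/media/nfs_ssd 192.168.1.0/24(rw,sync,no_subtree_check,fsid=1)
Giving each exported path its own fixed fsid means the filesystem ID no longer depends on the underlying device, so it survives server reboots and failovers.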
What about NFS HA? (regarding fsid)
What about NFS HA? (regarding fsid)
NFS HA maintains the fsid. If you set up an NFS cluster, you handle your fsid yourself, or else it doesn't work very well. For stand-alone systems, the fsid is derived from the device id, but not for clusters.
I wrote some considerations on the forum thread about this issue and report the most important ones here. It seems that nfs.py already supports user options and those get appended to the defaults; mount keeps the last option, so if the default is soft and the user appends hard, soft,hard = hard. The same goes for timeo and retrans. The Linux VM that goes read-only is probably due to a default in Ubuntu: there is an option in the superblock of ext2/3/4 controlling the behaviour when errors are encountered. RHEL, on the other side, does not remount read-only and will continue (retry) to perform I/O on the disk. It remains to be verified whether the error is propagated to userspace or stays at the filesystem level inside the VM.
Using hard as the default is risky in my opinion. I have to say that on servers I usually set hard,intr in order to protect poorly written application software from receiving I/O errors, while the intr option still lets me kill the process if I need to umount the filesystem. I say it's risky because if you use a lot of different NFS storage and only one goes down for a long period, you get a semi-frozen dom0. It remains to be verified what happens to XAPI and normal operation: whether you can ignore the one broken NFS SR and keep working, or whether XAPI or other daemons running on dom0 get stuck listing mount points or accessing the broken SR. I think nobody wants to reboot a host because the NFS SR holding ISO files is down. For short downtimes, raising the NFS mount options retrans (default 3) or timeo (default 100) could be enough. The ideal solution is to have the single VM retrying on a soft mount without going read-only, so it's easy to manually recover the filesystem without rebooting the host over a stale NFS mount point. It seems that Windows has a nice default behaviour, and RHEL should too; the problem could be limited to Ubuntu or other distros (to be verified).
The Linux VM that goes read-only is probably due to a default in Ubuntu: there is an option in the superblock of ext2/3/4 controlling the behaviour when errors are encountered. RHEL, on the other side, does not remount read-only and will continue (retry) to perform I/O on the disk. It remains to be verified whether the error is propagated to userspace or stays at the filesystem level inside the VM.
This is incorrect. All Linux servers I have had the pleasure of working with - RHEL 5/6/7, CentOS, Oracle Linux, Ubuntu and some more - all of them mount by default with the errors=remount-ro directive. You have to explicitly change this behaviour for your Linux not to fail(!) when NFS performs a failover with a soft mount.
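For reference, this can be checked and changed inside the guest (device names here are just examples): tune2fs -l /dev/xvda1 | grep -i behav shows the superblock default, tune2fs -e continue /dev/xvda1 switches it to 'continue', and the errors= mount option in /etc/fstab overrides the superblock setting.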
XAPI and SM-related tasks are handled independently per SR - check the logs. I agree that the ISO SR should remain soft (this can still crash VMs, but it is less of a problem because the ISO is read-only anyway), so my patch (and the proposed change to the GUI) is to use the 'hard' mount option for VM data disks and 'soft' for ISO SRs.
on servers I usually set hard,intr in order to protect poorly written application software from receiving I/O errors, while the intr option still lets me kill the process if I need to umount the filesystem.
According to https://linux.die.net/man/5/nfs the intr mount option is deprecated. However, it should still be possible to kill a process. In this case it must be one of the Xen services reading from the stale NFS share. Not sure how feasible killing it is. Is it tapdisk?
I did one test yesterday with a Windows Server VM on a hard-mounted NFS server that I took offline for ~30 minutes. The VM froze and I got NFS timeouts in the XCP-ng server's dmesg, but once I started the NFS server the freeze stopped and things went back to normal.
This did not previously work when I had the soft mount option and had not specified the fsid export option. Then XCP-ng would not reconnect and would wait forever with a stale mount.
I made a test with Ubuntu Server 19.10, installed with default settings and without LVM. The filesystem is mounted with the 'continue' error behaviour by default (same as I see on a RHEL 7):
root@ubuntu01:~# tune2fs -l /dev/xvda2 | grep -i behav
Errors behavior: Continue
I tested with a script that updates a file every second on the VM. The test consists in running exportfs -uav on the NFS server to take the share down and exportfs -rv to bring it back online. With the default SR options soft,timeo=100,retrans=3, the VM does not detect a problem for about 1 minute (I didn't measure the time precisely). After 5 minutes of downtime the root fs gets remounted read-only. On the XCP-ng host I see the df command block for about 10-20 seconds and then return its output. Once NFS comes back it is almost instantly remounted.
I repeated the test with retrans=360; I expected the client not to receive an error for a long time, but I was wrong. After about 5 minutes the root fs of the VM got remounted read-only.
I investigated the timeout parameter of the disk, normally in /sys/block/sd*/device/timeout, but it seems the Xen disk does not export this parameter. I was confident that, with no timeout, a default infinite wait was implemented, but now I think I was wrong.
I still have to understand what really happens: whether the VM gets the I/O error from dom0 and then remounts read-only earlier than I expected (timeo=100 and retrans=360 should retry for about 1 hour), or whether the timeout is internal to the VM's kernel and, once exceeded, the fs is remounted read-only. The first case means that for some reason the NFS parameters are not enforced, while the second case means that even with a hard mount you should see the problem. So right now I am missing something.
Some more tests. It turns out that one possible problem was how I conducted the test: I used unexport/export, and this seems to trigger error reporting to userspace even before the timeout expires. I tried with timeo=3000,retrans=10, but after about 50 seconds the VM remounted read-only and an ls command on the XCP-ng host returned an error after a few seconds instead of waiting. This was with unexport/export.
I then tried with null routing as suggested on the forum: ip route add <xcp host ip/32> via 127.0.0.1 dev lo to block all traffic between the NFS server and the XCP-ng host, and then ip route del to roll back. Now, after 5 minutes, the VM does not get an error with timeo=3000,retrans=10 and commands on the host like df block; the NFS mount honours the configured timeouts.
I'm going to retest with timeo=100,retrans=360 to be sure it works and to verify how the TCP timeouts interact.
I think this tells us two things:
- the xvda disk does not have timeouts
- in case of IP failover on the NFS server, it should be safer to create the exports first and then configure the IP, rather than vice versa; this lets the share be present from the first moment the VIP is reachable again and avoids errors being propagated to userspace
Just a quick word to say that this discussion is very interesting, whatever the outcome will be. I'm following it closely.
in case of IP failover on the NFS server, it should be safer to create the exports first and then configure the IP, rather than vice versa; this lets the share be present from the first moment the VIP is reachable again and avoids errors being propagated to userspace
This is because this is a 'soft' mount. With hard mounts, the system would keep retrying the mount even when the share is not yet presented on the destination IP.