nfs-ganesha
Q: why does "add_clid" in "struct nfs4_recovery_backend" return void instead of a return value?
For an HA setup, there are several candidate recovery backend implementations. However, since "add_clid" returns void, the caller has no way to check whether it succeeded. When using remote storage such as "rados_ng" as the recovery backend, or even the local filesystem, it is possible for the write to fail. In that case, am I going to lose this client record during failover, meaning the reclaim from this client will be rejected?
void (*add_clid)(nfs_client_id_t *);
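To illustrate the concern, here is a minimal sketch of a backend implementation. This is hypothetical, not the actual rados_ng code; "store_record_somewhere" is an invented helper standing in for the real persistence call (e.g. a RADOS write). With a void vtable slot, a store failure can only be logged and is invisible to the caller.

/* Hypothetical backend sketch, not the actual rados_ng code.
 * Assumes the usual Ganesha headers for nfs_client_id_t and LogEvent. */
static void example_add_clid(nfs_client_id_t *clientid)
{
	int rc = store_record_somewhere(clientid);	/* invented helper */

	if (rc < 0) {
		/* With a void vtable slot, logging is all we can do;
		 * the caller never learns the record was not persisted. */
		LogEvent(COMPONENT_CLIENTID,
			 "failed to persist recovery record: %d", rc);
	}
}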
Looking forward to hearing from you. Thanks in advance!
We probably should check for failure... It wouldn't be too hard to do that.
What would we do if there was a failure? Fail the client? What is the relative importance of recovery vs. getting actual work done?
@ffilz @dang Thanks for the reply! So my question is: what are the suggested actions to take if we encounter such a failure? Or are there any data consistency issues under such circumstances? In other words, if the side effect of such a failure is just losing some un-fsynced data, that is acceptable and makes sense. But if it is going to lose some fsynced data, then it has to be taken care of. Please correct me if I misunderstood anything. Thanks in advance!
No fsynced data would be lost, exactly, since that isn't dependent on state recovery. What could be lost is byte-range locks: if a client was in the middle of a complex write transaction protected by locks and it wasn't able to reclaim its lock, it may have only written part of the update.
Now, something to consider with failing the client: what happens if the client is trying to establish a new clientid after a Ganesha failure and we hit this kind of failure? Then we have locked the client out of reclaim in a different way. So this isn't just keeping the client from establishing possibly unrecoverable state.
The question is what the nature of the failures that prevent us from registering is, and what the best way to secure the system as a whole is when those failures occur. One failure is simply the storage used for the recovery database filling up. Another is that storage having I/O errors. I don't know if there are any other failure modes.
One option is that maybe these should be fatal errors.
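As a rough sketch of that option (hypothetical; it assumes the backend has been changed to report an rc), a failed recovery-record write would simply bring Ganesha down rather than let it keep running with a recovery database that no longer matches reality:

/* Sketch of the "fatal error" option, assuming a hypothetical rc
 * reported by the backend; LogFatal aborts the server. */
if (rc != 0)
	LogFatal(COMPONENT_CLIENTID,
		 "could not persist recovery record (rc=%d)", rc);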
Looks like there's some consensus to return NFS4ERR_SERVERFAULT in this case instead of blindly proceeding.
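Roughly, that direction looks like the following sketch. This is not the actual patch; the changed signatures are assumptions for illustration. The vtable slot gains an int return, and the caller maps failure to NFS4ERR_SERVERFAULT instead of proceeding.

/* Sketch only, not the actual patch. Assumes the usual Ganesha and
 * NFSv4 headers for nfs_client_id_t, nfsstat4, NFS4ERR_SERVERFAULT. */
struct nfs4_recovery_backend {
	/* ...other operations unchanged... */
	int (*add_clid)(nfs_client_id_t *);	/* was void */
};

/* Hypothetical caller, returning an NFSv4 status instead of void so
 * the protocol layer can reject the operation on failure. */
nfsstat4 nfs4_add_clid(nfs_client_id_t *clientid)
{
	if (recovery_backend->add_clid(clientid) != 0)
		return NFS4ERR_SERVERFAULT;

	return NFS4_OK;
}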
Patch submitted https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553781?usp=search
There is some question as to whether we really want to do this; the patch has not been merged yet.
Is this still an issue?