nfs-ganesha

Q: Why does "add_clid" in "struct nfs4_recovery_backend" return void instead of a status value?

Open shuoranliu opened this issue 1 year ago • 8 comments

For an HA setup, there are several candidate recovery backend implementations. However, the function "add_clid" returns void, so its caller cannot check whether the record was actually stored. When using remote storage as the recovery backend, such as "rados_ng", or even the local filesystem, the write can fail. In that case, will I lose this client's record during failover, meaning that the reclaim from this client will be rejected?

void (*add_clid)(nfs_client_id_t *);

Look forward to hearing from you. Thanks in advance!

shuoranliu, May 05 '23 07:05

We probably should check for failure... It wouldn't be too hard to do that.

ffilz, May 05 '23 23:05

What would we do if there was a failure? Fail the client? What is the relative importance of recovery vs. getting actual work done?

dang, May 08 '23 13:05

@ffilz @dang Thanks for the reply! So my question is: what are the suggested actions to take if we encounter such a failure? And are there any data consistency issues in that case? In other words, if the only side effect of such a failure is losing some un-fsynced data, that is acceptable and makes sense. But if fsynced data can be lost, then it has to be taken care of. Please correct me if I misunderstood anything. Thanks in advance!

shuoranliu, May 08 '23 14:05

No fsynced data would be lost, since that isn't dependent on state recovery. What could be lost is byte range locks, so if a client was in the middle of a complex write transaction protected by locks and it wasn't able to reclaim its lock, it may have only written part of the update.

Something else to consider about failing the client is what happens if the client is trying to establish a new clientid after a Ganesha failure and we hit this kind of failure. Then we have locked the client out of reclaim in a different way. So this isn't just keeping the client from starting possibly unrecoverable state.

The question is what kinds of failures cause us not to register the client, and what is the best way to secure the system as a whole when those failures occur. One failure is simply the admin storage used for the recovery database filling up. Another is that storage having errors. I don't know if there are any other errors.

One option is that maybe these should be fatal errors.
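As an illustration of the two failure modes mentioned above (storage filling up, storage returning errors), here is a minimal sketch of how a filesystem-backed write could surface them to its caller. This is not Ganesha code: `sketch_store_record`, its arguments, and the file layout are hypothetical, and the only assumption is a POSIX system. The point is simply that `open`, `write`, `fsync`, and `close` each report failures such as `ENOSPC` or `EIO`, which a `void` function has no way to pass on.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of nfs-ganesha): persist one client
 * record to a file. Returns 0 on success, or a negative errno value
 * (for example -ENOSPC or -EIO) if any step fails.
 */
int sketch_store_record(const char *path, const char *record)
{
    assert(path != NULL && record != NULL);

    size_t len = strlen(record);
    size_t done = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -errno;

    /* write() may be short or interrupted, so loop until done. */
    while (done < len) {
        ssize_t n = write(fd, record + done, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            int err = errno;

            close(fd);
            return -err;
        }
        done += (size_t)n;
    }

    /* The record is only durable once fsync() succeeds. */
    if (fsync(fd) != 0) {
        int err = errno;

        close(fd);
        return -err;
    }

    if (close(fd) != 0)
        return -errno;

    return 0;
}
```

A caller could then decide per error whether to retry, fail the client, or treat the condition as fatal, which is the policy question being discussed here.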

ffilz, May 08 '23 15:05

Looks like there's some consensus to return NFS4ERR_SERVERFAULT in this case instead of blindly proceeding.
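A rough sketch of what that could look like, under stated assumptions: `add_clid` is changed to return an `int` (0 on success), and the caller maps any failure to NFS4ERR_SERVERFAULT. All `sketch_` names below are hypothetical stand-ins and not the real Ganesha types or the submitted patch; only the numeric value 10006 for NFS4ERR_SERVERFAULT is taken from the NFSv4 RFCs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Status codes as defined by the NFSv4 RFCs (names prefixed to mark
 * them as local to this sketch). */
enum sketch_nfsstat4 {
    SKETCH_NFS4_OK = 0,
    SKETCH_NFS4ERR_SERVERFAULT = 10006
};

/* Hypothetical stand-in for nfs_client_id_t. */
struct sketch_client {
    const char *name;
};

/* Hypothetical variant of the backend ops table in which add_clid
 * reports success (0) or failure (negative errno). */
struct sketch_recovery_backend {
    int (*add_clid)(struct sketch_client *clientid);
};

/* Two toy backends: one always succeeds, one behaves as if the
 * recovery storage were full. */
int sketch_add_clid_ok(struct sketch_client *clientid)
{
    (void)clientid;
    return 0;
}

int sketch_add_clid_full(struct sketch_client *clientid)
{
    (void)clientid;
    return -ENOSPC;
}

/* Caller side: refuse to proceed if the client could not be recorded. */
enum sketch_nfsstat4
sketch_confirm_client(const struct sketch_recovery_backend *backend,
                      struct sketch_client *clientid)
{
    assert(backend != NULL && backend->add_clid != NULL);

    if (backend->add_clid(clientid) != 0)
        return SKETCH_NFS4ERR_SERVERFAULT;

    return SKETCH_NFS4_OK;
}
```

With this shape, a backend that cannot persist the record makes the confirming operation fail visibly instead of leaving a client that would later be unable to reclaim.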

ffilz, May 11 '23 18:05

Patch submitted: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553781?usp=search

ffilz, May 11 '23 20:05

There is some question as to whether we really want to do this, so the patch is not merged yet.

ffilz, Jun 13 '23 23:06

Is this still an issue?

ffilz, Aug 23 '24 19:08