sanoid Make separate syncoid exit code for transient errors

Open amartin3225 opened this issue 2 years ago • 2 comments

I run syncoid in an A -> B -> C configuration and frequently get failures because either the target dataset on B "is already target of a zfs receive process" or the "dataset is busy". These are transient problems because A is syncing to B while B is also syncing to C, so I'd like to distinguish them from real errors that won't resolve themselves on a subsequent run. Would you accept a PR that sets the non-zero exit codes as follows?

exit code 1 - these might be intentional in some cases (e.g. ignore an empty parent dataset when using -r):

warn "CRITICAL: no snapshots exist on source $sourcefs, and you asked for --no-sync-snap.\n";
warn "WARN: --no-sync-snap is set, and getnewestsnapshot() could not find any snapshots on source for current dataset. Continuing.\n";

exit code 2 - as noted in the above A -> B -> C scenario, these can be normal when multiple syncs overlap:

warn "Cannot sync now: $targetfs is already target of a zfs receive process.\n";
print "WARN: resetting partially receive state\n";

exit code 3 - anything else (these are actual errors)

With the exit codes separated as outlined above, I would only need to investigate when syncoid exits with 3, since I know the "dataset is busy" and other transient errors will resolve themselves in a later run.
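As a rough illustration of how a caller could use this scheme, here is a minimal Python wrapper sketch. It assumes the exit codes 1, 2 and 3 proposed in this issue, which are not necessarily what any released syncoid returns; `classify` and `run_syncoid` are hypothetical helper names.

```python
import subprocess
from typing import Sequence

# Exit codes as PROPOSED in this issue (an assumption, not confirmed syncoid behavior).
EXIT_OK = 0
EXIT_MAYBE_INTENTIONAL = 1  # e.g. empty parent dataset when using -r
EXIT_TRANSIENT = 2          # e.g. target is already receiving
EXIT_REAL_ERROR = 3         # anything else


def classify(exit_code: int) -> str:
    """Map an exit code to the action a monitoring job might take."""
    if exit_code == EXIT_OK:
        return "ok"
    if exit_code == EXIT_MAYBE_INTENTIONAL:
        return "ignore"
    if exit_code == EXIT_TRANSIENT:
        return "retry-later"
    # Code 3 and any unexpected code are treated as real errors.
    return "investigate"


def run_syncoid(args: Sequence[str]) -> str:
    """Run syncoid with the given arguments and classify its exit code."""
    result = subprocess.run(["syncoid", *args])
    return classify(result.returncode)
```

A cron job could call `run_syncoid([...])` with its usual arguments and alert only when the result is "investigate". Unknown codes deliberately fall through to "investigate" so that nothing unexpected is silently ignored.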

Sep 29 '21 11:09 amartin3225

I would love to see this as well.

Dec 07 '21 19:12 benyanke

I implemented this in the above PR for this transient error:

warn "Cannot sync now: $targetfs is already target of a zfs receive process.\n";

After looking into it more closely, I realized that we cannot distinguish the "WARN: resetting partially receive state" case (aka "dataset is busy") from other CRITICAL errors, because this error originates in the ZFS command itself, which doesn't signal it with a different exit code. We could search for this string in $stdout and use a different exit code if it is found, but I'm concerned that this could mask more severe errors (e.g. if the ZFS command printed both "WARN: resetting partially receive state" and another, more severe error).

Moreover, the following cases should be handled for empty parent datasets by setting the syncoid:sync property or using --exclude, so we should keep the existing behavior for them:

warn "CRITICAL: no snapshots exist on source $sourcefs, and you asked for --no-sync-snap.\n";
warn "WARN: --no-sync-snap is set, and getnewestsnapshot() could not find any snapshots on source for current dataset. Continuing.\n";
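To make the masking concern about searching the command output more concrete, here is a small Python sketch of a conservative version of that idea. It is not what the PR does: the function name `pick_exit_code`, the marker list, and the exit codes 2 and 3 (taken from the proposal in this issue) are all assumptions for illustration. It reports a transient failure only when every non-empty output line matches a known transient marker.

```python
# Substrings assumed to indicate a transient condition (illustrative list).
TRANSIENT_MARKERS = (
    "WARN: resetting partially receive state",
    "dataset is busy",
)

EXIT_TRANSIENT = 2   # proposed code for transient errors
EXIT_REAL_ERROR = 3  # proposed code for everything else


def pick_exit_code(output: str) -> int:
    """Choose an exit code for a FAILED command based on its captured output.

    Returns EXIT_TRANSIENT only if all non-empty lines contain a known
    transient marker; any unrecognized line is treated as a real error.
    """
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        # No output to inspect: do not guess, report a real error.
        return EXIT_REAL_ERROR
    for ln in lines:
        if not any(marker in ln for marker in TRANSIENT_MARKERS):
            return EXIT_REAL_ERROR
    return EXIT_TRANSIENT
```

The trade-off is that any extra line in the output, even a harmless one, downgrades the result to a real error. That errs on the side of over-reporting rather than masking a severe failure, but it also means the check may rarely fire in practice.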

Feb 22 '22 14:02 amartin3225
