backintime
backintime copied to clipboard
Master issue for "error 23" (rsync returns with exit code 23: Partial transfer due to errors)
This issue is meant to centralize all "error 23" issues to discuss a common solution...
Background
BiT Version 1.4.0 (2023-09-14) introduced the evaluation of rsync
exit codes for better error recognition:
https://github.com/bit-team/backintime/blob/726bffa624d9a6da4ebd13ceb40721e1290e1eaf/CHANGES#L32-L42
Before this change rsync
errors reported only via exit code were hidden and left users with the feeling that everything worked well (but didn't).
Collected reasons for rsync
exit code 23
The final error message in the snapshot log is always
[I] Take snapshot (rsync: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7]) [E] 'rsync' ended with exit code 23: See 'man rsync' for more details
Here are typical snapshot log entries causing exit code 23 (shall be extended here to be extended step by step to get a holistic picture):
-
[I] Take snapshot (rsync: symlink has no referent: "/home/user/Documents/dead-link")
-
[E] Error: rsync: [sender] send_files failed to open "/home/user/Documents/root_only_file.txt": Permission denied (13)
Other known reasons: See https://github.com/bit-team/backintime/issues/1587#issuecomment-1856400319
Current impact
Some of the reasons for exit code 23 may be out of control of the BiT user or are OK to be ignored since not considered as "severe" (eg. dead links).
Despite that BiT snapshots do always fail with an error now.
Even though rsync
treats exit code 23 always as error the BiT user should have the freedom to treat some reasons for exit code 23 as INFO or WARNING only...
Possible solutions
-
Add exit code 23 to the ignore list (as before) and let the user think everything went OK (until looking into the snapshot log) https://github.com/bit-team/backintime/blob/726bffa624d9a6da4ebd13ceb40721e1290e1eaf/common/snapshots.py#L1247-L1252
-
Find and use
rsync
options so that we can ignore minor exit code 23 reasons: https://unix.stackexchange.com/questions/598736/how-to-skip-copying-broken-links-with-rsync-copy-links -
Introduce a new snapshot result state "Warning" (currently we have only "OK" and "with errors) https://github.com/bit-team/backintime/blob/726bffa624d9a6da4ebd13ceb40721e1290e1eaf/common/snapshots.py#L1110-L1131
Challenges
rsync
has several reasons for exit code 23 and always treats it as an error.
Related issues
- #1555
- #1573
It might help or give us some more ideas if we would involve the rsync-folks and -maintainer via their mailing list.
I assume there is a reason why multiple problems are summarized with one single error code.
FYI: I have just posted my questions at the rsync mailing list and also looked into the relevant (C) source code of rsync
https://github.com/WayneD/rsync/blob/2f9b963abaa52e44891180fe6c0d1c2219f6686d/flist.c#L1271C1-L1282C28
but without debugging rsync
(which I won't do ;-) I see no way to find out how to suppress this error (one of many that leads to exit code 23):
symlink has no referent: %s
Internal note: I have examined the rsync
source code and identified these possibles reasons (citing the error msg format string here) leading to exit code 23:
- receive_sums failed
- send_files failed to open %s
- fstat failed
- read errors mapping %s
- change_dir %s failed
- skipping overly long name: %s
- symlink has no referent: %s
- readlink_stat(%s) failed
- skipping file with bogus (zero) st_mode: %s
- skipping symlink with 0-length value: %s
- [%s] cannot convert filename: %s (%s)
- [%s] cannot convert symlink data for: %s (%s)
- get_acl(fname, &sx) < 0 // with no explicit error message!
- get_xattr(fname, &sx) < 0 // with no explicit error message!
- link_stat %s failed
- opendir %s failed
- filename overflows max-path len by %u: %s/%s
- cannot send file with empty name in %s
- readdir(%s)
- link_stat %s failed
- cannot add local filter rules in long-named directory: %s
Ideally, I would prefer the solution where an rsync-23 marks a snapshot as "with warnings" (in addition to "with errors"). Users could then clear the warnings (e.g. by renaming the snapshot).
I don't think it's worth the trouble of second-guessing rsync by trying to figure out why exactly a 23 was returned. Users can do that by looking at the logs.
With my current but limited knowledge I also do prefer "with warnings". But we need to add enough documentation about it. Because this will trigger user to ask questions in Issues and on the mailing list.
They have to investigate the log files etc pp
This is also related to our log-file-parsing issue. When we are able to parse the log file then we are able to give better feedback to the user which type of code 23 caused the warning. But in a first place I would say a link to a docu/faq would be enough in the first place.
Ideally, I would prefer the solution where an rsync-23 marks a snapshot as "with warnings" (in addition to "with errors"). [...] Users can do that by looking at the logs.
This would be my preferred (and perfect) solution too.
Unfortunately
-
our snapshot logs do currently support only
[I]
and[E]
(no warnings) -
the snapshot log viewer does not yet allow filtering on warnings
-
the above mentioned 21 reasons for exit code 23 are contained as
[I]
only in the snapshot logs[I] Take snapshot (rsync: symlink has no referent: "/home/user/Documents/dead-link")
-
rsync
does not provide a reliable way to allow our log parser to recognize the 21 reasons for exit code 23 except parsing the message text so we are not able to easily and reliably mark them as[W]
(without false positives or missed entries due torsync
msg text changes or reasons I have overseen in the code)=> this means users cannot even filter and check the log entries causing exit code 23 easily
User had to scroll through the complete snapshot log or search for keywords (which are determined by the msg strings of the 21
rsync
reasons) -
I am not sure if
rsync
's output may be translated into other languages (which would effectively render log parsing into nightmare ;-)Lukily in the source code I could find no indication for that but I will ask at the rsync mailing list.
All this is IMHO too risky and time consuming ATM (I prefer to prepare the next release for Jan, 2023 and the Qt6 release then).
I propose to
- DONE: Add a FAQ entry for exit code 23 (as @buhtz suggested)
-
DONE: Add exit code 23 to the exclusion list for until we have introduced warnings.
This effectively re-introduces the same behavior as before checking the
rsync
exit codes for errors
FYI: The new BiT release v1.4.3 does
- reintroduce handling rsync exit code 23 as info only (no longer an error when taking a snapshot)
- has an experimental filter now to make finding resync transfer errors easier (even if the are "hidden" as
[I]
nfo in length snapshot logs:
@aryoda
- reintroduce handling rsync exit code 23 as info only (no longer an error when taking a snapshot)
1.4.3 continues to make the snapshot fail:
[I] Take snapshot (rsync: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7])
[I] 'rsync' ended with exit code 23: Partial transfer due to error
[E] Failed to take snapshot 2024-02-06 20:03:01.
What's troubling is that the lock file has not been deleted within 300 seconds after the error has been detected in my user-callback
(SLEEP_THRESHOLD=300 and the script sent an email about that issue):
case "$rsync_return_code" in
0|24)
/usr/bin/nice -n 19 /usr/bin/ionice -c2 -n7 /usr/bin/backintime --profile "another_profile" backup
;;
23)
# Waiting for the current process to end and clear the lock file
sleep_counter=0
while [[ -e "$snapshot_lock_file" ]]
do
sleep 1
((sleep_counter++))
if [[ $sleep_counter -ge $SLEEP_THRESHOLD ]]; then
(
printf "%s\n" "Called from: $script_parameters"
printf "%s\n" "On: $date"
printf "%s\n" "profile_id: $profile_id"
printf "%s\n" "rsync_return_code: $rsync_return_code"
printf "%s\n" "snapshot_lock_file: $snapshot_lock_file"
) | mail -s "Failed ${host_name}: backintime lock file has not been deleted within $SLEEP_THRESHOLD seconds for the $profile_name profile" [email protected]
exit 1
fi
done
# Retaking the backup
/usr/bin/nice -n 19 /usr/bin/ionice -c2 -n7 /usr/bin/backintime --profile "$profile_name" backup
;;
*)
# Saving snapshot log
cp --force --preserve=all "${snapshot_log_file}" "${snapshot_log_file}.$(date --iso-8601=minutes)"
(
printf "%s\n" "Called from: $script_parameters"
printf "%s\n" "On: $date"
printf "%s\n" "profile_id: $profile_id"
printf "%s\n" "rsync_return_code: $rsync_return_code"
printf "%s\n" "snapshot_log_file: $snapshot_log_file"
) | mail -s "Failed ${host_name}: backintime rsync_return_code is unknown for the $profile_name profile" [email protected]
exit 1
;;
esac
Here is the matching log in takesnapshot.log
:
WARNING: user-callback returncode: 1
ERROR: Plugin usercallbackplugin processEnd failed:
INFO: Unlock'
WARNING: user-callback returncode: 1
ERROR: Plugin usercallbackplugin processEnd failed:
INFO: Unlock'
INFO: Unlock'
DEBUG: [common/tools.py:1134 keyringSupported] No keyring due to import error.
DEBUG: [common/mount.py:161 Mount.__init__] pw-cache is not running
DEBUG: [common/mount.py:169 Mount.__init__] Call command: /usr/bin/backintime pw-cache start
DEBUG: [plugins/usercallbackplugin.py:81 UserCallbackPlugin.callback] Call user-callback: /root/.config/backintime/user-callback 10 Data 8
INFO: [common/snapshots.py:851 Snapshots.backup] Unlock
Back In Time
Version: 1.4.3
@jean-christophe-manciot
-
Could you please start your backup manually via command line incl. the
--debug
option and send me the anonymized output (it seems the abovetakesnapshot.log
does not show the relevant critical execution path to diagnose this).I will try to find the execution path then that "forgets" to delete the lock file...
-
In the code I can see that the error may also be caused by a different reason than exit code 23 (when the
rsync
log is parsed and an error is detected). Can you find any snapshot log entry that contains the stringrsync:
(which is BiT's way of recognizing and returning anrsync
error)?
Internal note: The error recognition in the snapshot log is here:
https://github.com/bit-team/backintime/blob/9ea0d80db2d2b3398104b071568d07ca39fe85ea/common/snapshots.py#L917-L923
What's troubling is that the lock file has not been deleted within 300 seconds after the error has been detected in my
user-callback
(SLEEP_THRESHOLD=300
- Just to be sure which execution path I a have to debug:
while [[ -e "$snapshot_lock_file" ]]
references theworker*.lock
file? - In which "reason" ($3 of the user callback) your handler code is executed?
I need to check if the
worker*.lock
file is already deleted in this "reason" handler (or do you spawn a separate parallel process to execute your above script?)
@aryoda
In the code I can see that the error may also be caused by a different reason than exit code 23 (when the rsync log is parsed and an error is detected). Can you find any snapshot log entry that contains the string rsync: (which is BiT's way of recognizing and returning an rsync error)?
Actually, I do believe the error has been caused only by rsync error 23; here are the sole errors present in the same log:
[E] Error: rsync: [sender] read errors mapping "/some/file": No data available (61)
[E] Error: rsync: [sender] read errors mapping "/some/other/file": No data available (61)
[E] Error: rsync: [sender] read errors mapping "/some/different/file": No data available (61)
Just to be sure which execution path I a have to debug: while [[ -e "$snapshot_lock_file" ]] references the worker*.lock file?
snapshot_lock_file="/root/.local/share/backintime/worker${profile_id}.lock"
with profile_id=8 here
In which "reason" ($3 of the user callback) your handler code is executed? I need to check if the worker*.lock file is already deleted in this "reason" handler (or do you spawn a separate parallel process to execute your above script?)
$3=2 when the email was sent.
Could you please start your backup manually via command line incl. the --debug option and send me the anonymized output (it seems the above takesnapshot.log does not show the relevant critical execution path to diagnose this). I will try to find the execution path then that "forgets" to delete the lock file...
I will try when I find the time. This issue may be linked to the fact that each successful snapshot (rsync_return_code=0|24) calls another one in the user-callback script without waiting for the lock file to be deleted. I can spot in takesnapshot.log that for each successful snapshot, there is a
ERROR: user-callback returned 'INFO: Lock
INFO: Take a new snapshot. Profile: <number> <name>
and at the end of all snapshots, I get:
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
matching precisely the number of previous ERROR: user-callback returned 'INFO: Lock
+1 (for the last successful snapshot)
Unfortunately, I've just realized that the lock file is not removed as long as the user-callback is running, so the latter can wait forever for the lock file to be deleted before moving on... so it's a catch-22 situation!
The only way out is to delete the lock file inside the user-callback script just before moving on. Let's see what happens.
No luck, it does not work, the whole process is stopped when the lock file is manually deleted. This means that it seems impossible to retake the failed snapshot from within the user-callback after an rsync error 23.
Work around: Relax
rsync
exit code 23: Ignore instead of error now (part of #1587) 1.4.3 continues to make the snapshot fail:
I have "lost" the overall view, do you think there is an open bug in BiT (1.4.4-dev
version, above change is not yet released!) that I should fix?
No luck, it does not work, the whole process is stopped when the lock file is manually deleted. This means that it seems impossible to retake the failed snapshot from within the user-callback after an rsync error 23.
That is strange, I see now obvious indication in our source code why BiT should fail if an existing lock file is deleted.
I think you urgently need my proposed FR #1591 (on my list but currently I am focusing on our Qt5 to Qt6 migration of the GUI code)...
This issue may be linked to the fact that each successful snapshot (rsync_return_code=0|24) calls another one in the user-callback script without waiting for the lock file to be deleted.
It looks like you are using the usercallback
to implement special requirements to BiT that are currently available .
If you want you could describe your requirements in a new issue as FR and we could think about if and how to implement this in BiT instead...
- aryoda @.***> [02-12-24 04:00]:
Work around: Relax
rsync
exit code 23: Ignore instead of error now (part of #1587) 1.4.3 continues to make the snapshot fail:I have "lost" the overall view, do you think there is an open bug in BiT (
1.4.4-dev
version, above change is not yet released!) that I should fix?
I get these, code 23, occasionally on 1.4.4-dev. sporadically over last several months.
Message ID: @.***>
-- (paka)Patrick Shanahan Plainfield, Indiana, USA @ptilopteri http://en.opensuse.org openSUSE Community Member facebook/ptilopteri Photos: http://wahoo.no-ip.org/piwigo paka @ IRCnet oftc
- paka @.***> [02-12-24 08:34]:
- aryoda @.***> [02-12-24 04:00]:
Work around: Relax
rsync
exit code 23: Ignore instead of error now (part of #1587) 1.4.3 continues to make the snapshot fail:I have "lost" the overall view, do you think there is an open bug in BiT (
1.4.4-dev
version, above change is not yet released!) that I should fix?I get these, code 23, occasionally on 1.4.4-dev. sporadically over last several months.
Message ID: @.> Message ID: @.>
[I] Take snapshot (rsync: sent 3.07G bytes received 118.64K bytes 30.90M
bytes/sec)
[I] Take snapshot (rsync: total size is 56.00G speedup is 18.21)
[I] Take snapshot (rsync: rsync warning: some files vanished before they
could be
transferred (code 24) at main.c(1336) [sender=3.2.7])
[I] 'rsync' ended with exit code 24: Partial transfer due to vanished
source files
(see 'man rsync')
[I] Saving config file…
[I] Saving permissions…
1.4.4-def 2349c31ab3317eb0d9ab2dc16a89f186f1099077
-- (paka)Patrick Shanahan Plainfield, Indiana, USA @ptilopteri http://en.opensuse.org openSUSE Community Member facebook/ptilopteri Photos: http://wahoo.no-ip.org/piwigo paka @ IRCnet oftc
- paka @.***> [02-12-24 18:25]:
- paka @.***> [02-12-24 08:34]:
- aryoda @.***> [02-12-24 04:00]:
Work around: Relax
rsync
exit code 23: Ignore instead of error now (part of #1587) 1.4.3 continues to make the snapshot fail:I have "lost" the overall view, do you think there is an open bug in BiT (
1.4.4-dev
version, above change is not yet released!) that I should fix?I get these, code 23, occasionally on 1.4.4-dev. sporadically over last several months.
Message ID: @.> Message ID: @.>
[I] Take snapshot (rsync: sent 3.07G bytes received 118.64K bytes 30.90M bytes/sec) [I] Take snapshot (rsync: total size is 56.00G speedup is 18.21) [I] Take snapshot (rsync: rsync warning: some files vanished before they could be transferred (code 24) at main.c(1336) [sender=3.2.7]) [I] 'rsync' ended with exit code 24: Partial transfer due to vanished source files (see 'man rsync')
[I] Saving config file…
[I] Saving permissions…1.4.4-def 2349c31ab3317eb0d9ab2dc16a89f186f1099077
Message ID: @.***>
uups, this one is code 24.
-- (paka)Patrick Shanahan Plainfield, Indiana, USA @ptilopteri http://en.opensuse.org openSUSE Community Member facebook/ptilopteri Photos: http://wahoo.no-ip.org/piwigo paka @ IRCnet oftc
@aryoda
I have "lost" the overall view, do you think there is an open bug in BiT (
1.4.4-dev
version, above change is not yet released!) that I should fix?
So it seems that the rsync error 23 is no longer treated as an error in 1.4.4-dev
instead of 1.4.3
.
I think you urgently need my proposed FR #1591 (on my list but currently I am focusing on our Qt5 to Qt6 migration of the GUI code)...
Urgently is a strong word, I'd rather say ASAP, and definitely before very far away, someday, until "never"
...
It looks like you are using the
usercallback
to implement special requirements to BiT that are currently available .If you want you could describe your requirements in a new issue as FR and we could think about if and how to implement this in BiT instead...
I think you mean "special requirements to BiT that are currently NOT available" (or did I missed something?). You seem to refer to the ability to chain multiple profiles. I'll try to open a new FR about it.
@aryoda
I have "lost" the overall view, do you think there is an open bug in BiT (
1.4.4-dev
version, above change is not yet released!) that I should fix?So it seems that the rsync error 23 is no longer treated as an error in
1.4.4-dev
instead of1.4.3
.
Yes. I hope this does not cause new problems in your workflow (mainly: user callback script), my goal was to relax this again until we introduce "warnings" as third "take snapshot" result type...
I think you urgently need my proposed FR #1591 (on my list but currently I am focusing on our Qt5 to Qt6 migration of the GUI code)...
Urgently is a strong word, I'd rather say ASAP, and definitely before
very far away, someday, until "never"
...
OK, I will check the source code next week to estimate how much work (and possibly risks) this implies...
It looks like you are using the
usercallback
to implement special requirements to BiT that are currently available . If you want you could describe your requirements in a new issue as FR and we could think about if and how to implement this in BiT instead...I think you mean "special requirements to BiT that are currently NOT available" (or did I missed something?). You seem to refer to the ability to chain multiple profiles. I'll try to open a new FR about it.
Ah yes, I forgot the word "not" and meant indeed "currently not available" :-)
If you think your requirements help to improve BiT and are also helpful for others a PR is really welcome!
@jean-christophe-manciot Would it be possible to help me finalizing the requirements for #1591 (esp. how to cope with user callback calls in case of errors). I continue the conversation related to "retries" in #1591 ...
Yes. I hope this does not cause new problems in your workflow (mainly: user callback script), my goal was to relax this again until we introduce "warnings" as third "take snapshot" result type...
No worries. I'll adapt.