rsync-time-backup
rsync-time-backup copied to clipboard
rsync errors are not detected when the log does not contain the error
Doing a backup towards a pretty unreliable/slow NFS mount which tends to hog. I added --timeout 10 to the rsync flags, and tried to loop-run the script:
rsync_tmbackup.sh /data/ /backup/rsync/data exclude.txt
while [ $? -ne 0 ]; do
rsync_tmbackup.sh /data/ /backup/rsync/data exclude.txt
done
However apparently rsync_tmpbackup.sh does return 0 in case of timeout.
rsync error: timeout in data send/receive (code 30) at io.c(820) [sender=3.1.1]
rsync_tmbackup: [WARNING] Rsync reported a warning, please check '/root/.rsync_tmbackup/2017-02-07-182327.log' for more details.
rsync_tmbackup: Backup completed without errors.
Strange, I cannot replicate this. The script searches for the string "rsync error" in the log file and, if found, display an error and return an error code "1". It's working fine for me:
ssh: connect to host google.com port 22: Network is unreachable
[sender] io timeout after 1 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(195) [sender=3.1.1]
rsync_tmbackup: [ERROR] Rsync reported an error, please check '/home/laurent/.rsync_tmbackup/........
Then echo $?
returns 1
as expected. Could you confirm this is still happening for you?
I think that in my case, the timeout happens during rsync (with an open connexion), not before.
Are you able to replicate this consistently? If so, please could you give it another try and check the log generated by the application? In particular, does this log includes the string "rsync error:"? Since this is what we use to detect errors - it appears in the output you posted but perhaps not in the log.
Tried again,
rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --timeout 10 --log-file test.log /data/ .
...
some files transferred
[sender] io timeout after 10 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(195) [sender=3.1.1]
The test.log file contains:
2017/05/09 19:02:02 [11613] [sender] io timeout after 10 seconds -- exiting
2017/05/09 19:02:02 [11613] rsync error: timeout in data send/receive (code 30) at io.c(195) [sender=3.1.1]
Currently trying with the last rsync_tmbackup.sh variant (i just removed --compress and added --timeout 10 to the flags or the timeout is way too long), and it does report a non-0 error now.
rsync error: timeout in data send/receive (code 30) at io.c(820) [sender=3.1.1]
rsync_tmbackup: [WARNING] Rsync reported a warning. Run this command for more details: grep -E 'rsync:|rsync error:' '/root/.rsync_tmbackup/2017-05-09-192858.log'
root@ms1:/backup# echo $?
1
Did you change anything ? Looks like it's working as expected now.
Thanks for investigating.
Btw should't we define a default timeout ? I waited a long time and my patience ran out before the default timeout.
I'm afraid the problem happened again, the script is on the latest commit. The script is launched with a systemd unit:
# /etc/systemd/system/backup-diskarray.service
[Unit]
Description=backup-diskarray
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/rsync_tmbackup \
--rsync-set-flags "-D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30" \
--strategy "1:1 30:0" \
/data/diskarray \
/backup/diskarray \
/backup/excluded_patterns.txt
Apr 28 22:00:00 backup rsync_tmbackup[14023]: rsync_tmbackup: Running command:
Apr 28 22:00:00 backup rsync_tmbackup[14023]: rsync_tmbackup: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file '/.
...
Apr 28 22:07:13 backup rsync_tmbackup[14023]: >f+++++++++ random.file
Apr 28 22:26:28 backup rsync_tmbackup[14023]: [generator] io timeout after 31 seconds -- exiting
Apr 28 22:26:28 backup rsync_tmbackup[14023]: rsync error: timeout in data send/receive (code 30) at io.c(204) [generator=3.1.3]
Apr 28 22:30:38 backup rsync_tmbackup[14023]: rsync: [sender] write error: Broken pipe (32)
Apr 28 22:30:38 backup rsync_tmbackup[14023]: rsync error: timeout in data send/receive (code 30) at io.c(829) [sender=3.1.3]
Apr 28 22:30:38 backup rsync_tmbackup[14023]: grep: /.rsync_tmbackup/2020-04-28-220000.log: No such file or directory
Apr 28 22:30:38 backup rsync_tmbackup[14023]: grep: /.rsync_tmbackup/2020-04-28-220000.log: No such file or directory
Apr 28 22:30:38 backup rsync_tmbackup[14023]: grep: /.rsync_tmbackup/2020-04-28-220000.log: No such file or directory
Apr 28 22:30:38 backup rsync_tmbackup[14023]: rsync_tmbackup: Backup completed without errors.
Apr 28 22:30:38 backup systemd[1]: backup-diskarray.service: Succeeded.
Apr 28 22:30:38 backup systemd[1]: Started backup-diskarray.
When the error appears, the script is still running. NB below i changed the unit type to "simple" wich should better reflect the actual status.
You can see that it takes 18 minutes for the timeout error to appear, and that when it happens there still are the actual rsync process seems to still be running.
root@backup:/backup/diskarray# systemctl status backup-diskarray.service
● backup-diskarray.service - backup-diskarray
Loaded: loaded (/etc/systemd/system/backup-diskarray.service; static; vendor preset: enabled)
Active: active (running) since Wed 2020-04-29 08:47:39 CEST; 2h 4min ago
Main PID: 28480 (bash)
Tasks: 2 (limit: 4915)
Memory: 26.3G
CGroup: /system.slice/backup-diskarray.service
├─28480 bash /usr/local/sbin/rsync_tmbackup --rsync-set-flags -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --strateg
└─28623 rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file /.rsync_tmbackup/2020-04-29-084739.log --exclu
...
Apr 29 10:22:01 backup rsync_tmbackup[28480]: >f+++++++++ somefile
Apr 29 10:40:13 backup rsync_tmbackup[28480]: [generator] io timeout after 30 seconds -- exiting
Apr 29 10:40:13 backup rsync_tmbackup[28480]: rsync error: timeout in data send/receive (code 30) at io.c(204) [generator=3.1.3]
root@backup:/backup/diskarray# ps aux | grep rsync
root 28480 0.0 0.0 6776 2376 ? Ss 08:47 0:00 bash /usr/local/sbin/rsync_tmbackup --rsync-set-flags -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --strategy 1:1 30:0 /data/diskarray /backup/diskarray /backup/excluded_patterns.txt
root 28623 22.6 0.2 90588 77484 ? D 08:47 28:22 rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file /.rsync_tmbackup/2020-04-29-084739.log --exclude-from /backup/excluded_patterns.txt --link-dest=/backup/diskarray/2020-04-28-220000 -- /data/diskarray/ /backup/diskarray/2020-04-29-084739/
root 29794 0.0 0.0 6208 892 pts/2 S+ 10:52 0:00 grep --color=auto rsync
Actually:
- it takes approx 18 min for rsync to timeout (despite the --timeout 30 arg)
- then it takes 1h20 for rsync to change dirs, which errors out with a broken pipe error
- at this point, rsync_tmbackup says "Backup completed without errors." which is plain wrong
Apr 29 10:22:01 backup rsync_tmbackup[28480]: >f+++++++++ somedir/somefile
Apr 29 10:40:13 backup rsync_tmbackup[28480]: [generator] io timeout after 30 seconds -- exiting
Apr 29 10:40:13 backup rsync_tmbackup[28480]: rsync error: timeout in data send/receive (code 30) at io.c(204) [generator=3.1.3]
Apr 29 11:57:58 backup rsync_tmbackup[28480]: cd+++++++++ somedir
Apr 29 11:57:58 backup rsync_tmbackup[28480]: rsync: [sender] write error: Broken pipe (32)
Apr 29 11:57:58 backup rsync_tmbackup[28480]: rsync error: timeout in data send/receive (code 30) at io.c(829) [sender=3.1.3]
Apr 29 11:57:58 backup rsync_tmbackup[28480]: rsync_tmbackup: Backup completed without errors.
Apr 29 11:57:58 backup systemd[1]: backup-diskarray.service: Succeeded.
I think that maybe in some conditions the log file does not get created or the error may not be present. Currently the script does not seem to check the returncode of rsync, i'll try to check it and report.
diff --git a/rsync_tmbackup.sh b/rsync_tmbackup.sh
index 1bc24ee..cc56403 100755
--- a/rsync_tmbackup.sh
+++ b/rsync_tmbackup.sh
@@ -565,6 +565,7 @@ while : ; do
fn_run_cmd "echo $MYPID > $INPROGRESS_FILE"
eval $CMD
+ CMD_RETURNCODE=$?
# -----------------------------------------------------------------------------
# Check if we ran out of space
@@ -601,6 +602,8 @@ while : ; do
fn_log_error "Rsync reported an error. Run this command for more details: grep -E 'rsync:|rsync error:' '$LOG_FILE'"
elif [ -n "$(grep "rsync:" "$LOG_FILE")" ]; then
fn_log_warn "Rsync reported a warning. Run this command for more details: grep -E 'rsync:|rsync error:' '$LOG_FILE'"
+ elif [ $CMD_RETURNCODE -ne 0 ]; then
+ fn_log_error "Rsync returned non-zero return code, backup failed."
else
fn_log_info "Backup completed without errors."
if [[ $AUTO_DELETE_LOG == "1" ]]; then
Actually, i some cases the log path is not right, so the log does not get created at all
rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file '/.rsync_tmbackup/2020-04-16-220007.log' --exclude-from '/backup/excluded_patterns.txt' --link-dest='/backup/diskarray/2020-04-15-220019' -- '/data/diskarray/' '/backup/diskarray/2020-04-16-220006/'
Apr 16 02:36:35 backup rsync_tmbackup[18967]: grep: /.rsync_tmbackup/2020-04-15-220019.log: No such file or directory
Because the error detection logic seems 100% based on looking at the log, that's probably the issue and my approach above should catch it
I confirm my changes fix the wrong status interpretation. I'll open a PR
Apr 29 14:51:25 backup rsync_tmbackup[30254]: cd..t...... somedir
Apr 29 14:51:25 backup rsync_tmbackup[30254]: rsync: [sender] write error: Broken pipe (32)
Apr 29 14:51:25 backup rsync_tmbackup[30254]: rsync error: timeout in data send/receive (code 30) at io.c(829) [sender=3.1.3]
Apr 29 14:51:25 backup rsync_tmbackup[30254]: rsync_tmbackup: [ERROR] Rsync returned non-zero return code, backup failed.
Apr 29 14:51:25 backup systemd[1]: backup-diskarray.service: Main process exited, code=exited, status=1/FAILURE
Apr 29 14:51:25 backup systemd[1]: backup-diskarray.service: Failed with result 'exit-code'.
Apr 29 14:51:25 backup systemd[1]: backup-diskarray.service: Triggering OnFailure= dependencies.
Looking at the logfile, the last line is:
2020/04/29 14:51:25 [30406] cd..t...... somefolder
This means that rsync is not writing this error into the logfile
https://github.com/laurent22/rsync-time-backup/pull/209