rsync-time-backup icon indicating copy to clipboard operation
rsync-time-backup copied to clipboard

rsync errors are not detected when the log does not contain the error

Open fthiery opened this issue 7 years ago • 13 comments

Doing a backup towards a pretty unreliable/slow NFS mount which tends to hog. I added --timeout 10 to the rsync flags, and tried to loop-run the script:

rsync_tmbackup.sh /data/ /backup/rsync/data exclude.txt
while [ $? -ne 0 ]; do
    rsync_tmbackup.sh /data/ /backup/rsync/data exclude.txt
done

However apparently rsync_tmpbackup.sh does return 0 in case of timeout.

rsync error: timeout in data send/receive (code 30) at io.c(820) [sender=3.1.1]
rsync_tmbackup: [WARNING] Rsync reported a warning, please check '/root/.rsync_tmbackup/2017-02-07-182327.log' for more details.
rsync_tmbackup: Backup completed without errors.

fthiery avatar Feb 07 '17 17:02 fthiery

Strange, I cannot replicate this. The script searches for the string "rsync error" in the log file and, if found, display an error and return an error code "1". It's working fine for me:

ssh: connect to host google.com port 22: Network is unreachable
[sender] io timeout after 1 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(195) [sender=3.1.1]
rsync_tmbackup: [ERROR] Rsync reported an error, please check '/home/laurent/.rsync_tmbackup/........

Then echo $? returns 1 as expected. Could you confirm this is still happening for you?

laurent22 avatar May 02 '17 11:05 laurent22

I think that in my case, the timeout happens during rsync (with an open connexion), not before.

fthiery avatar May 03 '17 13:05 fthiery

Are you able to replicate this consistently? If so, please could you give it another try and check the log generated by the application? In particular, does this log includes the string "rsync error:"? Since this is what we use to detect errors - it appears in the output you posted but perhaps not in the log.

laurent22 avatar May 08 '17 13:05 laurent22

Tried again,

rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --timeout 10 --log-file test.log /data/ .
...
some files transferred
[sender] io timeout after 10 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(195) [sender=3.1.1]

The test.log file contains:

2017/05/09 19:02:02 [11613] [sender] io timeout after 10 seconds -- exiting
2017/05/09 19:02:02 [11613] rsync error: timeout in data send/receive (code 30) at io.c(195) [sender=3.1.1]

fthiery avatar May 09 '17 17:05 fthiery

Currently trying with the last rsync_tmbackup.sh variant (i just removed --compress and added --timeout 10 to the flags or the timeout is way too long), and it does report a non-0 error now.

rsync error: timeout in data send/receive (code 30) at io.c(820) [sender=3.1.1]
rsync_tmbackup: [WARNING] Rsync reported a warning. Run this command for more details: grep -E 'rsync:|rsync error:' '/root/.rsync_tmbackup/2017-05-09-192858.log'
root@ms1:/backup# echo $?
1

Did you change anything ? Looks like it's working as expected now.

Thanks for investigating.

Btw should't we define a default timeout ? I waited a long time and my patience ran out before the default timeout.

fthiery avatar May 09 '17 17:05 fthiery

I'm afraid the problem happened again, the script is on the latest commit. The script is launched with a systemd unit:

# /etc/systemd/system/backup-diskarray.service
[Unit]
Description=backup-diskarray

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/rsync_tmbackup \
  --rsync-set-flags "-D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30" \
  --strategy "1:1 30:0" \
  /data/diskarray \
  /backup/diskarray \
  /backup/excluded_patterns.txt
Apr 28 22:00:00 backup rsync_tmbackup[14023]: rsync_tmbackup: Running command:
Apr 28 22:00:00 backup rsync_tmbackup[14023]: rsync_tmbackup: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file '/.
...
Apr 28 22:07:13 backup rsync_tmbackup[14023]: >f+++++++++ random.file
Apr 28 22:26:28 backup rsync_tmbackup[14023]: [generator] io timeout after 31 seconds -- exiting
Apr 28 22:26:28 backup rsync_tmbackup[14023]: rsync error: timeout in data send/receive (code 30) at io.c(204) [generator=3.1.3]
Apr 28 22:30:38 backup rsync_tmbackup[14023]: rsync: [sender] write error: Broken pipe (32)
Apr 28 22:30:38 backup rsync_tmbackup[14023]: rsync error: timeout in data send/receive (code 30) at io.c(829) [sender=3.1.3]
Apr 28 22:30:38 backup rsync_tmbackup[14023]: grep: /.rsync_tmbackup/2020-04-28-220000.log: No such file or directory
Apr 28 22:30:38 backup rsync_tmbackup[14023]: grep: /.rsync_tmbackup/2020-04-28-220000.log: No such file or directory
Apr 28 22:30:38 backup rsync_tmbackup[14023]: grep: /.rsync_tmbackup/2020-04-28-220000.log: No such file or directory
Apr 28 22:30:38 backup rsync_tmbackup[14023]: rsync_tmbackup: Backup completed without errors.
Apr 28 22:30:38 backup systemd[1]: backup-diskarray.service: Succeeded.
Apr 28 22:30:38 backup systemd[1]: Started backup-diskarray.

fthiery avatar Apr 29 '20 07:04 fthiery

When the error appears, the script is still running. NB below i changed the unit type to "simple" wich should better reflect the actual status.

You can see that it takes 18 minutes for the timeout error to appear, and that when it happens there still are the actual rsync process seems to still be running.

root@backup:/backup/diskarray# systemctl status backup-diskarray.service
● backup-diskarray.service - backup-diskarray
   Loaded: loaded (/etc/systemd/system/backup-diskarray.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2020-04-29 08:47:39 CEST; 2h 4min ago
 Main PID: 28480 (bash)
    Tasks: 2 (limit: 4915)
   Memory: 26.3G
   CGroup: /system.slice/backup-diskarray.service
           ├─28480 bash /usr/local/sbin/rsync_tmbackup --rsync-set-flags -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --strateg
           └─28623 rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file /.rsync_tmbackup/2020-04-29-084739.log --exclu
...
Apr 29 10:22:01 backup rsync_tmbackup[28480]: >f+++++++++ somefile 
Apr 29 10:40:13 backup rsync_tmbackup[28480]: [generator] io timeout after 30 seconds -- exiting
Apr 29 10:40:13 backup rsync_tmbackup[28480]: rsync error: timeout in data send/receive (code 30) at io.c(204) [generator=3.1.3]

root@backup:/backup/diskarray# ps aux | grep rsync
root     28480  0.0  0.0   6776  2376 ?        Ss   08:47   0:00 bash /usr/local/sbin/rsync_tmbackup --rsync-set-flags -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --strategy 1:1 30:0 /data/diskarray /backup/diskarray /backup/excluded_patterns.txt
root     28623 22.6  0.2  90588 77484 ?        D    08:47  28:22 rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file /.rsync_tmbackup/2020-04-29-084739.log --exclude-from /backup/excluded_patterns.txt --link-dest=/backup/diskarray/2020-04-28-220000 -- /data/diskarray/ /backup/diskarray/2020-04-29-084739/
root     29794  0.0  0.0   6208   892 pts/2    S+   10:52   0:00 grep --color=auto rsync

fthiery avatar Apr 29 '20 08:04 fthiery

Actually:

  • it takes approx 18 min for rsync to timeout (despite the --timeout 30 arg)
  • then it takes 1h20 for rsync to change dirs, which errors out with a broken pipe error
  • at this point, rsync_tmbackup says "Backup completed without errors." which is plain wrong
Apr 29 10:22:01 backup rsync_tmbackup[28480]: >f+++++++++ somedir/somefile
Apr 29 10:40:13 backup rsync_tmbackup[28480]: [generator] io timeout after 30 seconds -- exiting
Apr 29 10:40:13 backup rsync_tmbackup[28480]: rsync error: timeout in data send/receive (code 30) at io.c(204) [generator=3.1.3]
Apr 29 11:57:58 backup rsync_tmbackup[28480]: cd+++++++++ somedir
Apr 29 11:57:58 backup rsync_tmbackup[28480]: rsync: [sender] write error: Broken pipe (32)
Apr 29 11:57:58 backup rsync_tmbackup[28480]: rsync error: timeout in data send/receive (code 30) at io.c(829) [sender=3.1.3]
Apr 29 11:57:58 backup rsync_tmbackup[28480]: rsync_tmbackup: Backup completed without errors.
Apr 29 11:57:58 backup systemd[1]: backup-diskarray.service: Succeeded.

fthiery avatar Apr 29 '20 10:04 fthiery

I think that maybe in some conditions the log file does not get created or the error may not be present. Currently the script does not seem to check the returncode of rsync, i'll try to check it and report.

diff --git a/rsync_tmbackup.sh b/rsync_tmbackup.sh
index 1bc24ee..cc56403 100755
--- a/rsync_tmbackup.sh
+++ b/rsync_tmbackup.sh
@@ -565,6 +565,7 @@ while : ; do
 
        fn_run_cmd "echo $MYPID > $INPROGRESS_FILE"
        eval $CMD
+       CMD_RETURNCODE=$?
 
        # -----------------------------------------------------------------------------
        # Check if we ran out of space
@@ -601,6 +602,8 @@ while : ; do
                fn_log_error "Rsync reported an error. Run this command for more details: grep -E 'rsync:|rsync error:' '$LOG_FILE'"
        elif [ -n "$(grep "rsync:" "$LOG_FILE")" ]; then
                fn_log_warn "Rsync reported a warning. Run this command for more details: grep -E 'rsync:|rsync error:' '$LOG_FILE'"
+       elif [ $CMD_RETURNCODE -ne 0 ]; then
+               fn_log_error "Rsync returned non-zero return code, backup failed."
        else
                fn_log_info "Backup completed without errors."
                if [[ $AUTO_DELETE_LOG == "1" ]]; then

fthiery avatar Apr 29 '20 11:04 fthiery

Actually, i some cases the log path is not right, so the log does not get created at all

rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --timeout 30 --log-file '/.rsync_tmbackup/2020-04-16-220007.log' --exclude-from '/backup/excluded_patterns.txt' --link-dest='/backup/diskarray/2020-04-15-220019' -- '/data/diskarray/' '/backup/diskarray/2020-04-16-220006/'

Apr 16 02:36:35 backup rsync_tmbackup[18967]: grep: /.rsync_tmbackup/2020-04-15-220019.log: No such file or directory

Because the error detection logic seems 100% based on looking at the log, that's probably the issue and my approach above should catch it

fthiery avatar Apr 29 '20 11:04 fthiery

I confirm my changes fix the wrong status interpretation. I'll open a PR

Apr 29 14:51:25 backup rsync_tmbackup[30254]: cd..t...... somedir
Apr 29 14:51:25 backup rsync_tmbackup[30254]: rsync: [sender] write error: Broken pipe (32)
Apr 29 14:51:25 backup rsync_tmbackup[30254]: rsync error: timeout in data send/receive (code 30) at io.c(829) [sender=3.1.3]
Apr 29 14:51:25 backup rsync_tmbackup[30254]: rsync_tmbackup: [ERROR] Rsync returned non-zero return code, backup failed.
Apr 29 14:51:25 backup systemd[1]: backup-diskarray.service: Main process exited, code=exited, status=1/FAILURE
Apr 29 14:51:25 backup systemd[1]: backup-diskarray.service: Failed with result 'exit-code'.
Apr 29 14:51:25 backup systemd[1]: backup-diskarray.service: Triggering OnFailure= dependencies.

fthiery avatar Apr 29 '20 12:04 fthiery

Looking at the logfile, the last line is: 2020/04/29 14:51:25 [30406] cd..t...... somefolder

This means that rsync is not writing this error into the logfile

fthiery avatar Apr 29 '20 13:04 fthiery

https://github.com/laurent22/rsync-time-backup/pull/209

fthiery avatar Apr 29 '20 15:04 fthiery