Quickly check that we have a Fedora copr-backend backup.
The backups should be on the storinator box.
We need a howto document as an output from this ticket.
# for i in $(ls -1 /var/log/cron-*.xz | tac); do xzcat $i | grep rsnapshot; done
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (rsnapshot encountered an error! The program was invoked with these options:)
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (/bin/rsnapshot -c /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf push )
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (ERROR: Could not write lockfile /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.pid: No space left on device)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT ( File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 58, in <module>)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT ( File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 51, in _main)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT ( File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 42, in rotate)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (subprocess.CalledProcessError: Command '['/bin/rsnapshot', '-c', '/srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf', 'push']' returned non-zero exit status 1.)
Sep 17 21:06:28 copr-be CROND[1470129]: (copr) CMDEND (ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null)
Sep 14 01:01:02 copr-be CROND[1470229]: (copr) CMD (ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null)
$ lvresize /dev/VG_nfs/copr-be -L +8TB
$ xfs_growfs /srv/nfs/copr-be/
$ df -h /srv/nfs/copr-be/
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/VG_nfs-copr--be   48T   40T  8.1T  84% /srv/nfs/copr-be
Running ionice --class=idle /usr/local/bin/rsnapshot_copr_backend manually.
Still doing the rsync :-( and we seem to be running out of space again:
/dev/mapper/VG_nfs-copr--be 48T 47T 1.7T 97% /srv/nfs/copr-be
I would remove the old increments, but that would probably break the currently running rsnapshot process. I'll keep the sync going for now and wait for a potential failure; if it really fails, I'll remove the old increments and restart rsnapshot.
OK, going with /bin/rm -rf push.3 push.2 push.1 push.0 first, keeping the latest .sync directory.
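For the howto, the manual cleanup above can be sketched as a small guard: refuse to delete the rotated increments while an rsnapshot run still holds the lockfile (the pidfile path is taken from the error messages above; the function name and layout are assumptions for illustration).

```shell
# Hedged sketch: prune rotated increments only when no rsnapshot run is
# active, keeping the newest .sync directory untouched.
prune_increments() {
    snapdir=$1
    pidfile=$snapdir/rsnapshot.pid
    # if the pidfile exists and the process is alive, back off
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        echo "rsnapshot still running, not touching increments" >&2
        return 1
    fi
    # drop the rotated increments; .sync stays as the base for the next run
    rm -rf "$snapdir"/push.3 "$snapdir"/push.2 "$snapdir"/push.1 "$snapdir"/push.0
}
```

On the backend this would be called as `prune_increments /srv/nfs/copr-be/copr-be-copr-user`.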
[copr@copr-be ~][PROD]$ ionice --class=idle /usr/local/bin/rsnapshot_copr_backend
Warning: Permanently added 'storinator01.rdu-cc.fedoraproject.org' (ED25519) to the list of known hosts.
building file list ...
rsync: [sender] opendir "/var/lib/copr/public_html/archive/issues/copr-3016" failed: Permission denied (13)
Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]
33,898,430,947 0% 1.08MB/s 8:17:17 (xfr#69052, to-chk=26461/64855270)
/var/lib/copr/public_html/temp/
/var/lib/copr/public_html/temp/issue-3067/
/var/lib/copr/public_html/usage-2019-08-04/
/var/lib/copr/public_html/usage4/
33,898,430,947 0% 1.08MB/s 8:17:17 (xfr#69052, to-chk=0/64855270)
rsync: [receiver] stat "var/lib/copr/public_html/temp/issue-3067" (in push) failed: No such file or directory (2)
33,898,430,947 0% 1.08MB/s 8:17:17 (xfr#69052, to-chk=0/64855270)
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/bin/rsnapshot -c /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf push
----------------------------------------------------------------------------
ERROR: Could not write lockfile /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.pid: No space left on device
Traceback (most recent call last):
File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 58, in <module>
_main()
File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 51, in _main
rotate(database)
File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 42, in rotate
subprocess.check_call(cmd)
File "/usr/lib64/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/bin/rsnapshot', '-c', '/srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf', 'push']' returned non-zero exit status 1.
sent 34,368,856,397 bytes received 217,861,583 bytes 272,708.92 bytes/sec
total size is 41,526,683,163,969 speedup is 1,200.65
Starting with: /dev/mapper/VG_nfs-copr--be 48T 345G 48T 1% /srv/nfs/copr-be
Hmmm
Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]
real 1038m41.824s
user 67m3.939s
sys 85m51.315s
Even though storinator's sshd was up:
● sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: enabled)
Active: active (running) since Sat 2024-10-05 22:14:15 UTC; 3 days ago
13,669,445,228,841 66% 31.94MB/s 58:19:50
Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]
real 11032m7.826s
user 821m15.285s
sys 946m50.620s
# 5h
ServerAliveInterval 20
ServerAliveCountMax 900
ConnectTimeout 120
Before this, I tried with 20 / 5 / 60.
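The "# 5h" comment on the config above follows from how the ssh client keepalives multiply: the connection survives an unresponsive server for ServerAliveInterval * ServerAliveCountMax seconds before ssh gives up. A quick check of the arithmetic:

```shell
# Keepalive budget implied by the ssh_config values above
interval=20   # ServerAliveInterval (seconds between keepalive probes)
count=900     # ServerAliveCountMax (unanswered probes tolerated)
echo "$(( interval * count ))s = $(( interval * count / 3600 ))h"
```

The previous 20 / 5 setting gave up after only 100 seconds, which matches the repeated "Timeout, server ... not responding" failures above.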
The first rsync run finished, and the config above probably helped; so the first backup round is done, but we still need to fix ansible.git.
Fixed: https://pagure.io/fedora-infra/ansible/c/5cffe17cd8856b14fef8b858ba1dd12dfec43dd3
Running again (second increment; the deletes seem to be done correctly).
From triage: let Konflux folks know, let PULP folks know
The last run started 2024-11-05 07:00 AM and ended 2024-11-07 03:00 AM, after ~44 hours. It succeeded, transferring 2TB of data, which is the increment since the previous run finished (~2024-11-02). IOW, 2TB per 6 days, which is ~10TB/month. Hmm.
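The ~10TB/month figure above is a straight extrapolation of 2TB per 6 days to a 30-day month (integer arithmetic, so approximate):

```shell
# Back-of-envelope increment rate from the last run
tb=2; days=6
echo "$(( tb * 30 / days )) TB/month"
```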
It doesn't seem that the last backup run hit any "build peak"; :shrug: so the increments might be worse sometimes, if build peaks appear.
Then, since we keep 4 weekly increments and need space for a 5th "in progress" increment, storinator should provide at least as much space as the backend consumes (~30TB) plus space for 5 increments (~12TB) = 42T. df claims 48T, and we can get +16T more (the volume group allows it, but we need to ask fedora infra first).
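The capacity estimate above works out as follows, assuming roughly 2.4TB per weekly increment (a value derived here from the ~10TB/month rate, not measured directly):

```shell
# Minimum backup-volume size: backend data plus 4 weekly increments
# and a 5th one in progress (per-increment size is an assumption)
awk 'BEGIN { backend = 30; incr = 2.4; n = 5
             printf "need at least %.0fT, have 48T\n", backend + n * incr }'
```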
That said, everything seems OK right now, but we should keep monitoring the next two incremental backups to be sure the increments fit well into the backup volume.
I'd like to take a bit more from the VG: https://pagure.io/fedora-infrastructure/issue/12280
Enlarged the volume by +6T. The last backup has been running for 3.5 days already.