Check quickly that we have Fedora copr-backend backup

Open praiskup opened this issue 1 year ago • 1 comments

The backups should be on storinator box.

praiskup avatar Aug 28 '24 12:08 praiskup

We need a howto document (output from this ticket).

praiskup avatar Aug 28 '24 12:08 praiskup

# for i in $(ls -1 /var/log/cron-*.xz | tac); do xzcat $i | grep rsnapshot; done
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (rsnapshot encountered an error! The program was invoked with these options:)
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (/bin/rsnapshot -c /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf push )
Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (ERROR: Could not write lockfile /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.pid: No space left on device)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 58, in <module>)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 51, in _main)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 42, in rotate)
Sep 17 21:06:27 copr-be CROND[1470129]: (copr) CMDOUT (subprocess.CalledProcessError: Command '['/bin/rsnapshot', '-c', '/srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf', 'push']' returned non-zero exit status 1.)
Sep 17 21:06:28 copr-be CROND[1470129]: (copr) CMDEND (ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null)
Sep 14 01:01:02 copr-be CROND[1470229]: (copr) CMD (ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null)

praiskup avatar Sep 20 '24 08:09 praiskup
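For the howto, the log check above could also be scripted. The following is a minimal sketch, assuming the cron log lines keep the `CROND[pid]: (user) CMDOUT (...)` shape shown in the output above; the function name `rsnapshot_errors` and the regular expression are illustrative, not part of any existing Copr tooling.

```python
import re

# Matches cron lines of the shape seen in the log excerpt, e.g.
#   "Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (ERROR: ...)"
CMDOUT_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]+) \S+ CROND\[\d+\]: \(\w+\) CMDOUT \((?P<msg>.*)\)$"
)


def rsnapshot_errors(lines):
    """Return (timestamp, message) pairs for CMDOUT lines that start with 'ERROR:'."""
    found = []
    for line in lines:
        match = CMDOUT_RE.match(line.rstrip("\n"))
        if match and match.group("msg").startswith("ERROR:"):
            found.append((match.group("stamp"), match.group("msg")))
    return found


# Two lines copied from the log excerpt above.
sample = [
    "Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (/bin/rsnapshot -c /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf push )",
    "Sep 17 21:06:26 copr-be CROND[1470129]: (copr) CMDOUT (ERROR: Could not write lockfile /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.pid: No space left on device)",
]

for stamp, msg in rsnapshot_errors(sample):
    print(stamp, msg)
```

The rotated logs are xz-compressed, so in practice the lines would likely come from something like `lzma.open(path, "rt")` in the standard library rather than from a hard-coded list.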

$ lvresize /dev/VG_nfs/copr-be -L +8TB
$ xfs_growfs /srv/nfs/copr-be/
$ df -h /srv/nfs/copr-be/
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/VG_nfs-copr--be   48T   40T  8.1T  84% /srv/nfs/copr-be

praiskup avatar Sep 20 '24 08:09 praiskup

Running ionice --class=idle /usr/local/bin/rsnapshot_copr_backend manually.

praiskup avatar Sep 20 '24 08:09 praiskup

Still doing the rsync :-( and we seem to be running out of space again:

/dev/mapper/VG_nfs-copr--be 48T 47T 1.7T 97% /srv/nfs/copr-be

praiskup avatar Sep 24 '24 06:09 praiskup
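Since both failures so far came down to "No space left on device", a pre-flight free-space check might be worth mentioning in the howto. This is only a sketch: `has_headroom` and the 2 TiB threshold are hypothetical, and it would have to run on the host where /srv/nfs/copr-be is mounted.

```python
import shutil

TIB = 1024 ** 4  # df -h reports sizes in powers of 1024


def has_headroom(path, needed_bytes):
    """Return True if the filesystem holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes


if __name__ == "__main__":
    # Hypothetical threshold; "." stands in for the real mount point,
    # which on the backup host would be /srv/nfs/copr-be.
    if not has_headroom(".", 2 * TIB):
        print("less than 2 TiB free, consider pruning old increments first")
```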

I would remove the old increments, but that would probably break the current rsnapshot process. I'll keep the sync going for now, and wait for the potential failure (if it really fails, I'll remove old increments, and then restart rsnapshot).

praiskup avatar Sep 24 '24 06:09 praiskup

Ok, going with /bin/rm -rf push.3 push.2 push.1 push.0 first, keeping the last .sync

praiskup avatar Sep 27 '24 11:09 praiskup

[copr@copr-be ~][PROD]$ ionice --class=idle /usr/local/bin/rsnapshot_copr_backend
Warning: Permanently added 'storinator01.rdu-cc.fedoraproject.org' (ED25519) to the list of known hosts.
building file list ... 
rsync: [sender] opendir "/var/lib/copr/public_html/archive/issues/copr-3016" failed: Permission denied (13)
Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]

praiskup avatar Oct 05 '24 07:10 praiskup

 33,898,430,947   0%    1.08MB/s    8:17:17 (xfr#69052, to-chk=26461/64855270)
/var/lib/copr/public_html/temp/
/var/lib/copr/public_html/temp/issue-3067/
/var/lib/copr/public_html/usage-2019-08-04/
/var/lib/copr/public_html/usage4/
 33,898,430,947   0%    1.08MB/s    8:17:17 (xfr#69052, to-chk=0/64855270)
rsync: [receiver] stat "var/lib/copr/public_html/temp/issue-3067" (in push) failed: No such file or directory (2)
 33,898,430,947   0%    1.08MB/s    8:17:17 (xfr#69052, to-chk=0/64855270)
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:

rsnapshot encountered an error! The program was invoked with these options:
/bin/rsnapshot -c /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf push 
----------------------------------------------------------------------------
ERROR: Could not write lockfile /srv/nfs/copr-be/copr-be-copr-user/rsnapshot.pid: No space left on device
Traceback (most recent call last):
  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 58, in <module>
    _main()
  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 51, in _main
    rotate(database)
  File "/srv/nfs/copr-be/copr-be-copr-user/rsnapshot", line 42, in rotate
    subprocess.check_call(cmd)
  File "/usr/lib64/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/bin/rsnapshot', '-c', '/srv/nfs/copr-be/copr-be-copr-user/rsnapshot.conf', 'push']' returned non-zero exit status 1.


sent 34,368,856,397 bytes  received 217,861,583 bytes  272,708.92 bytes/sec
total size is 41,526,683,163,969  speedup is 1,200.65

praiskup avatar Oct 07 '24 11:10 praiskup

Starting with: /dev/mapper/VG_nfs-copr--be 48T 345G 48T 1% /srv/nfs/copr-be

praiskup avatar Oct 07 '24 13:10 praiskup

Hmmm

Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.
rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]

real    1038m41.824s
user    67m3.939s
sys     85m51.315s

Even though storinator's sshd is running:

● sshd.service - OpenSSH server daemon
     Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-10-05 22:14:15 UTC; 3 days ago

praiskup avatar Oct 09 '24 06:10 praiskup

13,669,445,228,841  66%   31.94MB/s   58:19:50
Timeout, server storinator01.rdu-cc.fedoraproject.org not responding.

rsync: [sender] write error: Broken pipe (32)
rsync error: unexplained error (code 255) at io.c(848) [sender=3.3.0]

real    11032m7.826s
user    821m15.285s
sys     946m50.620s

praiskup avatar Oct 21 '24 06:10 praiskup

# 5h
ServerAliveInterval 20
ServerAliveCountMax 900
ConnectTimeout 120

Before that, I tried 20 / 5 / 60 (ServerAliveInterval / ServerAliveCountMax / ConnectTimeout).

praiskup avatar Oct 21 '24 07:10 praiskup
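The "# 5h" comment in the ssh config follows from multiplying the two keepalive options: ssh sends a probe every ServerAliveInterval seconds and gives up after ServerAliveCountMax consecutive unanswered probes. A small sketch of that arithmetic, where the helper name is made up for illustration and the "before" values assume the 20 / 5 / 60 triple is in the same option order as the config:

```python
def keepalive_tolerance_seconds(interval, count_max):
    """Approximate time ssh tolerates an unresponsive server before disconnecting.

    One probe is sent every `interval` seconds; the connection is dropped
    after `count_max` consecutive probes go unanswered.
    """
    return interval * count_max


new = keepalive_tolerance_seconds(20, 900)  # settings from the config above
old = keepalive_tolerance_seconds(20, 5)    # assumed earlier settings

print(f"new: {new} s = {new / 3600:.1f} h")
print(f"old: {old} s")
```

With the earlier values, a stall of under two minutes on storinator's side would already have been enough to drop the connection, which fits the "Timeout, server ... not responding" failures.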

First rsync run finished, and the config above probably helped; so the first backup round is done but we still need to fix ansible.git.

praiskup avatar Oct 29 '24 07:10 praiskup

Fixed: https://pagure.io/fedora-infra/ansible/c/5cffe17cd8856b14fef8b858ba1dd12dfec43dd3

Running again (second increment; deletes seem to be done correctly).

praiskup avatar Oct 30 '24 10:10 praiskup

From triage: let Konflux folks know, let PULP folks know

praiskup avatar Nov 06 '24 06:11 praiskup

Last run started 2024-11-05 07:00 AM, ended 2024-11-07 03:00 AM, after ~44 hours. Succeeded. Transferred 2TB of data, which is the increment since the last run finished (~2024-11-02). IOW 2TB for 6 days, which is ~10TB/month. Hmm.

It doesn't seem that the last backup run hit any "build peak"; :shrug: so the increments might be worse sometimes, if build peaks appear.

Then, since we keep 4 weekly increments and need space for a 5th "in progress" increment, storinator should provide at least as much space as the backend consumes (~30TB) plus space for 5 increments (~12TB), i.e. ~42TB. df reports 48T, and we can get +16T more (the volume group allows it, but we need to ask Fedora infra first).

That said, everything seems OK right now -> but we should keep monitoring the next two incremental backups, to be sure that the increments fit well into the backup volume.

praiskup avatar Nov 07 '24 15:11 praiskup
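The sizing estimate above can be re-run whenever new numbers come in. A rough sketch using only the figures from the comment (2TB per 6 days, ~30TB backend, 5 increments, 48T volume); the function names are illustrative, and deriving the ~12TB from a weekly rate is an assumption about how that figure was obtained:

```python
def growth_tb(increment_tb, days, period_days):
    """Scale an observed increment to a longer period, assuming steady growth."""
    return increment_tb * period_days / days


def required_capacity_tb(backend_tb, weekly_increment_tb, increments):
    """Space for one full copy of the backend plus the kept/in-progress increments."""
    return backend_tb + weekly_increment_tb * increments


monthly = growth_tb(2, 6, 30)   # observed: 2TB in 6 days
weekly = growth_tb(2, 6, 7)
needed = required_capacity_tb(30, weekly, 5)

print(f"monthly growth: ~{monthly:.0f} TB")
print(f"weekly increment: ~{weekly:.1f} TB")
print(f"needed: ~{needed:.0f} TB of 48 TB available")
```

If a "build peak" doubles a weekly increment, the same function shows how quickly the remaining headroom shrinks, which is the reason for monitoring the next few runs.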

I'd like to take a bit more from the VG: https://pagure.io/fedora-infrastructure/issue/12280

praiskup avatar Nov 11 '24 17:11 praiskup

Enlarged volume +6T. Last backup has been running for 3.5 days already.

praiskup avatar Nov 12 '24 08:11 praiskup