Remote backup hangs indefinitely every time a specific file is reached
Have you checked borgbackup docs, FAQ, and open Github issues?
Yes
Is this a BUG / ISSUE report or a QUESTION?
ISSUE
System information. For client/server mode post info for both machines.
Client: Raspberry Pi 4B, 4GB RAM, on kernel 5.10.52-v7l+ (64bit CPU, 32bit OS)
Remote: Oracle VPS, 1GB RAM, on kernel 5.8.0-1037-oracle (x64 system)
Your borg version (borg -V).
Both the client and the server are using borg 1.1.17
Operating system (distribution) and version.
Client: Raspbian GNU/Linux 10 (buster) armv7l (RaspiOS Lite - 32bit version)
Remote: Ubuntu Server 20.04.2 LTS x86_64
Hardware / network configuration, and filesystems used.
Client: backing up from a 1TB HDD connected via USB 3.0, formatted as ext4 (pushing to the remote).
Remote: receiving files to its SSD, also formatted as ext4.
The client (Pi) is on my home network, connecting to the remote (VPS) in Germany using borg over SSH.
How much data is handled by borg?
Total size of files being pushed is 26GiB, consisting of ~11400 photos.
Full borg commandline that led to the problem (leave out excludes and passwords)
export BORG_REPO=pesav@vps-alpha:/backups/raspivault
export BORG_PASSPHRASE=<repo_password>
export BORG_FILES_CACHE_TTL=50
export BORG_HOSTNAME_IS_UNIQUE=yes
borg create --compression zstd,9 ::'{now}' /mnt/disk-storage/Foto
Describe the problem you're observing.
This is a very strange one. Every time I try to do a backup (script above), it starts successfully, acquiring the lock and starting to send files. When it reaches a certain folder, after around ~23GB of processed files, it hangs indefinitely* (without any output, even with --verbose) until I force a ^C; it then stops with "Failed to create/acquire the lock (timeout)" and "Connection closed by remote"*. I can do many backup attempts, and all fail when reaching files inside a certain folder (I've done 8 attempts: 3 failed on one file, 3 on a second file and 2 on a third, all "close" to one another). The strange thing is that those files are photos, and I can open them, so they're definitely not corrupted... and also, why should borg "fail to acquire the lock" when it has been doing the backup until that point?
Note: this could be a network issue, but then it should not fail so consistently with the same files every time, or at least not if they aren't corrupted. There is another borg repo (on a local drive) to which I back up the same files, and that one doesn't fail, which further proves it's not a problem with those particular files.
- When I say "it hangs indefinitely": I left it in that state for at least 18 hours and nothing happened, and given that the offending files are only ~3MB each, I think it's impossible it could have recovered.
- Also, I think "Connection closed by remote" is caused by the fact that I am pushing to the remote: when I do a ^C the local process stops, which also terminates the borg serve command on the remote, so I don't think this is the problem.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
This problem occurs every time I try to back up to the remote repo, even after deleting it and recreating a new one (then doing a backup with the commands pasted above).
The command I use to create the repo is borg init --encryption=repokey pesav@vps-alpha:/backups/raspivault
Include any warning/errors/backtraces from the system logs
Nothing gets logged to syslog, on either the client side or the server side.
You use a rather large FILES_CACHE_TTL of 50 - do you really need that?
Also, zstd,9 is a rather high (expensive) compression - do you run this on already compressed photos?
borg does not use syslog; log output is on stderr.
Something you could try on the problematic repo:
- make sure no borg is / will run against that repo
- use borg break-lock on that repo
- try the problematic backup again - does it work now?
I don't actually need the large files cache - I only set that because I saw another user having problems with files, so I tried a larger cache TTL to see if it would solve the problem. It didn't change anything, so I might as well revert to the default value.
Regarding compression, I set it quite high because my upload speed to the remote is the limiting factor here, so I prefer to "spend" more CPU time on the client side and reduce upload time as much as possible. As the files are all photos I don't think it makes an (appreciable) difference, but I figured something is better than nothing. (Could this be a cause of the problem, trying to compress already-compressed data? I could try a backup without compression if it's useful.)
Also, I failed to mention I was monitoring the backup from start to end from an interactive terminal (later even with --verbose, --progress and --stats) - so no error was written to stderr (or stdout), except the above "Connection closed by remote" and "Failed to acquire lock" :/
I already tried doing the steps you suggested, sadly with no success - the backup starts fine (so it can acquire the lock) until it reaches those files, and then it outputs the same error.
After I stop the backup with ^C, the lock doesn't get removed, but even if I do borg break-lock and retry the backup, the same problem occurs.
Update: I tried to do a backup with the following commands:
export BORG_REPO=pesav@vps-alpha:/backups/raspivault
export BORG_PASSPHRASE=<repo_password>
export BORG_HOSTNAME_IS_UNIQUE=yes
borg create --compression none --verbose --progress --stats ::'{now}' /mnt/disk-storage/Foto
The outcome didn't change: it hangs on the same files as above, except when I kill the backup with ^C it doesn't write "Failed to acquire the lock" anymore. I found out why: in previous tests I had a borg prune command running after the backup, which I completely missed. Of course it fails to acquire the lock, because interrupting the backup in progress doesn't remove the previously created lock.
The problem remains though: the backup hangs on the same files every time, for no apparent reason, until you kill it with ^C.
I'm renaming the issue title, as "Failed to acquire the lock" was raised by borg prune, not the problematic borg create.
When borg is apparently stuck, does top show borg using any CPU? Also, can you run with --debug instead of --verbose?
When borg gets stuck the CPU usage drops to 0, and then it stays there. I've tried running it with --debug: when the backup starts it prints a bunch of information, saying security: repository checks ok, allowing access before starting to check files. When it gets stuck and I terminate it with ^C, the only error logs regard KeyboardInterrupt and Connection closed by host, which are caused by me killing the backup, but nothing relevant to this issue.
I'll attach here the output logs, in case you want to check them (you can see the ^C when I killed it):
https://hastebin.com/raw/zoxogajuqi
Update: while trying to understand why it got stuck every time on the same 2-3 files, I tried to move the directory containing those files, so during the backup borg wouldn't encounter them (basically I moved the folder to another path). The result was still Borg getting stuck in the same way, just on other files. So at this point I think it's not a matter of specific files, but rather "how many" files (or bytes) it has passed, if it makes sense.
I noticed it gets stuck after ~11150 files processed (out of ~11400), but that number always varies by plus or minus 5 files (still pretty strange it's always in that range - I don't know how this could be happening)
Can you first remove BORG_FILES_CACHE_TTL and then make a small backup of something (just a few files, so it'll work)? Use the same account for the borg client as usual.
Then try again your normal (big) backup.
Your raspi does not have huge amounts of RAM, but the files cache (esp. with a high TTL) might use quite some memory.
By doing what I suggest, you may get rid of quite some memory use because it will remove beyond-TTL entries from the files cache.
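To make the TTL point a bit more concrete, here is a rough sketch in Python of how such an aging mechanism works (simplified and illustrative only, not borg's actual code; the entry structure is hypothetical):

def expire_files_cache(files_cache, seen_paths, ttl):
    # Entries for files visited in this run get their age reset to 0;
    # entries not seen again age by 1 per backup run.
    expired = []
    for path, entry in files_cache.items():
        entry["age"] = 0 if path in seen_paths else entry["age"] + 1
        if entry["age"] > ttl:
            expired.append(path)
    # Beyond-TTL entries are dropped, freeing the memory they used.
    for path in expired:
        del files_cache[path]

So with a TTL of 50 an entry can survive up to 50 runs without the file being seen, which is why a smaller TTL keeps the cache (and its memory use) smaller.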
Also, monitor the RAM / swap usage while borg is running.
Checking CPU temperature of the raspi while borg is running might be also useful.
Do similar on the borg server; 1GB RAM is not much. Check dmesg / syslog there to see whether the OOM killer kills borg.
The hastebin url given above does not work.
I've redone a normal backup (all files and without BORG_FILES_CACHE_TTL) to re-record the debug logs, and in the meantime I recorded some information:
- The logs https://haste.pesaventofilippo.com/raw/kikisareme (this won't go down lol)
- CPU temps on the Raspi never went above 53°C
- Max RAM used by borg on the client side (raspi) was 1% of RAM, or about 37.6MB, with lots to spare (total usage was 74%)
- Max RAM used by borg on the remote side was 3.5%, or about 34.3MB (total usage was 45%)
- Neither dmesg nor syslog reported any memory-related issue, on either the client or the remote
I'll now try to run a small backup as you suggested, and I'll keep both systems monitored
I have now run the small test backup and a normal one, and while the small one did finish, the normal one got stuck as always after the ~11150-file mark. CPU and RAM usage in both runs was pretty much the same as in my comment above, if not identical (however, when the backup gets stuck, CPU usage drops to zero, while RAM usage stays where it was - the same as before).
I also noticed that if you interrupt a backup with ^C while it's running, it usually takes 2 or 3 seconds to stop, because it removes the lock (and probably saves a checkpoint, but I'm not sure). On the other hand, when this backup gets stuck and I interrupt it, it quits immediately without doing any operation... which is strange, but I can't understand why.
Here's logs from both runs, they're pretty similar to the one linked above but I'll attach them in case you need them:
- Small backup: https://haste.pesaventofilippo.com/raw/elaweketum
- Normal backup: https://haste.pesaventofilippo.com/raw/ebetacuquc
OK, guess borg's RAM usage / RAM availability is not the problem here.
I'll copy the interesting parts of the "normal backup" to here, so it is better visible and available:
Processing files ...
^CRemoteRepository: 252 B bytes sent, 2.76 kB bytes received, 6 messages sent
Connection closed by remote host
Traceback (most recent call last):
File "borg/archiver.py", line 177, in wrapper
File "borg/archiver.py", line 598, in do_create
File "borg/archiver.py", line 563, in create_inner
File "borg/archive.py", line 491, in save
File "borg/archive.py", line 285, in flush
File "borg/archive.py", line 301, in write_chunk
File "borg/cache.py", line 916, in add_chunk
File "borg/crypto/key.py", line 370, in encrypt
File "borg/crypto/nonces.py", line 72, in ensure_reservation
File "borg/crypto/nonces.py", line 39, in get_repo_free_nonce
File "borg/remote.py", line 477, in do_rpc
File "borg/remote.py", line 712, in call
File "borg/remote.py", line 807, in call_many
File "borg/helpers.py", line 2274, in handler
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "borg/archiver.py", line 4703, in main
File "borg/archiver.py", line 4635, in run
File "borg/archiver.py", line 177, in wrapper
File "borg/remote.py", line 634, in __exit__
File "borg/remote.py", line 477, in do_rpc
File "borg/remote.py", line 712, in call
File "borg/remote.py", line 814, in call_many
borg.remote.ConnectionClosed: Connection closed by remote host
Platform: Linux raspivault 5.10.52-v7l+ #1441 SMP Tue Aug 3 18:11:56 BST 2021 armv7l
Linux: Unknown Linux
Borg: 1.1.17 Python: CPython 3.8.6 msgpack: 0.5.6.+borg1
PID: 3351 CWD: /home/pi
sys.argv: ['borg', 'create', '--compression', 'none', '--exclude', '*.mp4', '--debug', '--progress', '::{now}', '/mnt/disk-storage/Foto']
SSH_ORIGINAL_COMMAND: None
Considering it already has reached borg.archive:491 (save), it was either at the end (saving the final archive) or at a checkpoint (saving a checkpoint archive, usually every 30 minutes).
While doing that, it ran out of reserved nonces in borg.crypto.nonces:72 (ensure_reservation) and thus needed to get a fresh block of reserved nonces.
While trying to do that, you interrupted it at borg.crypto.nonces:39 (in get_repo_free_nonce), which does a remote procedure call to the server side borg to read the REPO_DIRECTORY/nonce file, which should contain a hex number as plain text.
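For context, here is a rough, simplified sketch of that reservation flow as implied by the traceback (illustrative names and block size, not borg's real code):

class NonceManager:
    RESERVATION_BLOCK = 1024  # hypothetical block size, purely for illustration

    def __init__(self, repository):
        self.repository = repository       # reached via RPC to "borg serve"
        self.end_of_reservation = None     # highest nonce we may use locally

    def ensure_reservation(self, next_nonce, space_needed):
        # Fast path: the current reservation still covers what encrypt() needs.
        if self.end_of_reservation is not None and \
                next_nonce + space_needed <= self.end_of_reservation:
            return next_nonce
        # Slow path: ask the server for its free nonce. Server-side this reads
        # the plain-text hex value in REPO_DIR/nonce. This remote call is the
        # point where the stuck client was sitting when it was interrupted.
        free_nonce = self.repository.get_free_nonce() or 0
        start = max(free_nonce, next_nonce)
        self.end_of_reservation = start + self.RESERVATION_BLOCK
        self.repository.commit_nonce_reservation(self.end_of_reservation, free_nonce)
        return start

In other words, when you hit ^C the client was inside a remote procedure call, apparently waiting for an answer from the server-side borg.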
Can you please do a ls -l and cat on that file?
Also remarkable is the borg.remote.ConnectionClosed: Connection closed by remote host, but I am not sure whether that is:
- the reason why it was stuck
- a consequence of you hitting Ctrl-C
It's very strange that you mention it was either at the end or saving a checkpoint: I'm pretty sure it doesn't reach the end, because there are at least 200 files left when the backup gets stuck... but at the same time it's very unlikely it's doing a checkpoint save, given that it only takes around 5 minutes to reach the "stuck" point (because I had already done a full backup, so the majority of the time it's just skipping files).
Here's the repo directory:
pesav@vps-alpha:/backups/raspivault$ ls -la
total 1316
drwxr-xr-x 3 pesav pesav 4096 Aug 23 11:54 .
drwxr-xr-x 4 root root 4096 Aug 18 14:10 ..
-rw------- 1 pesav pesav 73 Aug 18 14:13 README
-rw------- 1 pesav pesav 700 Aug 18 14:13 config
drwx------ 3 pesav pesav 4096 Aug 18 14:13 data
-rw------- 1 pesav pesav 640 Aug 23 09:36 hints.292
-rw------- 1 pesav pesav 1310858 Aug 23 09:36 index.292
-rw------- 1 pesav pesav 190 Aug 23 09:36 integrity.292
-rw------- 1 pesav pesav 16 Aug 23 09:36 nonce
pesav@vps-alpha:/backups/raspivault$ cat nonce
00000000900d3c0f
I'm also unsure about the cause of the ConnectionClosed exception: I would think that because I interrupted the process with ^C, the SSH connection to the remote gets closed, interrupting borg serve and thus terminating the connection.
But it's also possible the reason it gets stuck in the first place is because it loses the connection, and only "realises" it after I terminate it... 🤷🏻♂️
200 files can usually be backed up in well under a second if most of them are unmodified.
The nonce file / permissions looks good.
New update: sorry if I haven't been active recently, but I started school and I am quite busy.
Since I needed that repo and the backup kept failing, I tried deleting the repo completely (with rm -r /backups/raspivault) and the borg cache in ~/.cache/borg, and it worked for a while - until a few days ago, when the same problem started happening again.
I'm now quite sure the backup hangs just after it finishes transferring (or checking) the last file, and for whatever reason I think the client doesn't "realize" the backup has finished, so it never concludes.
Do you have any idea about what could be the cause and/or how to avoid this behaviour? For now I can delete and recreate the repo again, but it's obviously not a long-term solution.
No idea, sorry. Usually that all works (not just for me, but for a lot of people).
So, I can only guess: maybe your raspi or its configuration has some problem; can you try another target?
You could for example try borgbase.com (easiest), rsync.net or hetzner storage box.
Thanks, I'll try using borgbase.com (only backing up a subset of files, because of the 10GB free limit), and I'll keep you updated on whether the problem also occurs with that.
By chance, does borg cache disk information such as the GPT UUID or partition UUID of the source disk? Last night I recreated the partition table of the disk which holds my data (the original data, not the backup repo), and when the backup ran it hung just before finishing, as always... I wonder if that's what is causing the problem or if it's just a coincidence.
borg works based on files, so it does not care about the partition / GPT UUID.
What it does care about is the file's inode number though, so if you copy files around, the inode numbers will change.
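As a rough illustration (simplified, not borg's actual implementation; the real files cache compares ctime by default), the files cache decision is essentially an attribute comparison like this, so anything that gives files new inode numbers makes them look "modified" again:

import os

files_cache = {}  # path -> (inode, size, mtime_ns) remembered from the previous run

def is_unmodified(path):
    # A file is only skipped if inode, size and timestamp all still match.
    st = os.stat(path)
    return files_cache.get(path) == (st.st_ino, st.st_size, st.st_mtime_ns)

def remember(path):
    st = os.stat(path)
    files_cache[path] = (st.st_ino, st.st_size, st.st_mtime_ns)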
That could be a reason, but then shouldn't all backups (even local ones) fail in the same way? At this point I think something in my setup is wrong, like you said, but I can't find the problem (I deleted and recreated the repos many times, and other than that the remote is a normal ssh target, so nothing special).
I think I'll work around the problem by doing the backup with borg to a local repository, and then using something like rclone to send it to my VPS... not an optimal solution, because it doesn't take advantage of borg on the VPS and it needs a local copy of the repo, but at least it should work as expected.
Just wanted to chip in, I think I just experienced the same (a very similar?) issue.
Setup looks a little different for me:
Client (borg 1.2.0): Win10 running borg in WSL2, holding the repokey, pushing to a...
Remote (borg 1.1.16): RasPi (same LAN) with a USB HDD, NTFS mounted using ntfs3 (not ntfs-3g) [1].
I was struggling for a couple of days with borg getting stuck. I first noticed it with a borg info call that would just hang – consistently. Just like for @pesaventofilippo, security: repository checks ok, allowing access was the last debug output. Some calls were issued by borgmatic, but I'd be surprised if that matters.
Then the same happened a couple of hours into a borg check (it's a fairly large repo, >1TB).
The symptom was always borg serve getting stuck at 0% CPU with normal RAM usage, and htop showed it hanging disk-bound (process state: D). For me, it did not terminate when the client terminated; I had to kill -9 it – which I suspect might have contributed to the situation. Another interesting observation: even though the long borg check run was supposedly stuck disk-bound, I noticed that the USB HDD had actually gone into standby, so no actual data was being read.
After many attempts getting stuck in the same manner, I decided to reboot the remote, i.e. the Pi. The next borg info run then didn't hang anymore, CPU usage didn't drop to zero, and after several minutes it returned successfully. All subsequent runs went much faster again afterwards.
Interestingly, the first run reported RemoteRepository: 2.69 MB bytes sent, 2.37 GB bytes received, 50819 messages sent, which I suspect explains the delay, while the subsequent run had to pull much less data: RemoteRepository: 35.23 kB bytes sent, 37.49 MB bytes received, 671 messages sent.
I hope I'm not stating anything that should be obvious, but this behavior is at least not transparent to me. Posting this for posterity, in case anyone else is experiencing similar issues.
[1]: There is some possibility that this is not actually the fault of borg, but of the ntfs3 driver, especially considering that the process is getting stuck in D state. Hard for me to determine, though.
I'm experiencing the same issue. borg break-lock didn't help. Whenever I cancel borg create after it does nothing for many hours, I see AssertionError: cleanup happened in Repository.__del__. It's not the easiest error to troubleshoot, because otherwise there is absolutely no error output. But, as others have mentioned, borg uses basically no CPU or memory, and its total processing time over many hours is in the milliseconds or very low seconds.
Unfortunately, I'm not in a place where I can try rebooting this machine at the moment (I'm accessing it remotely). But, I can try that in a week or so.
@matthuszagh That AssertionError frequently happens during borg shutdown and likely is not the root cause of your issue.
FYI rebooting the computer does appear to have resolved the issue.