borg create (1.2.4) hangs on random files, no more activity nor IO, then gets disconnected from the remote. No issue when using rsync or netcat
Have you checked borgbackup docs, FAQ, and open GitHub issues?
A lot
Is this a BUG / ISSUE report or a QUESTION?
A BUG. If I use rsync there's no issue, so I guess the issue comes from borg, not from my network stack or the routers along the path.
The filesystem is clean; badblocks and e2fsck both passed with no errors.
System information. For client/server mode post info for both machines.
borg version
1.2.4 both sides
Operating system (distribution) and version.
Debian 12 both sides
Hardware / network configuration, and filesystems used.
client
- Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz 8 cores
- 16GiB RAM
- MTU 1500
- 10 MiB/s network speed
- ext4
server
- Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz 8 cores
- 32GiB RAM
- MTU 1500
- 10 MiB/s network speed
- ext4
How much data is handled by borg?
around 500GiB
Full borg command line that led to the problem (leave out excludes and passwords)
borg create -v -s -p --compression lz4 --lock-wait 6 root@backup:/home/borg/foobar/::2023-07-13_0421 /home /usr/local/bin /root /etc /opt
Describe the problem you're observing.
borg hangs randomly every time I run a backup:
- the prompt hangs on a file
- then no more I/O activity is visible in htop
- strace on the borg pid shows tons of pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout) (see the strace invocation sketched below)
- finally, the remote server kills the connection after around 10 minutes
- the strace shows that borg closes its file descriptors
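For reference, a rough sketch of how such a hang can be observed; the pgrep-based PID lookup is an assumption and presumes a single borg client process:
# attach to the running borg client and follow its syscalls
strace -f -p "$(pgrep -f 'borg create' | head -n1)"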
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Every time, yes. I just need to run borg create ...
Include any warning/errors/backtraces from the system logs
strace when borg hangs:
[...]
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=250787020})
read(7, "Timeout, server <IP> not responding.\r\n", 32768) = 45
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1706913, si_uid=0, si_status=255, si_utime=69 /* 0.69 s */, si_stime=21 /* 0.21 s */} ---
write(2, "Remote: Timeout, server <IP> not responding.\r\n", 53) = 53
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 3 (in [5 7], out [4], left {tv_sec=0, tv_nsec=999995357})
read(5, "", 10485760) = 0
close(14) = 0
close(13) = 0
close(12) = 0
close(11) = 0
close(10) = 0
close(9) = 0
close(8) = 0
close(6) = 0
close(3) = 0
Can you reproduce the problem without using ssh?
Using tar -cf - /path/to/dir | nc -w 3 <IP> <PORT> I can't reproduce the issue, no. The backup completes fine in one pass.
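For reference, a rough sketch of that raw-throughput test; port 9999 and the path are placeholders, and with the openbsd netcat variant the listener syntax is nc -l 9999 instead:
# on the server, discard whatever arrives on the port
nc -l -p 9999 > /dev/null
# on the client, stream a tar of the data to the server
tar -cf - /path/to/dir | nc -w 3 <IP> 9999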
Edit: with 2.0.0b6, the first run (new repository) is OK, but the next runs hang just like 1.2.4. That's a bit better than 1.2.4, where even the first run fails.
Try to reproduce the issue with a local repository, to determine whether there is a general problem with borg or just with borg client/server (and thus: likely the network connection or ssh being the problem).
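For example, a local-repository test could look roughly like this; the path and encryption mode are just placeholders:
# create a throwaway local repository and back up the same data into it
borg init --encryption=none /srv/borg-local-test
borg create -v -s -p --compression lz4 /srv/borg-local-test::$(date +%Y-%m-%d_%H%M) /home /usr/local/bin /root /etc /opt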
- I only changed the backup server and never had issues like this in the past, so nothing changed on the client side
- sometimes the problem appears not during the upload but when the remote borg creates the archive at the end
- there are other backups from other clients, with more data to upload, going to the same borg server with no issue at all
- I discovered that borg create doesn't like the big GitLab backup tar files of 9GiB; they work better compressed as tar.gz

Anyway, tested on a local repository:
I can't reproduce the issue, even across multiple runs.
OK, then it could be some network issue between that client and the server maybe?
Anything in mind to check? As I said, there's no issue with rsync or netcat, and mtr reports no packet loss. But I do see a few errors on my interface: RX errors 2
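A couple of ways to watch those interface error counters; eth0 is a placeholder for the real interface name, and ethtool may need to be installed:
# per-interface RX/TX error and drop counters
ip -s link show dev eth0
# NIC-level statistics, filtered for errors and drops
ethtool -S eth0 | grep -iE 'err|drop'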
I heard that some network equipment, like routers, likes to tear down a connection if there is a pause in the transmission.
Or, in general, just some defective piece of HW, like a NIC, cable, or switch.
If the server runs low on RAM, that could also cause a longer pause when it starts swapping heavily, or even kill some process via the OOM killer.
> I heard that some network equipment, like routers, likes to tear down a connection if there is a pause in the transmission.
> Or, in general, just some defective piece of HW, like a NIC, cable, or switch.

In that case, rsync should also be affected, no?

> If the server runs low on RAM, that could also cause a longer pause when it starts swapping heavily, or even kill some process via the OOM killer.
As I said, it's a Xeon with 8 cores and 32GiB of RAM; I can confirm there's no high load and only a little swap usage. Only a few services are running on it.
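For completeness, a quick sketch of how memory pressure and OOM kills on the server could be ruled out; the commands assume systemd/journald, as on Debian 12:
# current memory and swap usage
free -h
# memory/swap activity over a few seconds
vmstat 1 5
# any OOM-killer activity in the kernel log
journalctl -k | grep -iE 'out of memory|oom'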

This is my working workaround (a bit messy):
#!/bin/bash
export REPOSITORY=root@backup:/home/borg/foobar/
export BORG_RSH='ssh -i /root/.ssh/id_rsa'
# OPTIONS is intentionally left unquoted below so it word-splits into separate arguments
export OPTIONS='-v -s -p --compression auto,lz4 --exclude-caches'

backup_exit=0
borg break-lock "$REPOSITORY"
borg compact "$REPOSITORY"

# retry borg create until it succeeds; keep the exit status of the last failed attempt
until borg create --debug $OPTIONS "$REPOSITORY::$(date +%Y-%m-%d_%H%M)" /home /usr/local/bin /root /etc /var /opt; do
    backup_exit=$?
    sleep 600
    borg break-lock "$REPOSITORY"
done

borg prune -v "$REPOSITORY" --keep-daily=7 --keep-weekly=4 --keep-monthly=12 --keep-yearly=1
exit $backup_exit
Automated usage of break-lock should be avoided. One must make sure there is no borg running when calling that.
That is only there because I need to test again and again; it's not the root cause of anything. There's only one client using this repository.
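For reference, a minimal sketch of guarding that call so the lock is only broken when no other borg process seems to be running; the pgrep pattern is an assumption and may need tuning:
# skip break-lock if another borg process appears to be active on this client
if pgrep -f 'borg (create|prune|compact|check)' > /dev/null; then
    echo "another borg process seems to be running, not breaking the lock" >&2
else
    borg break-lock "$REPOSITORY"
fi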
I had a similar problem: hangs (just the periodic pselect in strace) and also one explicit disconnect. It was not possible to complete my backups (~40GB). Adding the following to /etc/ssh/ssh_config helped (on the side that opens the ssh connection):
Host *
ServerAliveInterval 60
ServerAliveCountMax 5
TCPKeepAlive yes
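The server-side counterparts of these options go into /etc/ssh/sshd_config on the machine running borg serve; the values below are just an example mirroring the client settings above:
ClientAliveInterval 60
ClientAliveCountMax 5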
@ThomasWaldmann There is a FAQ entry "Why do I get “connection closed by remote” after a while?" that says "That’s a good question and we are trying to find a good answer in #636." #636 is closed and the above ssh settings can be found there. It might be helpful to document these settings/rewrite that FAQ entry since they seem to be important for remote backups. I can provide a PR if wanted.
@mrichtarsky yes, an update is needed there.
The closing was due to this (so I guess we have all that's needed in our docs): https://github.com/borgbackup/borg/pull/4503/files
So the FAQ can be updated to point to that part of the documentation as a potential solution for "hangs", and maybe for avoiding the "connection closed by remote" error, by keeping the connection alive during phases when borg is not actively transmitting data.
If there is a real connection issue (like nothing can get through), the main function of ServerAlive/ClientAlive is rather to terminate a non-working connection though.
PR against master please, then backport to 1.4-maint and 1.2-maint after it is merged. Thanks for helping!
Hmm, there are 3 FAQ entries dealing with this. Maybe they should all be consolidated and point to the borg serve docs.
https://borgbackup.readthedocs.io/en/stable/faq.html#why-does-borg-extract-hang-after-some-time