borg borg create (1.2.4) hangs on random files, no more activity nor IO, then get disconnected from remote. No issue when using rsync or netcat

Have you checked borgbackup docs, FAQ, and open GitHub issues?

A lot

Is this a BUG / ISSUE report or a QUESTION?

a BUG. And if I use rsync, there's no issue, so I guess the issue comes from borg, not from my network stack or routers across the network.

The filesystem is clean, badblocks and e2fsck passed, no error.

System information. For client/server mode post info for both machines.

borg version

1.2.4 both sides

Operating system (distribution) and version.

Debian 12 both sides

Hardware / network configuration, and filesystems used.

client

Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz 8 cores
16GiB RAM
MTU 1500
10 MiB/S network speed
ext4

server

Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz 8 cores
32GiB RAM
MTU 1500
10 MiB/S network speed
ext4

How much data is handled by borg?

around 500GiB

Full borg commandline that lead to the problem (leave away excludes and passwords)

borg create -v -s -p --compression lz4 --lock-wait 6 root@backup:/home/borg/foobar/::2023-07-13_0421 /home /usr/local/bin /root /etc /opt

Describe the problem you're observing.

borg hangs randomly everytime I run a backup.

the prompt hang on a file
then, no more IO activity detected in htop
strace on borg pid show tons of pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
finally, the remote server kill the connection after around 10mn
the strace show that borg close file descriptors

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Everytimes, yes. Just need to run borg create ...

Include any warning/errors/backtraces from the system logs

strace when borg hangs:

[...]
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=250787020})
read(7, "Timeout, server <IP> not responding.\r\n", 32768) = 45
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1706913, si_uid=0, si_status=255, si_utime=69 /* 0.69 s */, si_stime=21 /* 0.21 s */} ---
write(2, "Remote: Timeout, server <IP> not responding.\r\n", 53) = 53
pselect6(8, [5 7], [4], [4 5 7], {tv_sec=1, tv_nsec=0}, NULL) = 3 (in [5 7], out [4], left {tv_sec=0, tv_nsec=999995357})
read(5, "", 10485760)                   = 0
close(14)                               = 0
close(13)                               = 0
close(12)                               = 0
close(11)                               = 0
close(10)                               = 0
close(9)                                = 0
close(8)                                = 0
close(6)                                = 0
close(3)                                = 0

Jul 13 '23 02:07 sputnick-dev

Can you reproduce the problem without using ssh.

Jul 13 '23 06:07 infectormp

Using tar -cf - /path/to/dir | nc -w 3 <IP> <PORT> I can't reproduce the issue, no. Backup OK in one pass.

Edit: with 2.0.0b6, the first time (new repository) it's OK, but then next time it hangs like 1.2.4. It's a bit better than 1.2.4, because with the later, even the first time it's a fail.

Jul 13 '23 18:07 sputnick-dev

Try to reproduce the issue with a local repository, to determine whether there is a general problem with borg or just with borg client/server (and thus: likely the network connection or ssh being the problem).

Jul 20 '23 14:07 ThomasWaldmann

I only changed backup server, never had issues in the past like this, so the client doesn't have any changes from the past
sometimes, the problem appears not on the upload but when the remote borg create the archive at the end
There's other backups from other clients with bigger size to upload on the same borg server with no issue at all
discovered that borg create don't like the big Gitlab backup tar files of 9GiB. It's better compressed as tar.gz

Anyway, tested on local repository:

I can't reproduce the issue, even multiple times passes.

Jul 20 '23 16:07 sputnick-dev

OK, then it could be some network issue between that client and the server maybe?

Jul 20 '23 16:07 ThomasWaldmann

Anything in mind to check? As I said, no issue with rsync nor netcat. And mtr report no packet loss. But I have a few errors from my interface: RX errors 2

Jul 20 '23 16:07 sputnick-dev

I heard that some network equipment, like routers like to teardown a connection in case there is some pause in the transmission.

Or, in general, just some defective piece of HW, like a NIC, cable, switch.

In case the server runs low on RAM, that could maybe also cause a bigger pause, when it starts swapping heavily. Or even kill some process due to the OOM killer.

Jul 20 '23 16:07 ThomasWaldmann

I heard that some network equipment, like routers like to teardown a connection in case there is some pause in the transmission.

Or, in general, just some defective piece of HW, like a NIC, cable, switch.

In this case, rsync should be affected, no?

In case the server runs low on RAM, that could maybe also cause a bigger pause, when it starts swapping heavily. Or even kill some process due to the OOM killer.

As I said, it's a Xeon with 8 cores and 32G RAM, I can affirm there's no high load and just a bit of swap usage. There's just a few services now running.

This is my working workaround (a bit messy) :

#!/bin/bash     

export REPOSITORY=root@backup:/home/borg/foobar/
export BORG_RSH='ssh -i /root/.ssh/id_rsa'
export OPTIONS='-v -s -p --compression auto,lz4 --exclude-caches'
   
borg break-lock $REPOSITORY
borg compact $REPOSITORY

until borg create --debug $OPTIONS $REPOSITORY::$(date +%Y-%m-%d_%H%M) /home /usr/local/bin /root /etc /var /opt; do
    backup_exit=$?

    sleep 600

    borg break-lock $REPOSITORY
done

borg prune -v $REPOSITORY --keep-daily=7 --keep-weekly=4 --keep-monthly=12 --keep-yearly=1

exit $backup_exit

Jul 20 '23 16:07 sputnick-dev

Automated usage of break-lock should be avoided. One must make sure there is no borg running when calling that.

Jul 20 '23 17:07 ThomasWaldmann

This is due to the fact that I need to test again and again. But it's not the root cause of anything. There's only one client in this repository.

Jul 20 '23 17:07 sputnick-dev

I had a similar problem: Hangs (just periodic pselect in strace) and also one explicit disconnect. It was not possible to complete my backups (~40GB). Adding the following to /etc/ssh/ssh_config helped (on the side that opens the ssh connection):

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 5
    TCPKeepAlive yes

@ThomasWaldmann There is a FAQ entry "Why do I get “connection closed by remote” after a while?" that says "That’s a good question and we are trying to find a good answer in #636." #636 is closed and the above ssh settings can be found there. It might be helpful to document these settings/rewrite that FAQ entry since they seem to be important for remote backups. I can provide a PR if wanted.

Oct 18 '24 16:10 mrichtarsky

@mrichtarsky yes, an update is needed there.

The closing was due to this (so, guess we have all needed in our docs): https://github.com/borgbackup/borg/pull/4503/files

So, the faq can be updated to point to that part of the documentation as a potential solution for "hangs" and maybe avoiding the "connection closed by remote" by keeping the connection alive in phases when borg is not actively transmitting data.

If there is a real connection issue (like nothing can get through), the main function of ServerAlive/ClientAlive is rather to terminate a non-working connection though.

PR against master please, then backport to 1.4-maint and 1.2-maint after it is merged. Thanks for helping!

Oct 18 '24 16:10 ThomasWaldmann

Hmm, there are 3 faq entries dealing with this. Maybe they all should be consolidated and point to borg serve docs.

https://borgbackup.readthedocs.io/en/stable/faq.html#why-does-borg-extract-hang-after-some-time

Oct 18 '24 16:10 ThomasWaldmann