
Multiple Connections/Streams

Slind14 opened this issue 3 years ago • 18 comments

Are there any plans for supporting multiple concurrent connections for the data transfer? Or is this already possible somehow?

Doing backups across > 1G networks is quite slow due to the bottleneck of a single connection. For cross-continent backups it can be even worse: a single connection can't saturate a 1G link and sits at 200M max.

Slind14 avatar Jun 26 '22 11:06 Slind14

You can run multiple borg processes in parallel, backing up to one repo per process.
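
for example, a rough sketch (repo urls and paths are just placeholders):

```bash
# one borg process per repo, run concurrently
borg create ssh://backup.example.com/./repo-a::{now} /data/part-a &
borg create ssh://backup.example.com/./repo-b::{now} /data/part-b &
wait
```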

ThomasWaldmann avatar Jun 26 '22 12:06 ThomasWaldmann

Hi Thomas,

we use it to backup data from a data warehouse. We can't split the data across multiple repos without losing consistency I'm afraid. Is there another option?

Slind14 avatar Jun 26 '22 12:06 Slind14

no. not being able to saturate your connection with 1 borg likely comes from internal processing being single-threaded and not internally queued.

but not sure how you ensure consistency. if you used a snapshot to get consistency, you could also run multiple borg to save the snapshot.
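
e.g. a hypothetical sketch using an LVM snapshot (volume names and paths are made up):

```bash
# take ONE consistent snapshot, then back up two subtrees of it in parallel
lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data
mkdir -p /mnt/snap
mount -o ro /dev/vg0/data-snap /mnt/snap
borg create ssh://backup.example.com/./repo-a::{now} /mnt/snap/part-a &
borg create ssh://backup.example.com/./repo-b::{now} /mnt/snap/part-b &
wait
umount /mnt/snap
lvremove -y /dev/vg0/data-snap
```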

ThomasWaldmann avatar Jun 26 '22 13:06 ThomasWaldmann

Is this the first backup you are doing or is there already data in the repo from previous backups?

ThomasWaldmann avatar Jun 26 '22 13:06 ThomasWaldmann

it is not the first backup, we just got to the point where the backups can't complete within a day anymore.

When we use iperf3 to measure the bandwidth, we can see that a single connection only gets 100-200M while multiple connections get > 900M.
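
For reference, the comparison was done with commands along these lines (hostname is a placeholder):

```bash
# single TCP stream
iperf3 -c backup.example.com
# 8 parallel streams
iperf3 -c backup.example.com -P 8
```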

For data centers that are not on the other side of the world, we get higher bandwidth on a single connection, so I doubt it is borg directly. Btw. borg CPU usage always sits at 10-20% of one core while uploading; only when saving the file cache does it go to 100% while bandwidth drops to 0. The files are also quite large (multiple GB).


We do have a hardlink-based snapshot. How would we run multiple borg processes and ensure that they are not cannibalizing each other and also that we end up with a consistent backup?

Slind14 avatar Jun 26 '22 13:06 Slind14

borg manages caching, indexes and locking based on the repo id (which is unique and random). so you can run borg on the same machine, as the same user, at the same time IF you use different repos.

so you could partition your input data set and give each part to another borg.

ThomasWaldmann avatar Jun 26 '22 13:06 ThomasWaldmann

also wondering why a not-first backup takes that long. does the dedup not work or is it really lots of NEW data?

ThomasWaldmann avatar Jun 26 '22 13:06 ThomasWaldmann

> also wondering why a not-first backup takes that long. does the dedup not work or is it really lots of NEW data?

There is more new data than 100 MBit/s can transfer.

Slind14 avatar Jun 26 '22 13:06 Slind14

> borg manages caching, indexes and locking based on the repo id (which is unique and random). so you can run borg on the same machine, as the same user, at the same time IF you use different repos.
>
> so you could partition your input data set and give each part to another borg.

Unfortunately, partitioning is not possible with the way the data is stored. 90% is under the same directory, spread across around one million files.

Slind14 avatar Jun 26 '22 13:06 Slind14

ok.

iirc there is some --upload-buffer (or so) option, maybe you can try using that to speed it up.

do you use some fast compression (default is lz4; zstd,1 .. zstd,3 would also work i guess)?
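
a rough sketch combining both (the values are just a starting point, see borg help create):

```bash
# buffer remote uploads (size in MiB, borg 1.2+) and use light zstd compression
borg create --upload-buffer 100 --compression zstd,3 \
    ssh://backup.example.com/./repo::{now} /data
```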

ThomasWaldmann avatar Jun 26 '22 13:06 ThomasWaldmann

another idea is not to use different repo for partitions of the data, but for different times.

not pretty, but would work: use a different repo depending on weekday.
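
something like this (untested sketch, repo url is a placeholder):

```bash
# rotate through seven repos, one per weekday
# %a gives the abbreviated weekday name (locale-dependent), e.g. repo-Mon
REPO="ssh://backup.example.com/./repo-$(date +%a)"
borg create "$REPO::{now}" /data
```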

ThomasWaldmann avatar Jun 26 '22 13:06 ThomasWaldmann

> iirc there is some --upload-buffer (or so) option, maybe you can try using that to speed it up.

the data is already compressed, hence we don't use any compression


Are there any plans to support multi-connection uploads? Would it be a major change or something simple?

Slind14 avatar Jun 26 '22 14:06 Slind14

> another idea is not to use different repo for partitions of the data, but for different times.

the majority of the new data is from the last 24 hours :( it is all in the same place - not really possible to split.

Slind14 avatar Jun 26 '22 14:06 Slind14

--upload-buffer is about buffering, not compression.

ThomasWaldmann avatar Jun 26 '22 14:06 ThomasWaldmann

> --upload-buffer is about buffering, not compression.

Sorry, I quoted the wrong line. ;)

Slind14 avatar Jun 26 '22 14:06 Slind14

Unfortunately, changing the buffer does not help.

Restic added parallel uploads not too long ago; if borg had something similar, it would be great.

https://github.com/restic/restic/pull/3593 https://github.com/restic/restic/pull/3513

Slind14 avatar Jun 26 '22 21:06 Slind14

with the current backend structure, multi-connection uploads are not sensibly possible, as the log-structured store is not concurrent and the encryption scheme is also not yet prepared for such a scenario

i would imagine that a major refactor would be necessary to support them

RonnyPfannschmidt avatar Jun 27 '22 05:06 RonnyPfannschmidt

I see, thank you.

Slind14 avatar Jun 27 '22 09:06 Slind14