Multithreading
I started some experimental multithreading code there:
https://github.com/thomaswaldmann/borg/tree/multithreading
especially:
https://github.com/ThomasWaldmann/borg/commit/240a27a22774eb055c7bff5097b4d6557bbb81c5
From the Python C API docs: "the standard zlib and hashlib modules release the GIL when compressing or hashing data." So the current compression and hashing should already be fine for good multithreading, as is Python's I/O (the GIL is released before I/O operations).
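A toy illustration of why that helps (not borg code): because hashlib releases the GIL while hashing large buffers, hashing chunks from a plain thread pool can actually use multiple cores:

```python
# Toy example (not borg code): hashlib releases the GIL while hashing,
# so hashing chunks from a thread pool can use multiple CPU cores.
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

chunks = [os.urandom(8 * 1024 * 1024) for _ in range(16)]  # fake 8 MiB chunks

def hash_chunk(data):
    return hashlib.sha256(data).digest()

t0 = time.perf_counter()
serial = [hash_chunk(c) for c in chunks]
t1 = time.perf_counter()

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(hash_chunk, chunks))
t2 = time.perf_counter()

assert serial == threaded
print(f"serial: {t1 - t0:.2f}s  4 threads: {t2 - t1:.2f}s")
```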
Additionally, later changesets implement code to release the GIL there:
- chunker
- crypto
- lz4 compression
:moneybag: there is a bounty for this
About the question in the commit comment about the longer "user time" when running multithreaded compared to single-threaded:
The answer might be that this was a dual-core CPU with hyperthreading. While that looks like 4 cores to the OS (and Python), when all 4 are really in use, each of them behaves like a somewhat slower CPU than a single core running alone, which might explain the higher "user time".
It was still quite a bit faster on the wall clock, so no need to worry.
Some things are TODO there:
- remote repo tests are broken (and thus currently disabled)
- that delayer thread is strange, better solution?
- likely AES counter uniqueness is broken
- needs way more analysis / optimization
- fix the tests, they are quite broken / hang right now
Also, the bounty should be higher, this is quite a lot of work.
Just an idea
Traditional multi-threading (shared data, locking, queuing etc.) is often neither simple to develop nor test. I have been using ZeroMQ to effectively multi-thread things in a few situations now, and feel that it makes things much easier with essentially no relevant overhead (for inproc://), especially when using zerocopy.
I understand that in borg files are essentially processed by a pipeline of algorithms:
reading => chunking => hashing / encrypting => storage
This might lend itself well to the pipeline pattern in zmq, e.g. every one of these stages is handled by one or more nodes (=threads). Work distribution, queuing etc. are handled by zmq transparently.
This zmq approach to multithreading is not unlike microservices. It would probably mean extensive refactoring for Borg. OTOH it would lead to Borg scaling very well with available CPU resources.
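For illustration only, a very rough sketch of the zmq pipeline pattern with pyzmq over inproc:// (endpoint names and the trivial "work" are made up, shutdown is simplified, and this is not borg code):

```python
# Rough sketch of a zmq PUSH/PULL pipeline over inproc:// (not borg code).
import threading
import zmq

ctx = zmq.Context.instance()

def worker():
    recv = ctx.socket(zmq.PULL)      # work from the distributing stage
    recv.connect("inproc://chunks")
    send = ctx.socket(zmq.PUSH)      # results to the collecting stage
    send.connect("inproc://results")
    while True:
        chunk = recv.recv()
        if chunk == b"":             # empty message = simplified shutdown
            break
        send.send(chunk.upper())     # stand-in for hashing/compressing/encrypting
    recv.close()
    send.close()

# bind the inproc endpoints before the workers connect to them
source = ctx.socket(zmq.PUSH)
source.bind("inproc://chunks")
sink = ctx.socket(zmq.PULL)
sink.bind("inproc://results")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

jobs = [b"chunk-%d" % i for i in range(100)]
for job in jobs:
    source.send(job)                 # zmq load-balances over the connected workers

results = [sink.recv() for _ in jobs]
print(len(results), "results collected")

for _ in threads:
    source.send(b"")                 # one shutdown message per worker
for t in threads:
    t.join()
source.close()
sink.close()
ctx.term()
```

Work distribution and queuing are handled by zmq; swapping inproc:// for tcp:// would even allow moving a stage to another process or host.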
Interesting idea. Can you say what the pros of this approach are compared to Python's Queue module (which is in stdlib and threadsafe)? The Queue module is what the multithreading code uses currently.
Ah I see how you approached it, it is conceptually already very close. I also see the "delayer problem" now.
On a technical level the zmq approach is language agnostic, so it can be straightforward to implement entire stages in C with no Python/GIL involvement at all (and still zero-copy: a separate, pure C thread runs a node in the same process, so inproc:// can be used between the Python and C parts without ever copying the data).
Regarding compression: we are using https://github.com/madler/pigz for parallel compression of petabytes of archives. Maybe it could help here. We never had an issue - and the performance on multicore systems is awesome!
@schuft69 please see the compression ticket for why this is not that simple in borg's case.
Each chunk (0 bytes ... 8 MB) is compressed independently, so parallel compressors like pigz tend not to help there; instead we'll run the separate stages (hashing, encryption, compression) in parallel, and maybe multiple operations of the same stage in parallel as well (e.g. compressing multiple chunks independently, using independent compressors in parallel).
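A small illustration of the "independent compressors in parallel" idea, using stdlib zlib as a stand-in for borg's own compressors (not borg code):

```python
# Illustration only: every chunk gets its own compressor object, so chunks
# can be compressed independently and in parallel (zlib releases the GIL).
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

chunks = [os.urandom(1024) * 512 for _ in range(32)]   # fake ~512 KiB chunks

def compress_chunk(data, level=6):
    comp = zlib.compressobj(level)                      # independent compressor
    return comp.compress(data) + comp.flush()

with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(compress_chunk, chunks))

# same result as a serial loop, because every chunk is compressed on its own
# (no shared compression state / dictionary across chunks)
assert compressed == [compress_chunk(c) for c in chunks]
```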
We're working on drafts regarding this topic; we will put more work into this once 1.1 reaches freeze / RC status.
The multithreading branch hasn't been updated since March 2016, about 1.25 years ago, and is missing about half of the commits in the current HEAD. How much of that approach remains viable?
The lack of MT support makes borgbackup basically unusable for me. Even basic support would make things much more bearable (for example, compressing multiple chunks simultaneously).
I'd like to make contributions to help, but I'm not sure where to start right now. What's the current status on MT draft/direction?
@sjuxax that was just an experiment and should not be used.
real, better working MT is planned for borg 1.2.
coding work on 1.2 has not started yet, we are still finishing 1.1. there is some planning, see milestones and tickets here.
The multithreading branch was never intended to be merged; it was one of what are now several tests. The Borg 1.2 entry in the (project management) wiki might be of interest. Some further tests and planning have been conducted, but partly not published yet (some of that material is not in English and has to be translated / rewritten first).
The current plan (this is from March) looks basically like this:

[architecture diagram]

Grey boxes are individual actors / threads. Arrows indicate channels: orange ones are queue-like, green is RPC, violet is queue-like for metadata, and blue is the same for errors.
@sjuxax btw, why is it unusable without multithreading for you?
I'm mainly interested in using it to back up large, frequently-changing targets. Even if a job would normally complete without wrapping around on itself (i.e., without the prior job still running when the next interval starts), it's not worth the extra cost and risk of keeping the backup running in the background for much longer than other backup tools would take to complete the job.
If I was backing up around 10G of stuff, it probably wouldn't be a big deal. I'm interested in backing up things that are much larger, almost always >= 100G with many small files and some very large files (5-10G+). I have not tested lately, but back in April when I last tested, I definitely felt it was too slow for regular use.
As an example, I'm backing up a 5T filesystem right now (from disks that like to choke every few hours). While I recognize other limitations in borgbackup may make a 5T backup implausible, it would be convenient to be able to use it for more common tasks in the 100G-500G range, but without multithreading, it's just too slow for me (especially the compression stage; I'm not encrypting within borgbackup).
I've tried several things over the last little bit, including SquashFS, fsarchiver, and others. Right now I am using ZFS on loopback with compression and deduplication enabled and rsyncing the tree over. That seems to be working the best, and it makes it easy to resume the process when the disks hang.
5T works quite well; the largest backup set I know of that was "publicly admitted" to use Borg is >40T. What's clear of course is that Borg 1.0 and 1.1 have fundamental limitations on their processing speed... whether this is a nasty problem for a big backup set has more to do with its workload and less with the sheer size of it. Fast-changing sets require the most processing, while other workloads are usually less problematic. Files that don't change are faster to process than files that do; bigger files are slower than smaller files (<= one chunk), etc.
But yeah, I'm aware of these limitations and I'll / we'll try to address them with 1.2.
Well, you need to consider the amount of change between backups. If it is too slow to backup the changes, yeah, then you have a problem.
The total amount (as long as it is within the limitations) doesn't matter much; borg skips unchanged files rather quickly (be careful whether inode numbers are stable for your fs, and always mount at the same mountpoint).
Assuming you have enough backup space (after considering how much you save due to historical dedup), you can use lz4 compression (you won't need multiple threads then to get fast compression).
Encryption is also fast, IF you have AES-NI.
It's not perfect yet (thus our plans for 1.2), but quite ok - not only for small backups.
Here's my prototype from March which roughly corresponds to the current plan.
Easy refactorings like FSOProcessors and MetadataCollector seem prudent to me for a start.
hmm, guess it makes more sense if you do a PR with that and I review it. And you could finish reviewing the crypto-aead stuff, so we can merge it first, so I don't need to rebase it again / fix conflicts again.
btw, the branch has only a few changesets with huge changes. I imagined a somewhat smaller-steps / lower-risk approach.
Regarding large backups (in size and/or number of files), here is my use case.
In the past, I only used BorgBackup for not-very-busy servers. But recently I decided to try to back up a CI server. With lots of node_modules/ (Node.js) and vendor/ (Go) directories, the server has tons of small files.
The initial backup (~42 GiB) took only 42 minutes. That's about 1 GiB per minute. Considering the large number of files and network transfer time (I'm backing up from a data center near San Jose, CA to AWS us-east-1 N. Virginia region over SSH), it's quite impressive.
Here is the most recent backup:
Time (start): Tue, 2018-03-20 01:46:24
Time (end): Tue, 2018-03-20 02:06:34
Duration: 20 minutes 9.91 seconds
Number of files: 6134998
Utilization of max. archive size: 2%
Even single-threaded, the speed is acceptable. However, the fact that only one of the CI server's many CPU cores reaches 100% usage during the backup makes me think "what a waste of CPU cores" :-)
What is the current state of this issue?
With borg create --compression auto,zstd,3 on modern hardware (SSD), the backup speed is limited by the speed of one CPU core. It could be much faster on multiple cores.
See the milestones. Next release (1.2) will not yet have it.
It's quite a lot of work to restructure borg for this, better funding would be desirable also.
Any news on this?
No.
Do you need help on this issue?
work on this (besides planning, experiments) has not started yet and likely some crypto stuff needs to be solved first (e.g. avoiding AES counter mode issues).
OK sure.
Anyways here is my idea: The chunker should fill multiple chunk queues, one for each thread we want to spawn. Then we can process each queue with a separate thread. This way we can parallelize the whole pipeline (hashing, cache lookup, compressing and maybe even encrypting).
This would require the cache to be thread-safe or read-only, and will probably result in RAM demands scaling with the number of threads.
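A rough sketch of that idea with stdlib queue.Queue (thread-safe, bounded so RAM usage stays limited); cache lookups and encryption are left out, all names are made up, and this is not borg code:

```python
# Sketch: the chunker thread round-robins chunks over per-worker queues,
# each worker hashes + compresses its chunks (cache/encryption omitted).
import hashlib
import os
import queue
import threading
import zlib

NUM_WORKERS = 4
SENTINEL = None

def worker(q, results):
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        idx, data = item
        chunk_id = hashlib.sha256(data).digest()
        results[idx] = (chunk_id, zlib.compress(data))   # each index written once

def chunker(source_chunks, queues):
    # round-robin distribution; a real implementation might pick the
    # shortest queue instead, to balance uneven chunk sizes
    for idx, data in enumerate(source_chunks):
        queues[idx % len(queues)].put((idx, data))
    for q in queues:
        q.put(SENTINEL)

source = [os.urandom(64 * 1024) for _ in range(100)]          # fake chunk stream
results = [None] * len(source)                                # keeps original order
queues = [queue.Queue(maxsize=8) for _ in range(NUM_WORKERS)] # bounded = bounded RAM

workers = [threading.Thread(target=worker, args=(q, results)) for q in queues]
for w in workers:
    w.start()
chunker(source, queues)
for w in workers:
    w.join()
print("processed", len(results), "chunks")
```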
There is quite a nice bounty on this! I'd like to cash in on it. Can anyone that's been active for a while summarize what needs to be done? I have some quarantine time on my hands I would like to monetize :P
Added some more cash to the bounty. Hope we can get this feature soon!
I'm ready to start work on it. I've sent Borg an email and am waiting for a confirmation, but have yet to receive it.
I'm not TW (obviously) but my impression was that this issue ("multithreading") is mostly an epic "umbrella" issue. There are a lot of nuances involved (and probably some larger changes to borg's internal architecture).
I'm pretty sure the current bounty (USD 1500) is very very low compared to the regular rates of a professional software developer who needs to spend time on this (otherwise @ThomasWaldmann and others would have solved this a long time ago).
From what I can see #929 is a better place to start. The first step would be to define realistic intermediate goals (there are several operations which could benefit from using multiple CPU cores) and agree on ways to measure the impact.
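One possible (toy) way to get a first feel for where the CPU time goes per stage, before deciding which stage to parallelize first; stdlib only, not borg code, and real measurements should of course use borg itself on representative data:

```python
# Toy per-stage timing: how fast are hashing and compression on this machine?
import hashlib
import os
import time
import zlib

data = [os.urandom(1024) * 1024 for _ in range(32)]   # 32 compressible ~1 MiB chunks

def timed(name, fn):
    t0 = time.perf_counter()
    for d in data:
        fn(d)
    dt = time.perf_counter() - t0
    print(f"{name:12s} {len(data) / dt:7.1f} chunks/s")

timed("sha256", lambda d: hashlib.sha256(d).digest())
timed("zlib level 1", lambda d: zlib.compress(d, 1))
timed("zlib level 6", lambda d: zlib.compress(d, 6))
```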