
multithreading: input file discovery / reading parallelism

Open ThomasWaldmann opened this issue 7 years ago • 36 comments

Talked with @fd0 at 34C3 about multithreading, and he mentioned that the sweet spot for input file discovery / reading parallelism seems to be 2 (for hard disks and HDD-based RAID) when doing a backup.

1 is too little to use all the available resources / capabilities of the hardware. More than 2 is too much and overloads the hardware (random HDD seeks in this case).
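
(For concreteness, a minimal Python sketch of what this "reading parallelism" knob means: one thread walks the tree while a configurable number of worker threads read file contents. Hypothetical names, illustration only, not borg's actual code.)

import os
import sys
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    # file I/O releases the GIL, so reads from several threads overlap
    total = 0
    try:
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):   # 1 MiB at a time
                total += len(chunk)
    except OSError:
        pass
    return total

def read_tree(root, workers=2):
    paths = (os.path.join(dirpath, name)
             for dirpath, _dirs, files in os.walk(root)
             for name in files)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_file, paths))

if __name__ == "__main__":
    print(read_tree(sys.argv[1] if len(sys.argv) > 1 else ".", workers=2), "bytes read")

With workers=1 this is essentially the current single-threaded behaviour; the question in this issue is what the best default for workers would be.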

ThomasWaldmann avatar Dec 31 '17 03:12 ThomasWaldmann

For an SSD this can be quite different, so it would be great if one could adjust the number of threads for that case. Currently it takes an hour to scan my whole disk with a transfer size of less than 500 MB. So for now I'll use borgbackup as a less frequent addition to TimeMachine (which I don't fully trust, I've seen files missing on spot checks).

It would also be great to know what parts take time, whether the scanning, matching, crypto, transfer or whatever.

fschulze avatar Dec 31 '17 15:12 fschulze

I ran several tests on very different hardware (with the help of a bunch of people), and it mostly did not help to read more than 2 files in parallel on SSD and/or RAID. But in order to do this in a more scientific way (I did not save the program and the data), I'd like to re-run these tests and graph the results.

Would you like to help with that and/or participate?

fd0 avatar Jan 01 '18 09:01 fd0

@fd0 I can certainly provide measurements from my end if I don't have to set up too much. I'm totally fine installing a development borg version on my laptop, but would like to avoid doing anything on the storage side.

fschulze avatar Jan 01 '18 11:01 fschulze

Yeah, I plan to do that in Go (concurrency is very easy), so the test binary is just a statically linked binary that you can build locally and then copy to the test machine and run it there (even cross-compilation is very easy).

fd0 avatar Jan 01 '18 13:01 fd0

I've built a small program we can use for measurements here: https://github.com/fd0/prb

It traverses the given directory in one thread and reads all files in a specified number of worker threads. For example, benchmarking a directory on the internal SSD of my laptop gives me:

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	326863	78365	25034388499	152.982354574	163642327
2	326863	78365	25034389119	108.725610135	230252919
3	326863	78365	25034389494	93.519914623	267690465
4	326863	78365	25034389948	89.576514578	279474927
5	326863	78365	25034390236	89.055505093	281109968
6	326863	78365	25034390629	88.652750661	282387071
7	326863	78365	25034390913	88.978444428	281353434
8	326863	78365	25034391508	88.363886038	283310214
9	326863	78365	25034396005	89.240226907	280528152
10	326863	78365	25034396341	88.483356924	282927741

(Still running for the internal hard drive...)

fd0 avatar Jan 01 '18 14:01 fd0

On the internal NVMe (macOS 10.12):

Helper script:

#!/bin/sh
TARGET=$HOME
for i in 1 2 3 4 5 6 7 8 9 10; do
	# flush and drop the filesystem cache before each run (macOS)
	sync && sudo purge
	bin/prb --workers "$i" --output /tmp/benchmarks.csv "$TARGET"
done

Results:

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	134597	22829	33534469011	43.31894966	774129319
2	134597	22829	33534390461	26.958458113	1243928355
3	134597	22829	33534441292	22.807006458	1470356986
4	134597	22829	33534441365	20.826685577	1610166977
5	134597	22829	33534295113	20.906010019	1604050465
6	134597	22829	33534344292	21.007518011	1596302060
7	134597	22829	33534391668	20.661027471	1623074734
8	134597	22829	33534399908	20.66181482	1623013283
9	134597	22829	33534453044	21.010812681	1596056923
10	134597	22829	33534521348	20.446736352	1640091639

jkahrs avatar Jan 01 '18 19:01 jkahrs

Next data point: my internal hard disk:

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	11559	268	55096088012	852.778747346	64607717
2	11559	268	55096088012	974.067868801	56562884
3	11559	268	55096088012	1010.936685754	54500038
4	11559	268	55096088012	1057.294799461	52110431
5	11559	268	55096088012	1075.07961856	51248379
6	11559	268	55096088012	1110.684131519	49605541
7	11559	268	55096088012	1159.329260498	47524107
8	11559	268	55096088012	1180.693423095	46664177
9	11559	268	55096088012	1215.789427597	45317130
10	11559	268	55096088012	1252.950178241	43973087

fd0 avatar Jan 01 '18 20:01 fd0

Another machine, reading data from an SSD via SATA:

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	88389	25083	14306522715	52.196173041	274091410
2	88389	25083	14306522715	35.386510019	404293124
3	88389	25083	14306522715	31.67159325	451714651
4	88389	25083	14306522715	31.08763338	460199801
5	88389	25083	14306522715	31.059477186	460616984
6	88389	25083	14306522715	31.167345965	459022809
7	88389	25083	14306522715	31.012411926	461316028
8	88389	25083	14306522715	30.927243427	462586416
9	88389	25083	14306522715	30.926412386	462598847
10	88389	25083	14306522715	31.153801543	459222374

fd0 avatar Jan 02 '18 10:01 fd0

Same system, connected via USB3:

5400 rpm HDD

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	44370	12234	2557745423	64.929517906	39392644
2	44370	12234	2557745423	45.042011148	56785773
3	44370	12234	2557745423	47.941166942	53351755
4	44370	12234	2557745423	51.798936151	49378338
5	44370	12234	2557745423	54.807403461	46667881
6	44370	12234	2557745423	57.106539	44789011
7	44370	12234	2557745423	58.573511115	43667271
8	44370	12234	2557745423	60.245337734	42455491
9	44370	12234	2557745423	62.257068435	41083614
10	44370	12234	2557745423	64.328836775	39760479

SSD

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	12448	7383	44389369458	111.594261105	397774661
2	12448	7383	44389369458	104.685872615	424024449
3	12448	7383	44389369458	104.894013065	423183060
4	12448	7383	44389369458	104.908220452	423125749
5	12448	7383	44389369458	104.540388212	424614545
6	12448	7383	44389369458	104.80080464	423559433
7	12448	7383	44389369458	105.194333052	421974912
8	12448	7383	44389369458	104.782470494	423633545
9	12448	7383	44389369458	105.063509001	422500351
10	12448	7383	44389369458	105.366123482	421286918

jkahrs avatar Jan 02 '18 11:01 jkahrs

I guess the only thing still missing is some RAID systems with many HDDs or SSDs.

ThomasWaldmann avatar Jan 02 '18 13:01 ThomasWaldmann

NVMe in a late 2016 15" MacBook Pro (4 cores with Hyperthreading):

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	4243314	756425	301394839765	1018.361214516	295960642
2	4243380	756425	301396556781	684.329069022	440426353
3	4243440	756426	301397998843	558.604872262	539554905
4	4243475	756426	301399037339	523.684396133	575535646
5	4243519	756426	301400593097	509.297344263	591796907
6	4243566	756427	301401846249	522.760056444	576558676
7	4244200	756429	301429018785	530.952168634	567714074
8	4244429	756429	301437311356	529.609069741	569169465
9	4244469	756429	301438821859	509.666049886	591443793
10	4244508	756429	301440390469	510.457233737	590530157

@jkahrs what machine do you have? Your NVMe speed is impressive.

Now this all got me thinking. Is borg actually reading all the files for every backup? I thought it was more like rsync, which only reads files if their stats changed.

If it actually only reads files when the stats change, then the directory traversal is the bottleneck. As you can see, just my home directory has more than 4 million files.

If traversal is the bottleneck, does borg already use https://pypi.python.org/pypi/scandir? Making traversal multithreaded is most likely harder, but could speed things up a lot. What about xattrs and resource forks? Since scandir doesn't include them, I guess fetching those could be multithreaded more easily.
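
(For reference, a minimal sketch of what scandir-style traversal looks like; os.scandir has been in the Python stdlib since 3.5. Illustration only, not borg's actual traversal code.)

import os

def walk(path):
    # DirEntry.is_dir()/is_file() usually come straight from the directory
    # listing (d_type on Linux), and entry.stat() results are cached on the
    # entry, so traversal avoids a separate listdir + stat for type checks.
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                yield from walk(entry.path)
            else:
                st = entry.stat(follow_symlinks=False)
                yield entry.path, st.st_size, st.st_mtime_ns

for path, size, mtime_ns in walk("."):
    pass   # feed into the changed/unchanged decision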

If borg actually reads files for all backups, then an option to work like rsync would be very useful for me. I have backups of servers where I know that files don't change without stat changes, and they have HDDs, where not reading the data would help a lot.

fschulze avatar Jan 02 '18 13:01 fschulze

NVMe SSD in my workstation (with a few big VM files)

workers files   dirs    bytes   time (seconds)  bandwidth (per second)
1       29      14      60873793654     22.810396927    2668686294    # 2.67GB/s
2       29      14      60873793654     28.385084934    2144569720
3       29      14      60873793654     34.139300982    1783100177
4       29      14      60873793654     33.857192252    1797957527
5       29      14      60873793654     33.319131428    1826992212
6       29      14      60873793654     33.566628508    1813521237
7       29      14      60873793654     33.624335908    1810408800
8       29      14      60873793654     33.641778505    1809470139
9       29      14      60873793654     33.238669888    1831414850
10      29      14      60873793654     33.186091691    1834316442

Looks like reading big files gets worse with more workers.

ThomasWaldmann avatar Jan 02 '18 14:01 ThomasWaldmann

Same, but with more and smaller/medium files.

workers files   dirs    bytes   time (seconds)  bandwidth (per second)
1       268881  35541   59742969794     40.711983296    1467454173  # 1.47 GB/s
2       268881  35541   59742969794     40.119823345    1489113480
3       268881  35541   59742969794     41.636482683    1434870717
4       268881  35541   59742969794     40.749647873    1466097817
5       268881  35541   59742969794     39.5652761      1509984908
6       268881  35541   59742969794     39.127989676    1526860191
7       268881  35541   59742969794     38.880705822    1536571122
8       268881  35541   59742969794     38.749396771    1541778060
9       268881  35541   59742969794     38.660238397    1545333714
10      268881  35541   59742969794     38.665501079    1545123382

ThomasWaldmann avatar Jan 02 '18 15:01 ThomasWaldmann

@fschulze borg does not open unchanged files, but it fetches stats, xattrs, ACLs, and bsdflags.
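
(Roughly, and heavily simplified, the "skip unchanged files" decision compares a cached stat snapshot against the current one; a hypothetical sketch, not the real files-cache code, which tracks more than this.)

def needs_rereading(cached, st):
    # cached: (size, mtime_ns, inode) remembered from the previous backup,
    # or None if the file was not seen before
    if cached is None:
        return True
    size, mtime_ns, inode = cached
    return (st.st_size != size
            or st.st_mtime_ns != mtime_ns
            or st.st_ino != inode)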

ThomasWaldmann avatar Jan 02 '18 15:01 ThomasWaldmann

@fschulze this is the late 2016 13" model. I had the feeling that after downgrading back to Sierra with encrypted HFS+ the I/O went way up.

Software RAID5 HDD (8 disks, ext4):

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	225393	44283	16670948275	207.924925535	80177728
2	225393	44283	16670948275	152.340443245	109432189
3	225393	44283	16670948275	133.244525906	125115445
4	225393	44283	16670948275	122.852574096	135698811
5	225393	44283	16670948275	117.991702629	141289157
6	225393	44283	16670948275	131.906811125	126384287
7	225393	44283	16670948275	111.734195296	149201846
8	225393	44283	16670948275	112.290326643	148462906
9	225393	44283	16670948275	108.54295232	153588491
10	225393	44283	16670948275	106.306041287	156820328

jkahrs avatar Jan 02 '18 15:01 jkahrs

@jkahrs how many disks in total?

ThomasWaldmann avatar Jan 02 '18 15:01 ThomasWaldmann

@jkahrs I'm still on Sierra. I got the 512GB NVMe, which one do you have? I'm kinda underwhelmed by the performance of mine now.

fschulze avatar Jan 02 '18 15:01 fschulze

@ThomasWaldmann updated the comment. @fschulze that's also a 512GB drive. You seem to have a lot more files and folders than me.

jkahrs avatar Jan 02 '18 15:01 jkahrs

Ahh, now that looks different:

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	7660	2428	11783418573	13.727851779	858358522
2	7660	2428	11783418573	9.892907606	1191097606
3	7660	2428	11783418573	7.561183771	1558409229
4	7660	2428	11783418573	6.532947498	1803690995
5	7660	2428	11783418573	6.051080952	1947324563
6	7660	2428	11783418573	5.801837963	2030980294
7	7660	2428	11783418573	5.720344899	2059914005
8	7660	2428	11783418573	5.488674043	2146860695
9	7660	2428	11783418573	5.602929183	2103081832
10	7660	2428	11783418573	5.485737988	2148009729

Fewer but bigger files now.

Interestingly, for my machine 5 threads is the sweet spot, even though it's a 4-core machine. Probably because traversal isn't multithreaded and fits into the "hyperthreads".

So I think this benchmark shows that we should read files with more threads, but speeding up traversal if possible would be a bigger win for frequent backups.
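
(A sketch of what multithreaded traversal could look like: worker threads pull directories from a shared queue and push subdirectories back onto it. Hypothetical, not borg's code; a backup tool would likely also want a deterministic item ordering, which this sketch does not provide.)

import os
import queue
import threading

def parallel_walk(root, workers=4):
    dirs = queue.Queue()
    dirs.put(root)
    files, lock = [], threading.Lock()

    def worker():
        while True:
            path = dirs.get()
            try:
                with os.scandir(path) as it:
                    for entry in it:
                        if entry.is_dir(follow_symlinks=False):
                            dirs.put(entry.path)   # more work for the pool
                        else:
                            with lock:
                                files.append(entry.path)
            except OSError:
                pass
            finally:
                dirs.task_done()

    for _ in range(workers):
        # daemon threads: they just die when the program exits
        threading.Thread(target=worker, daemon=True).start()
    dirs.join()   # returns once every queued directory has been processed
    return files

print(len(parallel_walk(".", workers=4)), "files found")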

fschulze avatar Jan 02 '18 15:01 fschulze

2 x HDD 7200rpm (RAID1/mirror), images...

workers files   dirs    bytes   time (seconds)  bandwidth (per second)
1       10284   133     25842061595     197.603061023   130777638
2       10284   133     25842061595     113.138364694   228411128
3       10284   133     25842061595     129.485594142   199574800
4       10284   133     25842061595     130.49217482    198035335
5       10284   133     25842061595     130.815553928   197545787
6       10284   133     25842061595     132.337407454   195274050
7       10284   133     25842061595     133.104915206   194148063
8       10284   133     25842061595     133.876554421   193029031
9       10284   133     25842061595     133.90702043    192985113
10      10284   133     25842061595     135.983645633   190038011

Verdict: no parallelism at all, the disk head can only be in one place at a time. Ask it to do more at once, and you only get worse results.

NB: 2 heads in mirror, both can be used for reading simultaneously, thus optimal parallelism in this case = 2.

zcalusic avatar Jan 02 '18 15:01 zcalusic

NVMe 256GB, /home, lots of small files

workers files   dirs    bytes   time (seconds)  bandwidth (per second)
1       576739  127136  35025161895     105.954562112   330567756
2       576741  127136  35025467336     57.705307912    606971327
3       576741  127136  35025473059     44.181948115    792755289
4       576741  127136  35025484264     38.362061341    913024040
5       576741  127136  35025479521     35.911602085    975324894
6       576741  127136  35025482037     34.545554043    1013892612
7       576741  127136  35025482677     37.935898329    923280697
8       576741  127136  35025485464     33.846264666    1034840500
9       576741  127136  35025488407     33.539483021    1044306150
10      576741  127136  35025521527     33.082622413    1058728691

Verdict: it's a known fact that SSD storage has some internal parallelism, due to the way it's built. The tests show that a parallelism of ~4-6 works best, and there's nothing to be gained above that (though there's no slowdown either).

zcalusic avatar Jan 02 '18 16:01 zcalusic

Finally, the most interesting case.

MooseFS distributed networked file system, consisting of 6 storage servers (in another country, 13ms away)

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	27	1	3235793142	262.119989982	12344701
2	27	1	3235793142	87.114840502	37143994
3	27	1	3235793142	86.260050541	37512071
4	27	1	3235793142	62.383132636	51869680
5	27	1	3235793142	54.943678061	58892910
6	27	1	3235793142	55.362500856	58447380
7	27	1	3235793142	48.599625005	66580619
8	27	1	3235793142	66.102590446	48951079
9	27	1	3235793142	45.514665676	71093417
10	27	1	3235793142	42.760927895	75671724

Verdict: of course, once network latencies come into play, parallelism starts to be very helpful. In this particular case the bandwidth is shared with other network users (on both sides), so it's not easy to get stable results. Yet, in general, as parallelism goes up, so does the throughput. If network latency were even higher, or the client had more bandwidth available, then even higher parallelism (> 10) would be useful.

zcalusic avatar Jan 02 '18 17:01 zcalusic

Two NVMe drives in software RAID 1, /home, lots of small files.

workers files   dirs    bytes   time (seconds)  bandwidth (per second)
1       100955  11789   11300634506     26.483996109    426696728
2       100955  11789   11300634506     16.512433857    684371220
3       100955  11789   11300634506     13.551913292    833877421
4       100955  11789   11300634506     12.153322282    929839120
5       100955  11789   11300634506     12.08701521     934940041
6       100955  11789   11300634506     11.974250177    943744646
7       100955  11789   11300634506     12.331832894    916379146
8       100955  11789   11300636242     11.390475056    992112812
9       100955  11789   11300636242     10.854376981    1041113300
10      100955  11789   11300636242     10.7035321      1055785710

jdchristensen avatar Jan 02 '18 19:01 jdchristensen

Since chunking, compressing, encrypting, etc. will take time, will having multiple file traversal threads help much in practice? I guess it will help in the common case where very little has changed.

jdchristensen avatar Jan 02 '18 19:01 jdchristensen

Since chunking, compressing, encrypting, etc will take time, will having multiple file traversal threads help much in practice?

I think so, yes, if it's not too many threads all at once. Usually you have a pipeline of the individual stages (chunking, hashing for dedup, compression, archival), and keeping this pipeline well fed is important. Building a sample pipeline into the test program would be easy; do you think it's relevant to try that as well?
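
(For illustration, a toy version of such a pipeline in Python: a reader feeds chunks into a bounded queue, and worker threads run stand-ins for the processing stages. The bounded queue provides the back-pressure that keeps reading and processing in step. Hypothetical structure, not the actual borg or restic pipeline.)

import hashlib
import os
import queue
import sys
import threading
import zlib

work = queue.Queue(maxsize=16)   # bounded: the reader blocks if workers lag
SENTINEL = None

def read_files(root):
    # single reader thread (here: the main thread) feeding chunks into the queue
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while chunk := f.read(1 << 20):
                        work.put(chunk)        # blocks when the queue is full
            except OSError:
                pass

def process():
    # downstream stage: stand-ins for dedup hashing and compression
    while (chunk := work.get()) is not SENTINEL:
        hashlib.sha256(chunk).digest()
        zlib.compress(chunk, 1)
    work.put(SENTINEL)                         # let the next worker stop too

if __name__ == "__main__":
    workers = [threading.Thread(target=process) for _ in range(2)]
    for w in workers:
        w.start()
    read_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    work.put(SENTINEL)
    for w in workers:
        w.join()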

fd0 avatar Jan 02 '18 19:01 fd0

Interesting. I upgraded to High Sierra 10.13.2, and with many small files it is now quite a bit slower with less than 4 threads and a good bit faster with 4 or more threads. For the few bigger files the difference is within measurement margins, I'd say.

Many small files:

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	4154598	752994	296045823508	1073.039318989	275894665
2	4154598	752994	296054272669	844.524821526	350557218
3	4154624	752997	296058951400	702.868742683	421215133
4	4154631	752999	296062264956	461.546142609	641457565
5	4154631	752999	296066096556	416.281698488	711215740
6	4154631	752999	296068853316	407.582615489	726402064
7	4154631	752999	296073022756	398.81490944	742382031
8	4154631	752999	296076032793	406.297594362	728717169
9	4154631	752999	296080745761	403.990764217	732889887
10	4154654	753000	296090027843	407.165517793	727198190

Few bigger files:

workers	files	dirs	bytes	time (seconds)	bandwidth (per second)
1	7660	2428	11783418573	13.745932444	857229483
2	7660	2428	11783418573	9.880593585	1192582052
3	7660	2428	11783418573	8.10193976	1454394740
4	7660	2428	11783418573	6.941451101	1697543986
5	7660	2428	11783418573	6.44924223	1827101255
6	7660	2428	11783418573	6.098571202	1932160531
7	7660	2428	11783418573	5.798642183	2032099619
8	7660	2428	11783418573	5.704184401	2065749938
9	7660	2428	11783418573	5.739902085	2052895397
10	7660	2428	11783418573	5.908465865	1994327942

fschulze avatar Jan 04 '18 17:01 fschulze

@fschulze that's interesting. Is that encrypted APFS?

jkahrs avatar Jan 04 '18 17:01 jkahrs

@jkahrs both are full disk encryption; previously it was HFS+, now APFS. I wonder if the first mitigations for Meltdown and Spectre in 10.13.2 are causing some of the slowdowns with few threads, due to switching between kernel and userland.

fschulze avatar Jan 05 '18 09:01 fschulze

@fschulze I'd guess those changes came with https://support.apple.com/de-de/HT208331 which would have included Sierra. Maybe my prior installation was just messed up in some way.

jkahrs avatar Jan 05 '18 13:01 jkahrs

@jkahrs good to know that those fixes seem to be included for El Capitan. I'm waiting for a new Mac mini to replace that box.

fschulze avatar Jan 05 '18 14:01 fschulze