WIP: experiment with custom git archive command
Goal is to see the viability of replacing git archive with a format and command optimized to send only what sourcegraph cares about.
For the last few days I've been experimenting with an alternative to git archive which is aware of sourcegraph's ignore policies (.sourcegraph/ignore and large file limits). Additionally, I wanted to make it aware of diffing trees, so that we could end up with a fast way to get just what has changed. This has been implemented in go-git. The WIP code is at zoekt#424.
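To make the filtering idea concrete, here is a minimal sketch of the per-file decision, not the code from the WIP branch. The `policy` type, its fields, and the prefix-matching rule are all hypothetical; the real .sourcegraph/ignore semantics are not shown here and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// policy is a hypothetical stand-in for the ignore policy: a list of
// ignored path prefixes plus a maximum file size in bytes.
type policy struct {
	ignorePrefixes []string
	maxFileSize    int64
}

// include reports whether a blob at path with the given size should be
// written to the archive under this policy.
func (p policy) include(path string, size int64) bool {
	if size > p.maxFileSize {
		return false
	}
	for _, prefix := range p.ignorePrefixes {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	p := policy{ignorePrefixes: []string{"vendor/"}, maxFileSize: 1 << 20}
	fmt.Println(p.include("cmd/main.go", 512))
	fmt.Println(p.include("vendor/x/y.go", 512))
	fmt.Println(p.include("assets/big.bin", 5<<20))
}
```

When an ignored prefix names a whole directory, a tree walker can skip that subtree entirely instead of checking each file, which is where most of the potential saving comes from.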
go-git just seems to be slow. It is roughly twice as slow (1.8x to 2.5x in the table below), even though it ends up needing to unmarshal far fewer objects. This approach is likely worth exploring further though, since I suspect it would scale with the size of the output if we were as fast as git.
See this table for comparison:
| repo | output(git) | output(sg) | time(git) | time(sg) |
|---|---|---|---|---|
| megarepo | 3.47GB | 2.83GB | 52s | 132s |
| sourcegraph | 145MB | 96MB | 1.1s | 2s |
Note: megarepo was recorded on the git-combine pod, and its times are user CPU time. sourcegraph was recorded on my macbook, and its times are wall clock.
From profiling, there are some surprising things. For example, 12% of the time is spent in Packfile.Close. This tells me there is likely no state keeping packfiles open, which means we are likely paying a huge cost per object just opening and looking inside of packfiles. Hopefully I can adjust my usage of the API. Alternatively, I could introduce more state into go-git for performance.
The other approach I was considering next was writing this command in rust. In the past I wrote a small program using rust's bindings to libgit2 and it was pleasant.
GIT_DIR=$PWD /usr/bin/time -v git archive HEAD 2> git-archive.time | wc -c > git-archive.size
GIT_DIR=$PWD /usr/bin/time -v git-sg 2> git-sg.time | wc -c > git-sg.size
3 473 141 760
Command being timed: "git archive HEAD"
User time (seconds): 52.32
System time (seconds): 9.97
Percent of CPU this job got: 90%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1m 8.88s
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8755968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 867731
Voluntary context switches: 301216
Involuntary context switches: 88
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
2 839 758 848
Command being timed: "../git-sg"
User time (seconds): 132.83
System time (seconds): 11.88
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2m 26.00s
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 16032144
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 671416
Voluntary context switches: 194976
Involuntary context switches: 272
Swaps: 0
File system inputs: 864
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Sourcegraph repo on the mac
/usr/bin/time -l git sg | wc -c
2.09 real 1.41 user 0.57 sys
333512704 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
82341 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
181 signals received
5951 voluntary context switches
3009 involuntary context switches
26217999 instructions retired
22263437 cycles elapsed
733184 peak memory footprint
96 385 536
/usr/bin/time -l git archive HEAD | wc -c
1.12 real 0.54 user 0.13 sys
200888320 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
43711 page reclaims
5613 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
9579 voluntary context switches
327 involuntary context switches
3305843152 instructions retired
2552117752 cycles elapsed
96268288 peak memory footprint
145 233 920
Update from yesterday I forgot to post:
I spent a lot of time writing some fun integration with git-cat-file. The code is quite nice and performant, but still doesn't beat git archive, even though archive sends 1.5x more data (145MB vs 96MB). This is on sourcegraph/sourcegraph.
Hyperfine results:
$ hyperfine -w 1 'git archive --worktree-attributes --format=tar HEAD' 'git sg' 'GIT_SG_FILTER=1 git sg' 'GIT_SG_CATFILE=1 git sg'
Benchmark 1: git archive --worktree-attributes --format=tar HEAD
Time (mean ± σ): 338.5 ms ± 3.3 ms [User: 310.0 ms, System: 28.4 ms]
Range (min … max): 335.3 ms … 344.9 ms 10 runs
Benchmark 2: git sg
Time (mean ± σ): 905.7 ms ± 16.0 ms [User: 837.3 ms, System: 95.6 ms]
Range (min … max): 878.1 ms … 926.4 ms 10 runs
Benchmark 3: GIT_SG_FILTER=1 git sg
Time (mean ± σ): 377.8 ms ± 6.4 ms [User: 388.1 ms, System: 95.1 ms]
Range (min … max): 367.6 ms … 388.2 ms 10 runs
Benchmark 4: GIT_SG_CATFILE=1 git sg
Time (mean ± σ): 451.8 ms ± 10.5 ms [User: 372.7 ms, System: 155.8 ms]
Range (min … max): 441.7 ms … 478.6 ms 10 runs
Summary
'git archive --worktree-attributes --format=tar HEAD' ran
1.12 ± 0.02 times faster than 'GIT_SG_FILTER=1 git sg'
1.33 ± 0.03 times faster than 'GIT_SG_CATFILE=1 git sg'
2.68 ± 0.05 times faster than 'git sg'
Looking at CPU profiles for cat-file, we spend as much time running Info as Contents. To me this is a sign that the overhead of RPC / Info is not worth it. We could look into a queue-like design to send multiple blob/info requests out before reading, but that seems complicated, and based on the perf I doubt it will be faster than archive.
Final attempt in this experiment: mix together git ls-tree -r -l (to get object sizes) with git cat-file for contents only. This does mean we will explore trees which are excluded. This is fine for now, but is an overhead when thinking about a future with sub-repo perms. Additionally, it won't affect the repo I am testing against, since it has no ignore rules (only size filters).
Using ls-tree is pretty much the same speed as git archive on the sourcegraph repo. We only skip 7 files in that repo, which means it's hard to beat the speed of git archive.
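For reference, a minimal sketch of parsing one line of git ls-tree -r -l output so that the size filter can run before any contents are requested. This is not the code from the branch. The `entry` type and `parseLsTreeLine` are hypothetical names, and the sketch assumes paths are not quoted by git (for example because -z is used and the input is split on NUL first).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// entry is one record of `git ls-tree -r -l`, which prints
// "<mode> <type> <oid> <size>\t<path>" with the size padded by spaces.
type entry struct {
	mode string
	typ  string
	oid  string
	size int64 // -1 when git prints "-" (non-blob entries such as submodules)
	path string
}

// parseLsTreeLine parses a single record without its line terminator.
func parseLsTreeLine(line string) (entry, error) {
	meta, path, ok := strings.Cut(line, "\t")
	if !ok {
		return entry{}, fmt.Errorf("missing tab in %q", line)
	}
	fields := strings.Fields(meta) // tolerates the padding before the size
	if len(fields) != 4 {
		return entry{}, fmt.Errorf("want 4 fields, got %d in %q", len(fields), line)
	}
	e := entry{mode: fields[0], typ: fields[1], oid: fields[2], size: -1, path: path}
	if fields[3] != "-" {
		size, err := strconv.ParseInt(fields[3], 10, 64)
		if err != nil {
			return entry{}, err
		}
		e.size = size
	}
	return e, nil
}

func main() {
	const maxSize = 1 << 20 // illustrative limit, not the real setting
	lines := []string{
		"100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391       0\tREADME.md",
		"100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad 5242880\tassets/big.bin",
	}
	for _, line := range lines {
		e, err := parseLsTreeLine(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(e.path, "include:", e.size >= 0 && e.size <= maxSize)
	}
}
```

Only the object IDs of entries that pass the filter would then be sent to git cat-file, so no per-object Info request is needed.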
There is opportunity to make it faster:
- async send and read of contents from git-cat-file
- minor: directly use gitCatFileBatchReader and use git cat-file --batch
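A rough sketch of what the async send and read could look like, assuming the git cat-file --batch response format of "<oid> <type> <size>", a newline, the contents, and a trailing newline. This is not the code from the branch: `readBatchEntry`, `fetchAll`, `discardCloser`, and `fetchCanned` are hypothetical names, and `fetchCanned` replays canned output in place of a real git process so the example runs anywhere.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// readBatchEntry reads one `git cat-file --batch` response:
// "<oid> <type> <size>\n<contents>\n".
func readBatchEntry(r *bufio.Reader) (string, []byte, error) {
	header, err := r.ReadString('\n')
	if err != nil {
		return "", nil, err
	}
	fields := strings.Fields(header)
	if len(fields) != 3 { // for example "<oid> missing"
		return "", nil, fmt.Errorf("unexpected header %q", header)
	}
	size, err := strconv.ParseInt(fields[2], 10, 64)
	if err != nil {
		return "", nil, err
	}
	contents := make([]byte, size)
	if _, err := io.ReadFull(r, contents); err != nil {
		return "", nil, err
	}
	if _, err := r.Discard(1); err != nil { // trailing newline after contents
		return "", nil, err
	}
	return fields[0], contents, nil
}

// fetchAll writes every request from its own goroutine while the caller
// reads responses in order, so requests are not serialized behind reads.
// If reading fails early against a real process, the caller should kill
// the process so the writer goroutine is unblocked.
func fetchAll(stdin io.WriteCloser, stdout io.Reader, oids []string, fn func(oid string, contents []byte) error) error {
	writeErr := make(chan error, 1)
	go func() {
		w := bufio.NewWriter(stdin)
		for _, oid := range oids {
			fmt.Fprintln(w, oid)
		}
		err := w.Flush()
		if cerr := stdin.Close(); err == nil {
			err = cerr
		}
		writeErr <- err
	}()

	r := bufio.NewReader(stdout)
	for range oids {
		oid, contents, err := readBatchEntry(r)
		if err != nil {
			return err
		}
		if err := fn(oid, contents); err != nil {
			return err
		}
	}
	return <-writeErr
}

// discardCloser stands in for the stdin pipe of a real git process.
type discardCloser struct{ io.Writer }

func (discardCloser) Close() error { return nil }

// fetchCanned runs fetchAll against canned output instead of a process.
func fetchCanned(canned string, oids []string) (map[string]string, error) {
	got := make(map[string]string)
	err := fetchAll(discardCloser{io.Discard}, strings.NewReader(canned), oids, func(oid string, contents []byte) error {
		got[oid] = string(contents)
		return nil
	})
	return got, err
}

func main() {
	got, err := fetchCanned("aaa blob 5\nhello\nbbb blob 3\nfoo\n", []string{"aaa", "bbb"})
	fmt.Println(got, err)
}
```

With a real process, stdin and stdout would come from StdinPipe and StdoutPipe on an exec.Cmd running git cat-file --batch.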
I did some profiling, and this solution barely generates any garbage, so it is super efficient. This means I'll export the code and integrate it directly into gitserver to try and create an end-to-end demo.
A note on buffering: when testing with hyperfine, adding output buffering slowed it down slightly. I wonder if in practice the buffer will be more important though, due to the output going over the network rather than to /dev/null.
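To illustrate why buffering might matter more over a network, here is a small sketch that counts how many writes reach the underlying writer with and without a bufio.Writer. `countingWriter` and `countWrites` are hypothetical helpers, and the 64KB buffer size is an arbitrary choice rather than a measured optimum.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter counts Write calls, standing in for write syscalls on a
// socket or pipe.
type countingWriter struct{ writes int }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.writes++
	return len(p), nil
}

// countWrites emits n small chunks, optionally through a bufio.Writer, and
// returns how many writes reached the underlying writer.
func countWrites(buffered bool, n int) int {
	dst := &countingWriter{}
	var w io.Writer = dst
	var bw *bufio.Writer
	if buffered {
		bw = bufio.NewWriterSize(dst, 64<<10)
		w = bw
	}
	chunk := []byte("0123456789")
	for i := 0; i < n; i++ {
		w.Write(chunk)
	}
	if bw != nil {
		bw.Flush() // without this the tail of the output would be lost
	}
	return dst.writes
}

func main() {
	fmt.Println("unbuffered writes:", countWrites(false, 1000))
	fmt.Println("buffered writes:  ", countWrites(true, 1000))
}
```

Writes to /dev/null are close to free, so the extra copy through the buffer can show up as a small slowdown in hyperfine, whereas on a socket each write has a real cost.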
$ hyperfine -w 1 'git archive --worktree-attributes --format=tar HEAD' 'git sg' 'GIT_SG_FILTER=1 git sg' 'GIT_SG_CATFILE=1 git sg' 'GIT_SG_LSTREE=1 git sg'
Benchmark 1: git archive --worktree-attributes --format=tar HEAD
Time (mean ± σ): 348.1 ms ± 3.8 ms [User: 319.4 ms, System: 28.1 ms]
Range (min … max): 342.0 ms … 353.2 ms 10 runs
Benchmark 2: git sg
Time (mean ± σ): 921.3 ms ± 12.0 ms [User: 862.0 ms, System: 91.2 ms]
Range (min … max): 899.9 ms … 937.4 ms 10 runs
Benchmark 3: GIT_SG_FILTER=1 git sg
Time (mean ± σ): 385.1 ms ± 7.8 ms [User: 395.5 ms, System: 93.1 ms]
Range (min … max): 373.8 ms … 402.2 ms 10 runs
Benchmark 4: GIT_SG_CATFILE=1 git sg
Time (mean ± σ): 451.4 ms ± 8.3 ms [User: 383.2 ms, System: 145.2 ms]
Range (min … max): 439.2 ms … 463.0 ms 10 runs
Benchmark 5: GIT_SG_LSTREE=1 git sg
Time (mean ± σ): 358.3 ms ± 4.2 ms [User: 359.0 ms, System: 113.7 ms]
Range (min … max): 352.6 ms … 367.2 ms 10 runs
Summary
'git archive --worktree-attributes --format=tar HEAD' ran
1.03 ± 0.02 times faster than 'GIT_SG_LSTREE=1 git sg'
1.11 ± 0.03 times faster than 'GIT_SG_FILTER=1 git sg'
1.30 ± 0.03 times faster than 'GIT_SG_CATFILE=1 git sg'
2.65 ± 0.05 times faster than 'git sg'