Viewing logs of specific path in large repo is much slower compared to git
Description
For example, viewing logs of README in linux kernel repo, git takes about 0.9s, jj takes 68s, 70 times slower.
Steps to Reproduce the Problem
$ git clone https://github.com/torvalds/linux.git linux-git
$ cd linux-git
$ time git log -- README >/dev/null
real 0m0.936s
user 0m0.732s
sys 0m0.204s
$ jj git clone https://github.com/torvalds/linux.git linux-jj
$ cd linux-jj
$ time jj log --no-graph README >/dev/null
real 1m8.155s
user 1m8.037s
sys 0m0.597s
Expected Behavior
The performance of viewing specific path in large repo should be comparable to git.
Actual Behavior
See above.
Specifications
- Platform: Linux
- Version: 0.22
Did you try timing the second time you do a jj log command? (Is the first one doing imports from Git for the whole repo or would that be done on the clone command?)
The time is consistent for a second time run, you will immediately notice the slowness when viewing logs of an specific path in a large repo like https://github.com/NixOS/nixpkgs or https://github.com/torvalds/linux, colocated or not.
IIRC from discussing on the Discord in the past there are two main contributors to this:
-
The default log template notes whether commits are empty; for merge commits, this involves merging the parents and checking if there is any difference (e.g. because of conflicts being resolved), and there is currently no caching of this. You can customize the templates (potentially per–repository) to avoid this.
-
Git does fancy smart bloom filter stuff that I don‘t understand to skip over commits when searching the log like that. Jujutsu doesn‘t do any of that at present. Git also has slightly different behaviour: merge commits are never shown at all in
git log ‹path›.
You can customize the templates (potentially per–repository) to avoid this.
$ time jj log --template commit_id --no-graph README >/dev/null
real 1m7.938s
user 1m7.849s
sys 0m0.572s
Showing a simple template like commit_id take about the same time as the default builtin_log_compact template.
The sys times of both are very similar. It's the real time that is inflated.
When I try it, I get
Warning: The argument "README" is being interpreted as a path. To specify a revset, pass -r "README" instead.
Does that indicate a different logic path?
Are you using commit-graph https://git-scm.com/docs/commit-graph? Can you try with commit graph disabled? (I think git -c core.commitGraph=false log ...)
- You can customize the templates (potentially per–repository) to avoid this.
That's not actually enough because we still do the same calculation to even tell whether to include a merge commit in the result. I think Git instead compares the merge commit against each of its parents to see if the file is modified there. I suppose we should be able to do that as a first step and only do the merge if more than one diff contains the file.
Git appears to omit merged branches if one of the parents has identical content. That's probably the main factor why git log README is super fast in (merge-heavy) Linux repo.
https://github.com/git/git/blob/v2.47.0/revision.c#L1200
git -c core.commitGraph=false log --full-history README is still faster than jj, but the performance is closer.
Are you using commit-graph https://git-scm.com/docs/commit-graph? Can you try with commit graph disabled? (I think
git -c core.commitGraph=false log ...)
The different is quite small when setting core.commitGraph to true and false:
$ hyperfine -w 1 'git -c core.commitGraph=false log README >/dev/null'
Benchmark 1: git -c core.commitGraph=false log README >/dev/null
Time (mean ± σ): 926.2 ms ± 3.4 ms [User: 748.5 ms, System: 177.4 ms]
Range (min … max): 922.2 ms … 932.3 ms 10 runs
$ hyperfine -w 1 'git -c core.commitGraph=true log README >/dev/null'
Benchmark 1: git -c core.commitGraph=true log README >/dev/null
Time (mean ± σ): 925.5 ms ± 1.8 ms [User: 749.9 ms, System: 175.5 ms]
Range (min … max): 923.1 ms … 928.9 ms 10 runs
Git appears to omit merged branches if one of the parents has identical content.
Good idea!
git -c core.commitGraph=false log --full-history READMEis still faster than jj, but the performance is closer.
I can confirm: On my machine, that takes about 17 seconds instead of just under 1 second without --full-history.
How would one approach fixing this? I've been using jj daily on Nixpkgs for about 2 weeks now and the speed is by far one of only two things that's been holding me back from recommending it to everyone around me 😅 I had to run jj log -r "reachable(@, mutable())" constantly to avoid spending ~20 seconds on the log alone
Perhaps, the conceptually simplest option is to add an index of file paths that touch commits. Git builds a bloom filter for that purpose, but there might be a better option.
In order to implement git log (without --full-history) behavior, we'll need to add a revset function that queries commits by file paths but omits "uninteresting" branches. Maybe this revset function can leverage graph iterator internally, but we'll need some modifications on it.
Another related issue is https://github.com/martinvonz/jj/issues/4579. I think jj log -r 'ancestors(.., nth_parent=1)' PATH is equivalent to git log --first-parent PATH, which should be fast in merge-heavy repos. This can't replace all use cases of jj log PATH, but it's relatively easy to implement.
Would just caching merge results somehow make a meaningful difference? Nixpkgs is indeed very merge‐heavy and the vast majority of those merges are empty.
Would just caching merge results somehow make a meaningful difference?
If -r~merges() PATH is fast enough, yeah, it will save time. I'm not sure if that's the case for nix. To address the original problem, we'll need to avoid unpacking commit/tree objects as much as possible.
It seems better, but unfortunately still unusably slow compared to git log, so I suppose the index is the best approach. That would allow keeping the current behaviour of including merge commits too, right?
@joyously: When I try it, I get
Warning: The argument "README" is being interpreted as a path. To specify a revset, pass -r "README" instead.
Does that indicate a different logic path?
@joyously btw that warning is spurious; we actually do want to interpret README as a path and not a revset, so your test was doing the right thing.
@yuja: Perhaps, the conceptually simplest option is to add an index of file paths that touch commits. Git builds a bloom filter for that purpose, but there might be a better option.
Before computing a full diff for a commit, git-branchless will first perform set intersection on the set of changed paths for a commit, which is a tree object comparison rather than full diffing. This lets it skip most uninteresting commits for certain operations (like deduplication).
- Does jj already do this? Is it computing a full diff for these cases? If so, maybe it's worth a try to add that fast path before building a persistent index?
- Also, git-branchless will do this in parallel; not sure if jj behaves differently.
- This also generalizes the behavior of skipping commits which are unchanged with respect to a certain parent... or maybe not, given that @martinvonz's statement that jj still tries to redo the merge.
- Possibly this suggestion isn't relevant if
jj logis primarily slow due to merge commits. But @emilazy suggests that it's still "unusably slow compared togit log" even when excluding merges, so maybe it's relevant?
I know that Sapling has a specific system/service to serve blame/annotate queries, which would presumably be relevant for the case of following the log for a specific file.
- I glanced through the public docs and the
derived_data/directory, but didn't see anything obvious to suggest that e.g. there's an index keyed by file path rather than commit, which would be architecturally interesting. - To be clear, I think everyone is proposing an index with the commit as the key and the set of changed paths as the value? I just wanted to see if Sapling did it differently.
For reference, here's some documentation on the "bloom filter" idea mentioned for Git: https://github.com/git/git/blob/25b0f41288718625b18495de23cc066394c09a92/Documentation/gitformat-commit-graph.txt#L136-L164.
- All commit-graph data is indexed with the commit as the key.
- I don't know if people are really using the bloom filter feature in practice.
- Interestingly, changes to more than 512 paths have all bits set to
1, which seems like it might not be as useful at scale for merge commits?
Would profiling traces be useful, by the way? Although anyone can generate them just by poking around with Jujutsu on Nixpkgs :)
Perhaps, the conceptually simplest option is to add an index of file paths that touch commits. Git builds a bloom filter for that purpose, but there might be a better option.
Yes, I think that's what we'll have to do, but it's clearly a bit of work to implement.
- Does jj already do this? Is it computing a full diff for these cases? If so, maybe it's worth a try to add that fast path before building a persistent index?
Yes, we already don't attempt to merge file contents in files that are not matched against. The code is here:
https://github.com/martinvonz/jj/blob/a6c18e8353d12085abe5e03f210987430f4ebd53/lib/src/default_index/revset_engine.rs#L1240-L1275
Because we pass in the file matcher into diff_stream, we won't call resolve_file_values() for files that don't match.
- Also, git-branchless will do this in parallel; not sure if jj behaves differently.
Do you mean that git-branchless does it in parallel for multiple commits? How many commits ahead?
- Possibly this suggestion isn't relevant if
jj logis primarily slow due to merge commits. But @emilazy suggests that it's still "unusably slow compared togit log" even when excluding merges, so maybe it's relevant?
I'm a bit surprised that jj is so much slower than Git when there are no merges and Git's commit graph is disabled. I can't find it now but I think I saw some say that. I don't know why that would be. Profiling might help.
FWIW:
yuyuko:~/Developer/nixpkgs (empty) (no description set)
❭ hyperfine --warmup 2 'jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" pkgs/stdenv/linux/default.nix | head -n 25' 'git log --oneline pkgs/stdenv/linux/default.nix | head -n 25'
Benchmark 1: jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" pkgs/stdenv/linux/default.nix | head -n 25
Time (mean ± σ): 11.447 s ± 0.142 s [User: 10.777 s, System: 0.637 s]
Range (min … max): 11.302 s … 11.734 s 10 runs
Benchmark 2: git log --oneline pkgs/stdenv/linux/default.nix | head -n 25
Time (mean ± σ): 2.138 s ± 0.042 s [User: 1.960 s, System: 0.179 s]
Range (min … max): 2.090 s … 2.201 s 10 runs
Summary
git log --oneline pkgs/stdenv/linux/default.nix | head -n 25 ran
5.35 ± 0.12 times faster than jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" pkgs/stdenv/linux/default.nix | head -n 25
So I would say it is still pretty dire even if you skip merges, assuming my methodology here is reasonable. (I don’t know how to tell whether Git’s commit graph is enabled or not.)
(I don’t know how to tell whether Git’s commit graph is enabled or not.)
-c core.commitGraph=true/false, but I'm not sure if setting it to true guarantees that it's used. It might be that it just allows it to be used (if the had already been created by some other process, perhaps). But you can tell if it makes a difference, of course. In the git.git repo, the difference is smaller than I expected: 1.6 s vs 1.8 s for git log t/test-lib.sh
assuming my methodology here is reasonable
The templates are slightly different, so that have some effect. Passing -T commit_id to jj log and --pretty=%h to git log should make it more fair, but it may not make any difference anyway.
-T "commit_id ++ \"\n\"" seems to be comparably slow (actually… several seconds slower? But I don’t know if that’s just my ambient system load or thermals or whatever).
Turning off the commit graph narrows the gap to Jujutsu being “only” twice as slow:
yuyuko:~/Developer/nixpkgs 128 (empty) (no description set)
❭ hyperfine --warmup 2 'git -c core.commitGraph=false log --oneline pkgs/stdenv/linux/default.nix | head -n 50' 'git -c core.commitGraph=true log --oneline pkgs/stdenv/linux/default.nix | head -n 50'
Benchmark 1: git -c core.commitGraph=false log --oneline pkgs/stdenv/linux/default.nix | head -n 50
Time (mean ± σ): 5.380 s ± 0.077 s [User: 5.101 s, System: 0.269 s]
Range (min … max): 5.291 s … 5.517 s 10 runs
Benchmark 2: git -c core.commitGraph=true log --oneline pkgs/stdenv/linux/default.nix | head -n 50
Time (mean ± σ): 2.877 s ± 0.047 s [User: 2.653 s, System: 0.224 s]
Range (min … max): 2.790 s … 2.934 s 10 runs
Summary
git -c core.commitGraph=true log --oneline pkgs/stdenv/linux/default.nix | head -n 50 ran
1.87 ± 0.04 times faster than git -c core.commitGraph=false log --oneline pkgs/stdenv/linux/default.nix | head -n 50
So it’s probably a pretty big deal here, although that still leaves a lot of performance gap left to be explained.
Turning off the commit graph narrows the gap to Jujutsu being “only” twice as slow:
Maybe most of the remainder is what Yuya mentioned in https://github.com/martinvonz/jj/issues/4674#issuecomment-2423476227
Doesn’t my use of ::@ ~ merges() rule that out?
Edit: Oh, it’s about pruning what branches it even looks at?
Correct. We could skip entire branches the file has not changed on (but we risk missing cases where it was changed and then reverted).
I suppose I’d prefer focusing on things like indexing and caching rather than things that might change semantics. (Not that I’m the one doing the work.)
Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, SSD, no other workloads on the system.
Use https://github.com/NixOS/nixpkgs default.nix to test.
Compare jj log -r "::@ ~ merges()" default.nix to git -c core.commitGraph=false log default.nix, 7 times slower:
$ hyperfine --warmup 2 'jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" default.nix | head -n 25' 'git -c core.commitGraph=false log --oneline default.nix | head -n 25'
Benchmark 1: jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" default.nix | head -n 25
Time (mean ± σ): 18.386 s ± 0.115 s [User: 17.576 s, System: 0.772 s]
Range (min … max): 18.184 s … 18.563 s 10 runs
Benchmark 2: git -c core.commitGraph=false log --oneline default.nix | head -n 25
Time (mean ± σ): 2.604 s ± 0.022 s [User: 2.449 s, System: 0.151 s]
Range (min … max): 2.589 s … 2.662 s 10 runs
Summary
git -c core.commitGraph=false log --oneline default.nix | head -n 25 ran
7.06 ± 0.07 times faster than jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" default.nix | head -n 25
Compare jj log -r "::@" default.nix to git -c core.commitGraph=false log --full-history default.nix, 100 times slower:
$ hyperfine --warmup 1 --min-runs 2 'jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@" default.nix | head -n 25' 'git -c core.commitGraph=false log --oneline --full-history default.nix | head -n 25'
Benchmark 1: jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@" default.nix | head -n 25
Time (mean ± σ): 489.629 s ± 1.678 s [User: 487.601 s, System: 1.303 s]
Range (min … max): 488.442 s … 490.815 s 2 runs
Benchmark 2: git -c core.commitGraph=false log --oneline --full-history default.nix | head -n 25
Time (mean ± σ): 4.850 s ± 0.020 s [User: 4.628 s, System: 0.212 s]
Range (min … max): 4.836 s … 4.864 s 2 runs
Summary
git -c core.commitGraph=false log --oneline --full-history default.nix | head -n 25 ran
100.95 ± 0.54 times faster than jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@" default.nix | head -n 25
The different is quite small when setting
core.commitGraphtotrueandfalse:
Compared to https://github.com/martinvonz/jj/issues/4674#issuecomment-2423699309, core.commitGraph make it 2.5 times faster when using nixpkgs default.nix to test:
$ hyperfine --warmup 2 'git -c core.commitGraph=false log default.nix >/dev/null' 'git -c core.commitGraph=true log default.nix >/dev/null'
Benchmark 1: git -c core.commitGraph=false log default.nix >/dev/null
Time (mean ± σ): 2.618 s ± 0.020 s [User: 2.464 s, System: 0.149 s]
Range (min … max): 2.597 s … 2.668 s 10 runs
Benchmark 2: git -c core.commitGraph=true log default.nix >/dev/null
Time (mean ± σ): 1.032 s ± 0.011 s [User: 0.949 s, System: 0.081 s]
Range (min … max): 1.020 s … 1.056 s 10 runs
Summary
git -c core.commitGraph=true log default.nix >/dev/null ran
2.54 ± 0.03 times faster than git -c core.commitGraph=false log default.nix >/dev/null
I think https://github.com/NixOS/nixpkgs default.nix is a good test case for this issue.
FWIW, there are less than 25 commits touching default.nix, so the | heads there will not stop the entire history from being scanned. The file also just almost never changed. That definitely accentuates the problems here – 100× slower is scary! – but I just wanted ot point out that it does mean it reflects less well the common use‐case of wanting to see one pager screen’s worth of log output quickly for a file that has been changed at least semi‐recently.
Yes, I think it's more fair to compare the time it takes to produce the full output. Because the jj log output doesn't include merge commits, it needs to walk further. The git log case finds some merge commits and can therefore terminate the walk earlier.