jj icon indicating copy to clipboard operation
jj copied to clipboard

Viewing logs of specific path in large repo is much slower compared to git

Open azuwis opened this issue 1 year ago • 47 comments

Description

For example, viewing logs of README in linux kernel repo, git takes about 0.9s, jj takes 68s, 70 times slower.

Steps to Reproduce the Problem

$ git clone https://github.com/torvalds/linux.git linux-git
$ cd linux-git
$ time git log -- README >/dev/null

real    0m0.936s
user    0m0.732s
sys     0m0.204s
$ jj git clone https://github.com/torvalds/linux.git linux-jj
$ cd linux-jj
$ time jj log --no-graph README >/dev/null

real    1m8.155s
user    1m8.037s
sys     0m0.597s

Expected Behavior

The performance of viewing specific path in large repo should be comparable to git.

Actual Behavior

See above.

Specifications

  • Platform: Linux
  • Version: 0.22

azuwis avatar Oct 18 '24 16:10 azuwis

Did you try timing the second time you do a jj log command? (Is the first one doing imports from Git for the whole repo or would that be done on the clone command?)

joyously avatar Oct 18 '24 17:10 joyously

The time is consistent for a second time run, you will immediately notice the slowness when viewing logs of an specific path in a large repo like https://github.com/NixOS/nixpkgs or https://github.com/torvalds/linux, colocated or not.

azuwis avatar Oct 18 '24 17:10 azuwis

IIRC from discussing on the Discord in the past there are two main contributors to this:

  1. The default log template notes whether commits are empty; for merge commits, this involves merging the parents and checking if there is any difference (e.g. because of conflicts being resolved), and there is currently no caching of this. You can customize the templates (potentially per–repository) to avoid this.

  2. Git does fancy smart bloom filter stuff that I don‘t understand to skip over commits when searching the log like that. Jujutsu doesn‘t do any of that at present. Git also has slightly different behaviour: merge commits are never shown at all in git log ‹path›.

emilazy avatar Oct 18 '24 17:10 emilazy

You can customize the templates (potentially per–repository) to avoid this.

$ time jj log --template commit_id --no-graph README >/dev/null
real    1m7.938s
user    1m7.849s
sys     0m0.572s

Showing a simple template like commit_id take about the same time as the default builtin_log_compact template.

azuwis avatar Oct 18 '24 17:10 azuwis

The sys times of both are very similar. It's the real time that is inflated.

joyously avatar Oct 18 '24 17:10 joyously

When I try it, I get

Warning: The argument "README" is being interpreted as a path. To specify a revset, pass -r "README" instead.

Does that indicate a different logic path?

joyously avatar Oct 18 '24 17:10 joyously

Are you using commit-graph https://git-scm.com/docs/commit-graph? Can you try with commit graph disabled? (I think git -c core.commitGraph=false log ...)

arxanas avatar Oct 18 '24 18:10 arxanas

  1. You can customize the templates (potentially per–repository) to avoid this.

That's not actually enough because we still do the same calculation to even tell whether to include a merge commit in the result. I think Git instead compares the merge commit against each of its parents to see if the file is modified there. I suppose we should be able to do that as a first step and only do the merge if more than one diff contains the file.

martinvonz avatar Oct 18 '24 18:10 martinvonz

Git appears to omit merged branches if one of the parents has identical content. That's probably the main factor why git log README is super fast in (merge-heavy) Linux repo. https://github.com/git/git/blob/v2.47.0/revision.c#L1200

git -c core.commitGraph=false log --full-history README is still faster than jj, but the performance is closer.

yuja avatar Oct 19 '24 02:10 yuja

Are you using commit-graph https://git-scm.com/docs/commit-graph? Can you try with commit graph disabled? (I think git -c core.commitGraph=false log ...)

The different is quite small when setting core.commitGraph to true and false:

$ hyperfine -w 1 'git -c core.commitGraph=false log README >/dev/null'
Benchmark 1: git -c core.commitGraph=false log README >/dev/null
  Time (mean ± σ):     926.2 ms ±   3.4 ms    [User: 748.5 ms, System: 177.4 ms]
  Range (min … max):   922.2 ms … 932.3 ms    10 runs

$ hyperfine -w 1 'git -c core.commitGraph=true log README >/dev/null'
Benchmark 1: git -c core.commitGraph=true log README >/dev/null
  Time (mean ± σ):     925.5 ms ±   1.8 ms    [User: 749.9 ms, System: 175.5 ms]
  Range (min … max):   923.1 ms … 928.9 ms    10 runs

azuwis avatar Oct 19 '24 08:10 azuwis

Git appears to omit merged branches if one of the parents has identical content.

Good idea!

git -c core.commitGraph=false log --full-history README is still faster than jj, but the performance is closer.

I can confirm: On my machine, that takes about 17 seconds instead of just under 1 second without --full-history.

martinvonz avatar Oct 20 '24 00:10 martinvonz

How would one approach fixing this? I've been using jj daily on Nixpkgs for about 2 weeks now and the speed is by far one of only two things that's been holding me back from recommending it to everyone around me 😅 I had to run jj log -r "reachable(@, mutable())" constantly to avoid spending ~20 seconds on the log alone

pluiedev avatar Nov 20 '24 08:11 pluiedev

Perhaps, the conceptually simplest option is to add an index of file paths that touch commits. Git builds a bloom filter for that purpose, but there might be a better option.

In order to implement git log (without --full-history) behavior, we'll need to add a revset function that queries commits by file paths but omits "uninteresting" branches. Maybe this revset function can leverage graph iterator internally, but we'll need some modifications on it.

Another related issue is https://github.com/martinvonz/jj/issues/4579. I think jj log -r 'ancestors(.., nth_parent=1)' PATH is equivalent to git log --first-parent PATH, which should be fast in merge-heavy repos. This can't replace all use cases of jj log PATH, but it's relatively easy to implement.

yuja avatar Nov 20 '24 10:11 yuja

Would just caching merge results somehow make a meaningful difference? Nixpkgs is indeed very merge‐heavy and the vast majority of those merges are empty.

emilazy avatar Nov 20 '24 13:11 emilazy

Would just caching merge results somehow make a meaningful difference?

If -r~merges() PATH is fast enough, yeah, it will save time. I'm not sure if that's the case for nix. To address the original problem, we'll need to avoid unpacking commit/tree objects as much as possible.

yuja avatar Nov 20 '24 14:11 yuja

It seems better, but unfortunately still unusably slow compared to git log, so I suppose the index is the best approach. That would allow keeping the current behaviour of including merge commits too, right?

emilazy avatar Nov 20 '24 14:11 emilazy

@joyously: When I try it, I get

Warning: The argument "README" is being interpreted as a path. To specify a revset, pass -r "README" instead.

Does that indicate a different logic path?

@joyously btw that warning is spurious; we actually do want to interpret README as a path and not a revset, so your test was doing the right thing.


@yuja: Perhaps, the conceptually simplest option is to add an index of file paths that touch commits. Git builds a bloom filter for that purpose, but there might be a better option.

Before computing a full diff for a commit, git-branchless will first perform set intersection on the set of changed paths for a commit, which is a tree object comparison rather than full diffing. This lets it skip most uninteresting commits for certain operations (like deduplication).

  • Does jj already do this? Is it computing a full diff for these cases? If so, maybe it's worth a try to add that fast path before building a persistent index?
  • Also, git-branchless will do this in parallel; not sure if jj behaves differently.
  • This also generalizes the behavior of skipping commits which are unchanged with respect to a certain parent... or maybe not, given that @martinvonz's statement that jj still tries to redo the merge.
  • Possibly this suggestion isn't relevant if jj log is primarily slow due to merge commits. But @emilazy suggests that it's still "unusably slow compared to git log" even when excluding merges, so maybe it's relevant?

I know that Sapling has a specific system/service to serve blame/annotate queries, which would presumably be relevant for the case of following the log for a specific file.

  • I glanced through the public docs and the derived_data/ directory, but didn't see anything obvious to suggest that e.g. there's an index keyed by file path rather than commit, which would be architecturally interesting.
  • To be clear, I think everyone is proposing an index with the commit as the key and the set of changed paths as the value? I just wanted to see if Sapling did it differently.

For reference, here's some documentation on the "bloom filter" idea mentioned for Git: https://github.com/git/git/blob/25b0f41288718625b18495de23cc066394c09a92/Documentation/gitformat-commit-graph.txt#L136-L164.

  • All commit-graph data is indexed with the commit as the key.
  • I don't know if people are really using the bloom filter feature in practice.
  • Interestingly, changes to more than 512 paths have all bits set to 1, which seems like it might not be as useful at scale for merge commits?

arxanas avatar Nov 20 '24 19:11 arxanas

Would profiling traces be useful, by the way? Although anyone can generate them just by poking around with Jujutsu on Nixpkgs :)

emilazy avatar Nov 20 '24 19:11 emilazy

Perhaps, the conceptually simplest option is to add an index of file paths that touch commits. Git builds a bloom filter for that purpose, but there might be a better option.

Yes, I think that's what we'll have to do, but it's clearly a bit of work to implement.

  • Does jj already do this? Is it computing a full diff for these cases? If so, maybe it's worth a try to add that fast path before building a persistent index?

Yes, we already don't attempt to merge file contents in files that are not matched against. The code is here:

https://github.com/martinvonz/jj/blob/a6c18e8353d12085abe5e03f210987430f4ebd53/lib/src/default_index/revset_engine.rs#L1240-L1275

Because we pass in the file matcher into diff_stream, we won't call resolve_file_values() for files that don't match.

  • Also, git-branchless will do this in parallel; not sure if jj behaves differently.

Do you mean that git-branchless does it in parallel for multiple commits? How many commits ahead?

  • Possibly this suggestion isn't relevant if jj log is primarily slow due to merge commits. But @emilazy suggests that it's still "unusably slow compared to git log" even when excluding merges, so maybe it's relevant?

I'm a bit surprised that jj is so much slower than Git when there are no merges and Git's commit graph is disabled. I can't find it now but I think I saw some say that. I don't know why that would be. Profiling might help.

martinvonz avatar Nov 20 '24 21:11 martinvonz

FWIW:

yuyuko:~/Developer/nixpkgs  (empty) (no description set)
❭ hyperfine --warmup 2 'jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" pkgs/stdenv/linux/default.nix | head -n 25' 'git log --oneline pkgs/stdenv/linux/default.nix | head -n 25'
Benchmark 1: jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" pkgs/stdenv/linux/default.nix | head -n 25
  Time (mean ± σ):     11.447 s ±  0.142 s    [User: 10.777 s, System: 0.637 s]
  Range (min … max):   11.302 s … 11.734 s    10 runs

Benchmark 2: git log --oneline pkgs/stdenv/linux/default.nix | head -n 25
  Time (mean ± σ):      2.138 s ±  0.042 s    [User: 1.960 s, System: 0.179 s]
  Range (min … max):    2.090 s …  2.201 s    10 runs

Summary
  git log --oneline pkgs/stdenv/linux/default.nix | head -n 25 ran
    5.35 ± 0.12 times faster than jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" pkgs/stdenv/linux/default.nix | head -n 25

So I would say it is still pretty dire even if you skip merges, assuming my methodology here is reasonable. (I don’t know how to tell whether Git’s commit graph is enabled or not.)

emilazy avatar Nov 20 '24 21:11 emilazy

(I don’t know how to tell whether Git’s commit graph is enabled or not.)

-c core.commitGraph=true/false, but I'm not sure if setting it to true guarantees that it's used. It might be that it just allows it to be used (if the had already been created by some other process, perhaps). But you can tell if it makes a difference, of course. In the git.git repo, the difference is smaller than I expected: 1.6 s vs 1.8 s for git log t/test-lib.sh

assuming my methodology here is reasonable

The templates are slightly different, so that have some effect. Passing -T commit_id to jj log and --pretty=%h to git log should make it more fair, but it may not make any difference anyway.

martinvonz avatar Nov 20 '24 21:11 martinvonz

-T "commit_id ++ \"\n\"" seems to be comparably slow (actually… several seconds slower? But I don’t know if that’s just my ambient system load or thermals or whatever).

Turning off the commit graph narrows the gap to Jujutsu being “only” twice as slow:

yuyuko:~/Developer/nixpkgs  128  (empty) (no description set)
❭ hyperfine --warmup 2 'git -c core.commitGraph=false log --oneline pkgs/stdenv/linux/default.nix | head -n 50' 'git -c core.commitGraph=true log --oneline pkgs/stdenv/linux/default.nix | head -n 50'
Benchmark 1: git -c core.commitGraph=false log --oneline pkgs/stdenv/linux/default.nix | head -n 50
  Time (mean ± σ):      5.380 s ±  0.077 s    [User: 5.101 s, System: 0.269 s]
  Range (min … max):    5.291 s …  5.517 s    10 runs

Benchmark 2: git -c core.commitGraph=true log --oneline pkgs/stdenv/linux/default.nix | head -n 50
  Time (mean ± σ):      2.877 s ±  0.047 s    [User: 2.653 s, System: 0.224 s]
  Range (min … max):    2.790 s …  2.934 s    10 runs

Summary
  git -c core.commitGraph=true log --oneline pkgs/stdenv/linux/default.nix | head -n 50 ran
    1.87 ± 0.04 times faster than git -c core.commitGraph=false log --oneline pkgs/stdenv/linux/default.nix | head -n 50

So it’s probably a pretty big deal here, although that still leaves a lot of performance gap left to be explained.

emilazy avatar Nov 20 '24 21:11 emilazy

Turning off the commit graph narrows the gap to Jujutsu being “only” twice as slow:

Maybe most of the remainder is what Yuya mentioned in https://github.com/martinvonz/jj/issues/4674#issuecomment-2423476227

martinvonz avatar Nov 20 '24 21:11 martinvonz

Doesn’t my use of ::@ ~ merges() rule that out?

Edit: Oh, it’s about pruning what branches it even looks at?

emilazy avatar Nov 20 '24 21:11 emilazy

Correct. We could skip entire branches the file has not changed on (but we risk missing cases where it was changed and then reverted).

martinvonz avatar Nov 20 '24 21:11 martinvonz

I suppose I’d prefer focusing on things like indexing and caching rather than things that might change semantics. (Not that I’m the one doing the work.)

emilazy avatar Nov 20 '24 21:11 emilazy

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, SSD, no other workloads on the system.

Use https://github.com/NixOS/nixpkgs default.nix to test.

Compare jj log -r "::@ ~ merges()" default.nix to git -c core.commitGraph=false log default.nix, 7 times slower:

$ hyperfine --warmup 2 'jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" default.nix | head -n 25' 'git -c core.commitGraph=false log --oneline default.nix | head -n 25'

Benchmark 1: jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" default.nix | head -n 25
  Time (mean ± σ):     18.386 s ±  0.115 s    [User: 17.576 s, System: 0.772 s]
  Range (min … max):   18.184 s … 18.563 s    10 runs
 
Benchmark 2: git -c core.commitGraph=false log --oneline default.nix | head -n 25
  Time (mean ± σ):      2.604 s ±  0.022 s    [User: 2.449 s, System: 0.151 s]
  Range (min … max):    2.589 s …  2.662 s    10 runs
 
Summary
  git -c core.commitGraph=false log --oneline default.nix | head -n 25 ran
    7.06 ± 0.07 times faster than jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@ ~ merges()" default.nix | head -n 25

Compare jj log -r "::@" default.nix to git -c core.commitGraph=false log --full-history default.nix, 100 times slower:

$ hyperfine --warmup 1 --min-runs 2 'jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@" default.nix | head -n 25' 'git -c core.commitGraph=false log --oneline --full-history default.nix | head -n 25'

Benchmark 1: jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@" default.nix | head -n 25
  Time (mean ± σ):     489.629 s ±  1.678 s    [User: 487.601 s, System: 1.303 s]
  Range (min … max):   488.442 s … 490.815 s    2 runs

Benchmark 2: git -c core.commitGraph=false log --oneline --full-history default.nix | head -n 25
  Time (mean ± σ):      4.850 s ±  0.020 s    [User: 4.628 s, System: 0.212 s]
  Range (min … max):    4.836 s …  4.864 s    2 runs
 
Summary
  git -c core.commitGraph=false log --oneline --full-history default.nix | head -n 25 ran
  100.95 ± 0.54 times faster than jj --ignore-working-copy --no-graph -T builtin_log_oneline -r "::@" default.nix | head -n 25

azuwis avatar Nov 21 '24 02:11 azuwis

The different is quite small when setting core.commitGraph to true and false:

Compared to https://github.com/martinvonz/jj/issues/4674#issuecomment-2423699309, core.commitGraph make it 2.5 times faster when using nixpkgs default.nix to test:

$ hyperfine --warmup 2 'git -c core.commitGraph=false log default.nix >/dev/null' 'git -c core.commitGraph=true log default.nix >/dev/null'

Benchmark 1: git -c core.commitGraph=false log default.nix >/dev/null
  Time (mean ± σ):      2.618 s ±  0.020 s    [User: 2.464 s, System: 0.149 s]
  Range (min … max):    2.597 s …  2.668 s    10 runs
 
Benchmark 2: git -c core.commitGraph=true log default.nix >/dev/null
  Time (mean ± σ):      1.032 s ±  0.011 s    [User: 0.949 s, System: 0.081 s]
  Range (min … max):    1.020 s …  1.056 s    10 runs
 
Summary
  git -c core.commitGraph=true log default.nix >/dev/null ran
    2.54 ± 0.03 times faster than git -c core.commitGraph=false log default.nix >/dev/null

I think https://github.com/NixOS/nixpkgs default.nix is a good test case for this issue.

azuwis avatar Nov 21 '24 03:11 azuwis

FWIW, there are less than 25 commits touching default.nix, so the | heads there will not stop the entire history from being scanned. The file also just almost never changed. That definitely accentuates the problems here – 100× slower is scary! – but I just wanted ot point out that it does mean it reflects less well the common use‐case of wanting to see one pager screen’s worth of log output quickly for a file that has been changed at least semi‐recently.

emilazy avatar Nov 21 '24 03:11 emilazy

Yes, I think it's more fair to compare the time it takes to produce the full output. Because the jj log output doesn't include merge commits, it needs to walk further. The git log case finds some merge commits and can therefore terminate the walk earlier.

martinvonz avatar Nov 21 '24 05:11 martinvonz