grimoirelab-perceval

Warning related to git rename detection

Open jgbarah opened this issue 9 years ago • 4 comments

It seems that the way Perceval runs git to get the git log can miss some renames in some cases (observed with the Linux kernel git repo).

DEBUG:Git https://github.com/torvalds/linux.git repository cloned into /tmp/tmplwaiec1i/torvalds/linux
DEBUG:Running command git fetch origin (cwd: /tmp/tmplwaiec1i/torvalds/linux, env: {'LANG': 'C'})
DEBUG:
DEBUG:Running command git reset --hard origin (cwd: /tmp/tmplwaiec1i/torvalds/linux, env: {'LANG': 'C'})
DEBUG:
DEBUG:Git https://github.com/torvalds/linux.git repository pulled into /tmp/tmplwaiec1i/torvalds/linux
DEBUG:Running command git log --raw --numstat --pretty=fuller --decorate=full --all --reverse --topo-order --parents -M -C -c --remotes=origin (cwd: /tmp/tmplwaiec1i/torvalds/linux, env: {'PAGER': '', 'LANG': 'C'})
DEBUG:warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 1569 and retry the command.

DEBUG:Git log fetched from https://github.com/torvalds/linux.git repository (/tmp/tmplwaiec1i/torvalds/linux)

So, git log is not able to track renames to the extent it could, because of that limit. I reproduced this by running git log from the command line:

$ git log --raw --numstat --pretty=fuller --decorate=full --all --reverse --topo-order --parents -M -C -c --remotes=origin > linux.log
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 1569 and retry the command.

Maybe this is a corner case, but it probably wouldn't hurt to take it into account. I wonder if we could pass some option to git log when running it from Perceval, to avoid these cases...
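
One way to notice these cases automatically would be to scan what git prints on stderr for the hint shown above. This is only a minimal sketch: the function name `suggested_rename_limit` is hypothetical and not part of Perceval, and it assumes the warning text stays as it appears in the log above.

```python
import re

# Pattern for the hint git prints when it skips inexact rename detection.
# The wording is taken from the warning quoted in this issue.
_RENAME_HINT = re.compile(
    r"set your (?:diff|merge)\.renameLimit variable to at least (\d+)"
)


def suggested_rename_limit(stderr_text):
    """Return the largest limit git suggested in stderr_text, or None."""
    limits = [int(m.group(1)) for m in _RENAME_HINT.finditer(stderr_text)]
    return max(limits) if limits else None


sample = (
    "warning: inexact rename detection was skipped due to too many files.\n"
    "warning: you may want to set your diff.renameLimit variable "
    "to at least 1569 and retry the command.\n"
)
print(suggested_rename_limit(sample))
```

A caller could log the returned value, or use it to decide whether a second run with a higher limit is worth the time.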

jgbarah, Mar 16 '16 08:03

We can pass diff.renameLimit to the command. Something like this:

git -c diff.renameLimit=99999 log --raw --numstat --pretty=fuller --decorate=full --all --reverse --topo-order --parents -M -C -c --remotes=origin

After searching for this case around the internet, I found that the algorithm git uses to detect renames can be quadratic. In most cases that does no harm, but for big repositories such as Linux it could be a problem.
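
To give a feeling for what "quadratic" means here, this is a toy model only, not git's actual implementation: it assumes every candidate source file is compared against every candidate destination file, so the work grows with the product of the two counts.

```python
def candidate_pairs(num_sources, num_destinations):
    """Number of (source, destination) pairs an all-pairs comparison examines."""
    return num_sources * num_destinations


# Doubling both sides multiplies the number of pairs by four.
for n in (100, 1000, 2000):
    print(n, candidate_pairs(n, n))
```

Under this assumption, a commit that touches a couple of thousand files already means millions of pair comparisons, which is why git puts a limit on it.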

Maybe you can check how long git takes to generate the log of Linux when diff.renameLimit is set. You can make the change here:

https://github.com/grimoirelab/perceval/blob/master/perceval/backends/git.py#L635

With the results, we can decide whether to set this parameter by default or leave it as an option for the user.
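
As a rough sketch of how the limit could be made optional, the command could be assembled like this. The names `GIT_LOG_OPTS` and `build_git_log_cmd` are hypothetical and are not taken from Perceval's git.py; only the git options themselves come from the command above.

```python
# Options copied from the git log command shown in this issue.
GIT_LOG_OPTS = [
    "--raw", "--numstat", "--pretty=fuller", "--decorate=full",
    "--all", "--reverse", "--topo-order", "--parents",
    "-M", "-C", "-c", "--remotes=origin",
]


def build_git_log_cmd(rename_limit=None):
    """Build the argument list for git log, optionally raising diff.renameLimit."""
    cmd = ["git"]
    if rename_limit is not None:
        # `git -c name=value` overrides a config value for this invocation only.
        cmd += ["-c", "diff.renameLimit=%d" % rename_limit]
    cmd.append("log")
    cmd += GIT_LOG_OPTS
    return cmd


print(" ".join(build_git_log_cmd()))
print(" ".join(build_git_log_cmd(rename_limit=99999)))
```

With `rename_limit=None` the command is the same one Perceval runs today, so the default behaviour would not change.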

sduenas, Mar 16 '16 12:03

I'm testing this. It seems that using diff.renameLimit=99999 takes much longer. I hope to have numbers by tomorrow.

jgbarah, Mar 16 '16 23:03

Using git log defaults (I guess that's diff.renameLimit = 1000):

$ time git log --raw --numstat --pretty=fuller --decorate=full --all --reverse --topo-order --parents \
  -M -C -c --remotes=origin > linux.gitlog
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 1569 and retry the command.

real    111m53.240s
user    110m55.348s
sys     0m47.976s

Raising the limit to 99999:

$ time git -c diff.renameLimit=99999 log --raw --numstat --pretty=fuller --decorate=full --all --reverse \
  --topo-order --parents -M -C -c --remotes=origin > linux.gitlog

real    1187m51.274s
user    1134m56.972s
sys     5m1.344s

And disabling analysis of renaming:

$ time git -c diff.renames=0 log --raw --numstat --pretty=fuller --decorate=full --all --reverse --topo-order --parents -M -C -c --remotes=origin > linux.gitlog-0
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 1569 and retry the command.

real    125m46.317s
user    124m35.332s
sys     1m2.008s

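Therefore, it seems that the default values are the best option for Perceval. Raising the limit to 99999 makes things an order of magnitude slower for the Linux kernel. Setting diff.renames=0 doesn't improve the situation either, although that run still passed -M -C and still printed the rename warning, so rename detection was probably not disabled in practice.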
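
For reference, the ratios between the three runs can be worked out from the "real" times reported above. This is just a small helper for the arithmetic; the function name `to_seconds` is made up for this sketch.

```python
import re


def to_seconds(stamp):
    """Convert a `time` stamp such as '111m53.240s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError("unexpected time format: %r" % stamp)
    return int(m.group(1)) * 60 + float(m.group(2))


# "real" times copied from the three runs in this comment.
real = {
    "default": "111m53.240s",
    "renameLimit=99999": "1187m51.274s",
    "renames=0": "125m46.317s",
}

base = to_seconds(real["default"])
for name, stamp in real.items():
    print("%-18s %6.2fx" % (name, to_seconds(stamp) / base))
```

The run with the raised limit comes out at roughly 10.6 times the default, and the diff.renames=0 run at roughly 1.12 times.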

However, this could be exposed as a parameter to Perceval's git backend, for users who want a full rename analysis rather than the default. If you want, I can produce a PR with a proposal for this change.

jgbarah, Mar 17 '16 19:03

Ok, that makes sense. Go ahead with the PR.

sduenas, Mar 18 '16 11:03

Closing this due to inactivity. If someone wants to create a PR to add an option to fix this issue, we will reopen it.

sduenas, Oct 11 '23 15:10