repo_info_extractor
Filter git commit log by author email
When extracting info from git repositories, the full git log is currently processed in chunks of 1000 commits, using commands similar to this one:
git log --numstat --all --skip=14000 --max-count=1000 --pretty=format:|||BEGIN|||%H|||SEP|||%an|||SEP|||%ae|||SEP|||%ad --no-merges
(coming from this code)
Since the list of commit author email addresses to care about is known in advance, and passed on the command line, I expect it would be possible to ask git log to retrieve only commits by the matching author email addresses, like this:
git log --numstat --all --author=<email1> --author=<email2> --pretty=format:|||BEGIN|||%H|||SEP|||%an|||SEP|||%ae|||SEP|||%ad --no-merges
I expect that would be more efficient, since only the commits authored by the relevant email addresses would be processed instead of every commit. It might be especially useful for large repos, which are currently skipped if there are more than 20000 commits.
As a reference point, retrieving all commits belonging to a single email address in a git repository containing 697983 commits takes only 7 seconds with this approach.
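To make the proposal more concrete, here is a minimal sketch of how the filtered command could be assembled. The function names (build_log_command, filtered_log) are hypothetical and not taken from the project's code. It also makes two assumptions beyond what is written above: it passes --fixed-strings so the addresses are not interpreted as regular expressions, and it wraps each address in angle brackets so that one address does not match as a substring of another.

```python
import subprocess
from typing import Iterable, List

# Same pretty format as the command quoted in this issue.
PRETTY = "|||BEGIN|||%H|||SEP|||%an|||SEP|||%ae|||SEP|||%ad"


def build_log_command(emails: Iterable[str]) -> List[str]:
    """Build a `git log` argument list restricted to the given author emails.

    Hypothetical helper: git ORs multiple --author options together, so a
    commit is listed if it matches any one of the addresses.
    """
    emails = list(emails)
    if not emails:
        # Without any --author option git would return every commit,
        # which defeats the purpose of the filter.
        raise ValueError("at least one author email is required")

    cmd = [
        "git", "log", "--numstat", "--all", "--no-merges",
        "--fixed-strings",  # assumption: treat addresses literally, not as regexes
        f"--pretty=format:{PRETTY}",
    ]
    for email in emails:
        # Assumption: matching "<addr>" avoids a@b.com also matching aa@b.com.
        cmd.append(f"--author=<{email}>")
    return cmd


def filtered_log(repo_path: str, emails: Iterable[str]) -> str:
    """Run the filtered `git log` in repo_path and return its raw output."""
    result = subprocess.run(
        build_log_command(emails),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list (rather than a shell string) also sidesteps quoting of the | characters in the format string. Note that this matching is case-sensitive, so addresses that differ only in case would need separate handling.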
It might interfere with email similarity matching, though that could be handled by first listing all author emails with git log --pretty="format:%ae" --no-merges | sort -u or equivalent, and then running similarity matching on that list (in case it does not work like that already).
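The email listing step could look roughly like the sketch below. The function names are hypothetical, and adding --all (to mirror the main extraction command) is my assumption. The similarity matching itself is left out, since I don't know how the tool implements it; the resulting list would just be its input.

```python
import subprocess
from typing import List


def parse_author_emails(log_output: str) -> List[str]:
    """Return the unique, sorted author emails from `git log --pretty=format:%ae` output."""
    return sorted({line.strip() for line in log_output.splitlines() if line.strip()})


def list_author_emails(repo_path: str) -> List[str]:
    """List every distinct author email in the repository (hypothetical helper)."""
    result = subprocess.run(
        # Assumption: --all is added so the list covers the same commits
        # as the main extraction command.
        ["git", "log", "--all", "--no-merges", "--pretty=format:%ae"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_author_emails(result.stdout)
```

The addresses selected by similarity matching from this list could then be passed as the --author filters for the main extraction.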
I'm opening this issue to learn if it is an interesting direction in the first place, before anyone invests effort into implementing this.