ggshield icon indicating copy to clipboard operation
ggshield copied to clipboard

Improve scan repo perfs

Open pierrelalanne opened this issue 2 years ago • 1 comments

Context

The ggshield secret scan repo command used to make one API call per commit. This implementation was not efficient and could result in thousands of call for important repos. In most cases, one commit contains modifications to only a couple of files. This implies a lot of very small API calls.

We propose an implementation to group commits together as long as the total number of files sent to the API is below a fixed limit (20 at the time of writing).

The PR

Part 1 - Improve scan repo logic

The implementation modifies scan_commit_range as follows : we create an intermediary steps where commits are iterated over and grouped to form batches of at most 20 files. These are ultimately sent to the API using the Files.scan interface.
We also adapted the results parsing logic : now that one API call contains files from several commits, we have to re-create the link between the results and the commit they come from.

One small drawback is that we lose this commit granularity when it comes to errors : we cannot tell which commit triggered which error.

Part 2 - Improve progress bars

While working on this PR, we experimented broken progress bars. Especially when it comes to scanning very big repos. We therefore propose several improvements to our progress bars :

  • We add rich as a dependency. It provides a nice and smooth management of progress bars. We may also use it in the future for nice output displays.
  • We introduce a callback logic in the Scanner.scan method : this is in order to be able to update progress bars whenever a batch of content is sent to the API.
  • We use this logic in most of our cmds : pypi, docker, repo, archive, path.
  • Note that the progress bar is sent to stderr not to interact with other outputs, json for instance.

Here is an example of results :
image

pierrelalanne avatar Sep 08 '22 09:09 pierrelalanne

Codecov Report

Merging #352 (6781384) into main (b265885) will decrease coverage by 0.12%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #352      +/-   ##
==========================================
- Coverage   94.31%   94.19%   -0.13%     
==========================================
  Files          75       75              
  Lines        3078     3100      +22     
==========================================
+ Hits         2903     2920      +17     
- Misses        175      180       +5     
Flag Coverage Δ
unittests 94.19% <100.00%> (-0.13%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
ggshield/scan/repo.py 100.00% <100.00%> (ø)
ggshield/core/git_shell.py 80.00% <0.00%> (-4.22%) :arrow_down:
ggshield/core/utils.py 92.30% <0.00%> (-0.44%) :arrow_down:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov-commenter avatar Sep 08 '22 13:09 codecov-commenter

I integrated your remarks, thanks for the review.

pierrelalanne avatar Sep 27 '22 14:09 pierrelalanne

Thanks for the thorough review.

pierrelalanne avatar Sep 27 '22 16:09 pierrelalanne