ggshield
ggshield copied to clipboard
Improve scan repo perfs
Context
The ggshield secret scan repo
command used to make one API call per commit. This implementation was not efficient and could result in thousands of call for important repos. In most cases, one commit contains modifications to only a couple of files. This implies a lot of very small API calls.
We propose an implementation to group commits together as long as the total number of files sent to the API is below a fixed limit (20 at the time of writing).
The PR
Part 1 - Improve scan repo logic
The implementation modifies scan_commit_range
as follows : we create an intermediary steps where commits are iterated over and grouped to form batches of at most 20 files. These are ultimately sent to the API using the Files.scan
interface.
We also adapted the results parsing logic : now that one API call contains files from several commits, we have to re-create the link between the results and the commit they come from.
One small drawback is that we lose this commit granularity when it comes to errors : we cannot tell which commit triggered which error.
Part 2 - Improve progress bars
While working on this PR, we experimented broken progress bars. Especially when it comes to scanning very big repos. We therefore propose several improvements to our progress bars :
- We add
rich
as a dependency. It provides a nice and smooth management of progress bars. We may also use it in the future for nice output displays. - We introduce a callback logic in the
Scanner.scan
method : this is in order to be able to update progress bars whenever a batch of content is sent to the API. - We use this logic in most of our cmds :
pypi
,docker
,repo
,archive
,path
. - Note that the progress bar is sent to stderr not to interact with other outputs, json for instance.
Here is an example of results :
Codecov Report
Merging #352 (6781384) into main (b265885) will decrease coverage by
0.12%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #352 +/- ##
==========================================
- Coverage 94.31% 94.19% -0.13%
==========================================
Files 75 75
Lines 3078 3100 +22
==========================================
+ Hits 2903 2920 +17
- Misses 175 180 +5
Flag | Coverage Δ | |
---|---|---|
unittests | 94.19% <100.00%> (-0.13%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Impacted Files | Coverage Δ | |
---|---|---|
ggshield/scan/repo.py | 100.00% <100.00%> (ø) |
|
ggshield/core/git_shell.py | 80.00% <0.00%> (-4.22%) |
:arrow_down: |
ggshield/core/utils.py | 92.30% <0.00%> (-0.44%) |
:arrow_down: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
I integrated your remarks, thanks for the review.
Thanks for the thorough review.