git-sizer

How are the thresholds for the levels of concern determined and/or updated?

Open dscho opened this issue 1 year ago • 1 comments

Over time, not only do repositories grow, but so does Git's (and hardware's) ability to cope with larger repositories.

I tried to figure out how the thresholds used by git-sizer were determined, so that I might have a chance to re-run experiments and see which thresholds, if any, could be raised.

But all I found was https://github.com/github/git-sizer/commit/6283777c16c03536de9d5e2ada6f94ad4be735e4, and I could not work out where those numbers came from or which experiments could be run to validate or adjust them.

Any hints?

dscho, Oct 07 '24 11:10

I am also very interested in this; in our case, especially the directory count. It would be awesome to document why the thresholds were selected and what the expected impact of exceeding them is.

pettermahlen, Oct 09 '24 06:10

The thresholds came from measurements of large but well-organized and functional public repositories plus our experience hosting hundreds of millions of repositories, some of them enormous and/or not well-organized. Some of the numbers have a direct and critical effect on Git's (or other tools') ability to work with a repository; others are just plausible "reasonable" values. The thresholds are meant to be a way of expressing to Git users dimensions that seem OK versus ones that seem questionable or downright problematic. We use this tool all the time to help diagnose problems with repositories.

We could consider updating the numbers over time due to Moore's law or due to technical improvements in Git, but that would change how existing repositories are presented, ruining people's intuition for the "level of concern" asterisk patterns. Alternatively, you could gradually adjust your own intuition for how many asterisks are problematic, given your own system and constraints and the patience and expectations of your collaborators.

Specifically regarding directory count (by which I assume that you mean "Biggest checkouts → Number of directories"), that usually does not have a dramatic effect on hosting Git repositories. It has more of an effect on developers, as their checkouts, git status, git grep, branch changes, CI, etc. get slower.

mhagger, Jan 02 '25 17:01

Thank you for the explanation. To be honest, I had hoped for a more repeatable, scientifically rigorous method that we could now use to update the thresholds to reflect the improvements that have been made in technology in general as well as in Git. In any case, it is good to know for certain how those thresholds were determined, so once again: thank you for patiently explaining!

dscho, Jan 04 '25 11:01