github-readme-stats
Switch to data-driven percentiles
This PR modifies the ranking system to use the actual quantiles (see #2857) of the statistics. This leads to significantly different, but generally higher, ranks for users. In particular, it is now impossible to be at C rank (minimum is C+). An alternative to lower the ranks would be to filter out inactive users before computing the quantiles.
What do you think @rickstaa, @qwerty541?
@francois-rozet is attempting to deploy a commit to the github readme stats Team on Vercel.
A member of the Team first needs to authorize it.
Codecov Report
Attention: 1 line in your changes is missing coverage. Please review.
Comparison is base (b56689b) 97.62% compared to head (091e304) 97.63%. Report is 337 commits behind head on master.
:exclamation: Current head 091e304 differs from pull request most recent head 2cc3e54. Consider uploading reports for the commit 2cc3e54 to get more accurate results.
Files | Patch % | Lines |
---|---|---|
src/calculateRank.js | 99.06% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #2962 +/- ##
==========================================
+ Coverage 97.62% 97.63% +0.01%
==========================================
Files 24 24
Lines 5182 5249 +67
Branches 460 463 +3
==========================================
+ Hits 5059 5125 +66
- Misses 122 123 +1
Partials 1 1
:umbrella: View full report in Codecov by Sentry.
I don't have much idea about any of these ranking algorithms.
But whatever you folks think is a better system, you can go ahead with it, but:
- Make sure to do a perf test once (a basic test with the performance.now() API) to ensure API responses don't get affected, since now we have those lookup arrays (a rough timing sketch follows below)
- Document the changes properly, because external people might have no idea why their ranks changed, and in that case you can refer them to the docs
@rickstaa @francois-rozet @qwerty541
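A minimal sketch of the kind of timing check meant above, using performance.now() from Node's perf_hooks. Here rankFn and sampleStats are placeholders, not the repository's actual calculateRank signature; swap in the real function from src/calculateRank.js when measuring the PR branch.

// Rough timing sketch with performance.now(); rankFn stands in for the real
// rank calculation and should be replaced with the actual import.
const { performance } = require("perf_hooks");

const rankFn = (stats) => stats.commits + stats.stars; // placeholder scoring only
const sampleStats = { commits: 1200, stars: 300 }; // hypothetical input shape

const runs = 10000;
const start = performance.now();
for (let i = 0; i < runs; i++) {
  rankFn(sampleStats);
}
const elapsed = performance.now() - start;
console.log(`${runs} calls took ${elapsed.toFixed(2)} ms (${((elapsed / runs) * 1000).toFixed(2)} µs per call)`);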
@anuraghazra, @francois-rozet I didn't have time yet to review this, but I will review when I finish my master's thesis. Nevertheless, maybe we should not add the QUANTILES themselves to the code 🤔. Sticking with the weights and means already in the codebase is better.
https://github.com/anuraghazra/github-readme-stats/blob/24ac78bdb0725ebcb099cbea2fedbc258bdec15a/src/calculateRank.js#L35-L54
This prevents any performance problems from occurring, and if the code contains a comment that links to a RANK_CALC.md Markdown doc file in which the calculation and quantiles are explained, I think that is enough. What do you think?
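For illustration, a minimal sketch (not the repository's actual code; the names and numbers below are made up) of the "means and weights" style of scoring referred to here, where each statistic is scored by a CDF parameterized only by an offline-derived mean, so no quantile tables need to ship with the code:

// Hypothetical sketch: score a statistic with an exponential CDF whose mean was
// derived offline from the quantiles, instead of shipping a quantile lookup table.
// A code comment could point to the proposed RANK_CALC.md for the derivation.
const exponentialCdf = (x, mean) => 1 - Math.exp(-x / mean);

const COMMITS_MEAN = 250; // illustrative value only
const COMMITS_WEIGHT = 2; // illustrative value only

// Weighted percentile estimate for a user with 1200 commits.
const commitScore = COMMITS_WEIGHT * exponentialCdf(1200, COMMITS_MEAN);
console.log(commitScore.toFixed(3));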
The ranking was improved since many users pointed out that it took a lot of work to get a rank other than A+ 😅. There were about three issues/discussions opened by people who were wondering what happened to their rank, but after I explained it to them and showed them that ranking up is now more accessible for the first ranks, no new issues were opened. However, we can document this change and the reasoning in a RANK_CALC.md doc file 👍🏻.
To be sure, @rickstaa, you think that we should stick to the current system because it is simpler, even though it is a rough approximation of the true quantiles?
Hey @francois-rozet, sorry for being unclear. My suggestion was that we can perform the calculation based on the quantiles offline and then just store the mean and weights in the code. This will prevent any performance issues and keep the code clean. Then, we can store the calculation and quantiles somewhere and refer to them in a code comment 🤔.
You cannot get the true percentiles just based on the median. Currently, we assume that the statistics follow exponential/log-normal distributions, which implicitly fixes the quantiles with respect to the median/mean. The percentiles we get are basically educated guesses.
If we want the true percentiles, we need the true quantiles.
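To make this concrete (an illustrative statistical fact, not code or data from this PR): if a statistic is assumed to follow an exponential distribution with mean $\mu$, its CDF is $F(x) = 1 - e^{-x/\mu}$, so the $p$-th quantile is $q_p = -\mu \ln(1 - p)$ and the median is $\mu \ln 2$. Fixing the mean (or the median) therefore fixes every other quantile at once, which is why the percentiles obtained this way can only be educated guesses when the real data deviate from the assumed distribution.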
I have opened pull request #3141, which contains a base implementation of performance tests. After merging it and updating this branch with master, we can check how the quantiles affect performance. But personally I do not expect much change, since there are no iterations through these quantile arrays; they are accessed everywhere by index.
Actually, there are iterations through the arrays with the .findIndex() calls. But this can be improved from $O(n)$ to $O(\log n)$ by using a bisection search. I can implement that.
@francois-rozet Yes, you're right, I didn't notice the .findIndex() call on line 2 at first glance.
https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R2
I only noticed a call of the same function on line 136, but since this THRESHOLDS array contains only 9 elements and, unlike in the score() function, it was called only once, I didn't attach much importance to it.
https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R136
It would be great if you could implement that.
I hope my code below will help.
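// Bisection (binary) search over an ascending array: returns the index of the
// largest element strictly less than `target`, or -1 if every element is >= target.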
function findIndex(arr, target) {
  let left = 0;
  let right = arr.length - 1;
  let result = -1;
  while (left <= right) {
    const mid = Math.floor((left + right) / 2);
    if (arr[mid] < target) {
      result = mid; // Update the result and continue searching in the right half
      left = mid + 1;
    } else {
      right = mid - 1; // Search in the left half
    }
  }
  return result;
}
const sortedArray = [
0, 0, 0, 0, 2, 4, 8, 11, 15, 19, 23, 27, 32, 36, 41, 45, 50, 55, 60, 65, 71,
76, 82, 87, 93, 99, 105, 111, 117, 124, 131, 137, 145, 151, 159, 166, 174,
182, 190, 198, 207, 215, 225, 234, 244, 253, 264, 274, 285, 296, 306, 318,
330, 342, 355, 368, 382, 396, 409, 424, 440, 457, 475, 493, 512, 531, 551,
570, 593, 618, 643, 667, 695, 723, 752, 784, 815, 857, 893, 934, 984, 1037,
1094, 1152, 1217, 1289, 1379, 1475, 1576, 1696, 1851, 2023, 2232, 2480,
2835, 3242, 3885, 4868, 6614, 11801, 792319,
];
const targetValue = 2333;
const lowerIndex = findIndex(sortedArray, targetValue);
if (lowerIndex !== -1) {
  console.log(`Index of the last value lower than ${targetValue} is: ${lowerIndex}. The value is: ${sortedArray[lowerIndex]}`);
} else {
  console.log(`No value lower than ${targetValue} found.`);
}
@francois-rozet, @anuraghazra just a small heads-up that I just merged @qwerty541's pull request, which allows us to safeguard against performance issues.
Hello @anuraghazra, @rickstaa, @qwerty541, I have (finally) replaced the linear-time search with a log-time bisection search. However, thinking back, I am now a bit skeptical about this PR. It would be great if the percentiles were purely "data-driven", but I feel like it is not worth the increase in code complexity (and code and data maintenance). Users seem to be quite happy with the current ranking system, which is easy to understand and maintain, even if it is slightly off.
I believe #2637 is a much more important issue for the project.