github-readme-stats icon indicating copy to clipboard operation
github-readme-stats copied to clipboard

Switch to data-driven percentiles

Open francois-rozet opened this issue 1 year ago • 12 comments

This PR modifies the ranking system to use the actual quantiles (see #2857) of the statistics. This leads to significantly different, but generally higher, ranks for users. In particular, it is now impossible to be at C rank (minimum is C+). An alternative to lower the ranks would be to filter out inactive users before computing the quantiles.

What do you think @rickstaa, @qwerty541?

francois-rozet avatar Jul 21 '23 13:07 francois-rozet

@francois-rozet is attempting to deploy a commit to the github readme stats Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Jul 21 '23 13:07 vercel[bot]

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (b56689b) 97.62% compared to head (091e304) 97.63%. Report is 337 commits behind head on master.

:exclamation: Current head 091e304 differs from pull request most recent head 2cc3e54. Consider uploading reports for the commit 2cc3e54 to get more accurate results

Files Patch % Lines
src/calculateRank.js 99.06% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2962      +/-   ##
==========================================
+ Coverage   97.62%   97.63%   +0.01%     
==========================================
  Files          24       24              
  Lines        5182     5249      +67     
  Branches      460      463       +3     
==========================================
+ Hits         5059     5125      +66     
- Misses        122      123       +1     
  Partials        1        1              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jul 21 '23 13:07 codecov[bot]

Don't have much idea about any of these ranking algorithm.

But whatever you folks think is a better system you can go ahead with but:

  • Make sure to do a perf test once (basic test with performance.now()API) to ensure API responses don't get effected since now we have those lookup arrays
  • Document the changes properly because external people might have no idea why their ranks changed and in that case you can refer them to the docs

@rickstaa @francois-rozet @qwerty541

anuraghazra avatar Aug 17 '23 17:08 anuraghazra

Don't have much idea about any of these ranking algorithm.

But whatever you folks think is a better system you can go ahead with but:

  • Make sure to do a perf test once (basic test with performance.now()API) to ensure API responses don't get effected since now we have those lookup arrays
  • Document the changes properly because external people might have no idea why their ranks changed and in that case you can refer them to the docs

@rickstaa @francois-rozet @qwerty541

@anuraghazra, @francois-rozet I didn't have time yet to review this, but I will review when I finish my master's thesis. Neverteless, maybe we should not add the QUANTILES themselves in the code 🤔. Sticking the weights and means already in the codebase is better.

https://github.com/anuraghazra/github-readme-stats/blob/24ac78bdb0725ebcb099cbea2fedbc258bdec15a/src/calculateRank.js#L35-L54

This prevents any performance problems from occurring, and if the code contains a comment that links to a RANK_CALC.md Markdown doc file in which the calculation and quantiles are explained, I think that is enough. What do you think?

The ranking was improved since many users pointed out that it took a lot of work to get a rank other than A+ 😅. There were about three issues/discussions opened by people who were wondering what happened to their rank, but after I explained it to them and showed them that ranking up is now more accessible for the first ranks, no new issues were opened. However, we can document this change and the reasoning in a RANK_CALC.md doc file 👍🏻 .

rickstaa avatar Aug 17 '23 18:08 rickstaa

To be sure, @rickstaa, you think that we should stick to the current system because it is simpler, even though it is a rough approximation of the true quantiles?

francois-rozet avatar Aug 18 '23 15:08 francois-rozet

To be sure, @rickstaa, you think that we should stick to the current system because it is simpler, even though it is a rough approximation of the true quantiles?

Hey @francois-rozet sorry for being unclear. My suggestion was that we can perform the calculation based on the quantiles offline and then just store the mean and weights in the code. This will prevent any performance issues and keeps the code clean. Then, We can store the calculation and quantiles somewhere and refer to them in a code comment 🤔.

rickstaa avatar Aug 18 '23 15:08 rickstaa

You cannot get the true percentiles just based on the median. Currently, we assume that the statistics follow exponential/log-normal distributions, which implicitly fixes the quantiles with respect to the median/mean. The percentiles we get are basically educated guesses.

If we want the true percentiles, we need the true quantiles.

francois-rozet avatar Aug 18 '23 15:08 francois-rozet

I have opened pull request #3141 which contains performance tests base implementation. After merging this one and pulling this branch with master we can check how quantiles affect on performance. But personally i do not expect much changes since there are no iterations through these quantiles arrays, it accessed everywhere by indexes.

qwerty541 avatar Aug 26 '23 05:08 qwerty541

Actually, there are iterations through the arrays with the .findIndex() calls. But this can be improved from $O(n)$ to $O(\log n)$ by using a bisection search. I can implement that.

francois-rozet avatar Aug 26 '23 11:08 francois-rozet

@francois-rozet Yes, you're right, i didn't noticed .findIndex() call on 2 line at first look.

https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R2

I only noticed a call of the same function on line 136, but since this THRESHOLDS array contains only 9 elements and unlike from score() function it was called only once, I didn't attach much importance to it.

https://github.com/anuraghazra/github-readme-stats/pull/2962/files#diff-d6a36eb9dcbdb8c6168a754251ba0c9318c7348ac6b17001cea3ba5aa84af728R136

It would be great if you implement that.

I hope my code below will help.

function findIndex(arr, target) {
  let left = 0;
  let right = arr.length - 1;
  let result = -1;

  while (left <= right) {
    const mid = Math.floor((left + right) / 2);

    if (arr[mid] < target) {
      result = mid; // Update the result and continue searching on the right half
      left = mid + 1;
    } else {
      right = mid - 1; // Search on the left half
    }
  }

  return result;
}

const sortedArray =  [
    0, 0, 0, 0, 2, 4, 8, 11, 15, 19, 23, 27, 32, 36, 41, 45, 50, 55, 60, 65, 71,
    76, 82, 87, 93, 99, 105, 111, 117, 124, 131, 137, 145, 151, 159, 166, 174,
    182, 190, 198, 207, 215, 225, 234, 244, 253, 264, 274, 285, 296, 306, 318,
    330, 342, 355, 368, 382, 396, 409, 424, 440, 457, 475, 493, 512, 531, 551,
    570, 593, 618, 643, 667, 695, 723, 752, 784, 815, 857, 893, 934, 984, 1037,
    1094, 1152, 1217, 1289, 1379, 1475, 1576, 1696, 1851, 2023, 2232, 2480,
    2835, 3242, 3885, 4868, 6614, 11801, 792319,
  ];
const targetValue = 2333;
const lowerIndex = findIndex(sortedArray, targetValue);

if (lowerIndex !== -1) {
  console.log(`Index of the first value lower than ${targetValue} is: ${lowerIndex}. The value is: ${sortedArray[lowerIndex]}`);
} else {
  console.log(`No value lower than ${targetValue} found.`);
}

qwerty541 avatar Aug 27 '23 00:08 qwerty541

@francois-rozet, @anuraghazra just a small heads up that I just merged @qwerty541 pull requests which allows us to safeguard against performance issues.

rickstaa avatar Oct 13 '23 09:10 rickstaa

Hello @anuraghazra, @rickstaa, @qwerty541, I have (finally) replaced the linear-time search by a log-time bisection search. However, thinking back, I am now a bit skeptical about this PR. It would great if the percentiles were purely "data-driven", but I feel like it is not worth the increase of code complexity (and code (and data) maintenance). Users seem to be quite happy with the current ranking system, which is easy to understand and maintain, even if it is slightly off.

I believe #2637 is a much more important issue for the project.

francois-rozet avatar Dec 06 '23 13:12 francois-rozet