github-readme-stats [Proposal] New Top Language Detection Algorithm

Is your feature request related to a problem? Please describe.

This is related to an earlier issue.
I am not quite happy with the way Top Languages are calculated (just counting the bytes).

Describe the solution you'd like

The new method is inspired by the h-index.

For a language L,
let x = byte count (the measure currently in use) and y = the number of repositories using L.

x*y = k^2
k = sqrt(xy)

Now we can use this k-metric to rank the top languages and compute their weights. The higher the value of k, the more heavily the language is used.

While this method is an improvement over both byte count and the number of repositories, it is in no way a magic wand.
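
As a rough illustration of the proposal above, here is a minimal Python sketch of the k-metric. It is not the project's implementation; the function names `k_metric` and `language_shares` and the input shape (a mapping from language to a `(bytes, repos)` pair) are assumptions made for this example.

```python
import math


def k_metric(byte_count: float, repo_count: int) -> float:
    """Return k = sqrt(x * y) for one language."""
    return math.sqrt(byte_count * repo_count)


def language_shares(stats: dict) -> dict:
    """Convert {language: (bytes, repos)} into percentage shares based on k."""
    scores = {lang: k_metric(b, r) for lang, (b, r) in stats.items()}
    total = sum(scores.values())
    if total == 0:
        # No code at all: avoid dividing by zero.
        return {lang: 0.0 for lang in scores}
    return {lang: 100 * score / total for lang, score in scores.items()}
```

A language with 100 bytes spread over 4 repositories gets k = sqrt(400) = 20, the same score as a language with 400 bytes in a single repository.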

Describe alternatives you've considered

None at the moment.

Additional context

I have gone through this issue mentioned in the FAQ.

P.S.: This is the first time I am trying to contribute to open source. Please point out if I make a mistake.

Feb 18 '22 19:02 kitswas

Hi @kitswas thanks for the proposal, quite interesting.

What do you think are the downsides of this approach? Can you give any examples? Also, how exactly is this method better than just counting bytes? It would be better to have a comparison.

Feb 19 '22 14:02 anuraghazra

What do you think are the downsides of this approach?

The size of a program's source code for the same task varies from language to language. This metric does not take that into account.
The alternatives (counting bytes, or counting the repositories that use the language) are no better, though.

Also, how exactly is this method better than just counting bytes?

When a language appears in multiple repositories, it means that the developer has some degree of reliance on it. This is an attempt to factor it into the calculation.

The square root in the new metric is used to tone down the percentage differences between languages.
A general form of the index would be

k = x^p * y^q

where p and q are constants which control the influence of x and y on the metric, respectively.
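
A small sketch of this general form, assuming a hypothetical helper named `general_k`, shows how the exponents recover the simpler metrics as special cases: p = q = 0.5 gives sqrt(xy), p = 1 with q = 0 gives the plain byte count, and p = 0 with q = 1 gives the plain repository count.

```python
def general_k(byte_count: float, repo_count: int, p: float = 0.5, q: float = 0.5) -> float:
    """Return k = x^p * y^q, where x is the byte count and y the repository count."""
    return (byte_count ** p) * (repo_count ** q)


# Special cases for a language with 100 bytes across 4 repositories:
sqrt_form = general_k(100, 4)            # p = q = 0.5, i.e. sqrt(100 * 4)
bytes_only = general_k(100, 4, p=1, q=0)  # ignores the repository count
repos_only = general_k(100, 4, p=0, q=1)  # ignores the byte count
```

Choosing p and q anywhere between these extremes shifts the weight between code volume and how widely the language is spread across repositories.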

It would be better to have a comparison.

Here's a theoretical comparison. Comparsion.pdf

This developer has three repositories containing the source code for three websites. He used HTML for all of his websites but JavaScript for only one. Yet, due to the massive size of the JS code, his most used language (according to the old metric) is JavaScript. While there's no doubt he has written a lot of code in JS, HTML is what he relied on the most.
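
The scenario can be sketched with made-up numbers. The byte counts below are hypothetical and are not taken from Comparsion.pdf; they only show how the two metrics can disagree about the top language.

```python
import math

# Hypothetical figures for illustration only (NOT the data from the PDF).
# language -> (total bytes, number of repositories using it)
stats = {
    "HTML": (60_000, 3),
    "JavaScript": (100_000, 1),
}


def top_by_bytes(stats: dict) -> str:
    """Old metric: the language with the largest byte count wins."""
    return max(stats, key=lambda lang: stats[lang][0])


def top_by_k(stats: dict) -> str:
    """Proposed metric: the language with the largest sqrt(bytes * repos) wins."""
    return max(stats, key=lambda lang: math.sqrt(stats[lang][0] * stats[lang][1]))


print(top_by_bytes(stats))  # JavaScript
print(top_by_k(stats))      # HTML
```

With these numbers, k is about 424 for HTML and about 316 for JavaScript, so HTML moves to the top even though JavaScript has more bytes.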

For a real-world comparison, I need data.

Feb 19 '22 16:02 kitswas

Here's how it would affect my current card. I admit, I am quite unhappy with Python hogging everything. I cannot use the exclude_repo option as it removes Python entirely.

Note: Since I did not have the byte counts, I used the percentages for the calculation. The result is unaffected because we are computing a ratio: scaling every x by the same constant scales every k by the same factor, which cancels out. Also note that the final percentages are slightly higher than they should be, as the original percentages for the selected languages add up to only 99.36%.
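
The claim that percentages can stand in for byte counts can be checked with a short sketch. The language names and numbers below are hypothetical, not the values from the card.

```python
import math


def shares(sizes: dict, repos: dict) -> dict:
    """Percentage shares based on k = sqrt(size * repos) for each language."""
    scores = {lang: math.sqrt(sizes[lang] * repos[lang]) for lang in sizes}
    total = sum(scores.values())
    return {lang: 100 * score / total for lang, score in scores.items()}


# Hypothetical inputs for illustration only.
byte_counts = {"Python": 900_000, "C++": 80_000, "Java": 20_000}
repo_counts = {"Python": 2, "C++": 5, "Java": 3}

# Same sizes expressed as percentages of the total byte count.
total_bytes = sum(byte_counts.values())
percentages = {lang: 100 * b / total_bytes for lang, b in byte_counts.items()}

from_bytes = shares(byte_counts, repo_counts)
from_percentages = shares(percentages, repo_counts)

# Both inputs give the same shares, up to floating-point rounding.
same = all(math.isclose(from_bytes[l], from_percentages[l]) for l in byte_counts)
```

Here `same` evaluates to `True`: dividing every byte count by the same total only rescales all k values by one common factor.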

May 03 '22 01:05 kitswas

Just for reference, in case people want to chip in: part of the discussion relating to this issue can be found in #1732.

May 07 '22 10:05 rickstaa

This was implemented in https://github.com/anuraghazra/github-readme-stats/pull/1732.

Sep 11 '23 20:09 rickstaa