Discrepancies between GitHub's `languages.csv` and the githut data source
Context
https://danielzayas.github.io/language_trends is a small page powered by the public https://github.com/danielzayas/language_trends repo.
Data
The `languages.csv` data through 2024 Q3 was sourced from https://github.com/github/innovationgraph/blob/main/data/languages.csv on 2024-03-23 around 11pm PT.
I also created a new issue, https://github.com/github/innovationgraph/issues/47, to ask for more recent data (2024 Q4, 2025 Q1, etc.).
Acknowledgements
- GitHub for publishing the CSV. Y'all should really improve the data visualization on your https://innovationgraph.github.com/global-metrics/programming-languages page though.
- @madnight for creating a beautiful UI under the AGPL 3.0 license at https://github.com/madnight/githut, but sadly the last quarter in the data source is 2024 Q1.
Question
Filtering https://madnight.github.io/githut/#/pushes/2024/1, which is powered by https://github.com/madnight/githut, for "PUSHES" through 2024 Q1 tells a very different story about language trends. For example, consider 2024 Q1: GitHub's `languages.csv` has JavaScript at 18% of pushes, whereas @madnight's data source has JavaScript at 11% of pushes. Why the large discrepancy, @madnight?
@danielzayas my dataset is not based on the CSVs you linked but on Google BigQuery (the public GitHub dataset). In addition, I use a bot filter: https://github.com/madnight/githut/blob/master/scripts/query.js#L62
FROM ${tables} WHERE NOT LOWER(actor.login) LIKE "%bot%") a
Bots like dependabot, which generate a large number of pushes, are not included in my graphs.
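
To make the effect of that filter concrete, here is a minimal Python sketch of the same substring rule. It is an illustration only, not githut's actual pipeline: the helper names `is_bot` and `push_share` and the sample events are hypothetical, and the numbers are made up to show the mechanism rather than to reproduce the 18% vs 11% figures.

```python
from collections import Counter


def is_bot(login: str) -> bool:
    """Mimic LOWER(actor.login) LIKE "%bot%": flag any login containing 'bot'."""
    return "bot" in login.lower()


def push_share(events, exclude_bots):
    """Return each language's share of pushes as a fraction of the total."""
    counts = Counter(
        lang for login, lang in events
        if not (exclude_bots and is_bot(login))
    )
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Hypothetical push events as (actor login, language); not real data.
EVENTS = [
    ("dependabot[bot]", "JavaScript"),
    ("dependabot[bot]", "JavaScript"),
    ("renovate-bot", "JavaScript"),
    ("alice", "JavaScript"),
    ("bob", "Python"),
    ("carol", "Python"),
    ("dave", "Go"),
    ("erin", "Rust"),
]

with_bots = push_share(EVENTS, exclude_bots=False)
without_bots = push_share(EVENTS, exclude_bots=True)
print(f"JavaScript share with bots:    {with_bots['JavaScript']:.0%}")
print(f"JavaScript share without bots: {without_bots['JavaScript']:.0%}")
```

In this toy sample, dropping the three bot pushes moves JavaScript from 50% to 20% of pushes. Note that a plain substring match also flags any human login that happens to contain "bot" (for example, a hypothetical `abbott`), so the filter is a heuristic rather than an exact bot list.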