incubator-devlake
An interesting git import tool by ClickHouse
Alexey from ClickHouse recently showed me a git import tool they developed. It extracts line-level data from git into a table called line_changes, which can be used to compute metrics such as code churn and line age. The tool also runs very fast. We should consider incorporating line-level data in the future as well.
The source code can be found here: https://github.com/ClickHouse/ClickHouse/tree/master/programs/git-import
According to its documentation, it can answer questions like:
- list files with maximum number of authors;
- show me the oldest lines of code in the repository;
- show me the files with longest history;
- list favorite files for author;
- list largest files with lowest number of authors;
- on which weekday written code has the highest chance of staying in the repository;
- the distribution of code age across repository;
- files sorted by average code age;
- quickly show file with blame info (rough);
- commits and lines of code distribution by time; by weekday, by author; for specific subdirectories;
- show history for every subdirectory, file, line of file, the number of changes (lines and commits) across time; how the number of contributors changed across time;
- list files with most modifications;
- list files that were rewritten the most times or by the most authors;
- what percentage of each author's code is removed by other authors;
- the matrix of authors that shows which authors tend to rewrite other authors' code;
- what is the worst time to write code, in the sense that the code has the highest chance of being rewritten;
- the average time before code is rewritten, and the median (half-life of code decay);
- how the comments/code percentage changes over time / by author / by location;
- who tends to write more tests / cpp code / comments.
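To make one of these questions concrete, here is a minimal sketch of "list files with maximum number of authors". It does not use ClickHouse: it loads a few made-up rows into an in-memory SQLite table that stands in for line_changes. The `path` column name and the sample rows are assumptions for illustration only; check the tool's `--help` output for the real schema.

```python
import sqlite3

# Made-up rows standing in for line_changes: (path, author, sign).
# `path` is an assumed column name, not confirmed by the text above.
sample_changes = [
    ("src/a.cpp", "alice", 1),
    ("src/a.cpp", "bob", 1),
    ("src/a.cpp", "carol", -1),
    ("src/b.cpp", "alice", 1),
    ("src/b.cpp", "alice", 1),
]


def files_by_author_count(rows):
    """Return (path, distinct author count), most authors first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE line_changes (path TEXT, author TEXT, sign INTEGER)"
    )
    conn.executemany("INSERT INTO line_changes VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT path, count(DISTINCT author) AS authors "
        "FROM line_changes GROUP BY path ORDER BY authors DESC, path"
    ).fetchall()
    conn.close()
    return result


print(files_by_author_count(sample_changes))
# [('src/a.cpp', 3), ('src/b.cpp', 1)]
```

The same GROUP BY idea should carry over to the real table, though the exact column names and ClickHouse function spellings may differ.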
Below are the instructions for how to use the tool:
You can get it like this:
curl https://clickhouse.com/ | sh
- downloads ClickHouse
./clickhouse git-import --help
- will show the documentation and the usage of the tool.
Then the tool can be run directly inside the git repository.
It will collect data like commits, file changes and changes of every
line in every file for further analysis.
It works well even on largest repositories like Linux or Chromium.
Example of a trivial query:
SELECT author AS k, count() AS c FROM line_changes WHERE
file_extension IN ('h', 'cpp') GROUP BY k ORDER BY c DESC LIMIT 20
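For anyone who wants to try the shape of this query without installing ClickHouse, the sketch below runs an equivalent statement against an in-memory SQLite table filled with made-up rows. Two things differ from the original and are my own adjustments: SQLite needs `count(*)` where ClickHouse accepts `count()`, and I added `k` as a tie-breaker in ORDER BY so the output is deterministic. The function name `top_authors` is hypothetical.

```python
import sqlite3


def top_authors(rows, extensions=("h", "cpp"), limit=20):
    """Count changed lines per author, restricted to the given extensions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE line_changes (author TEXT, file_extension TEXT)")
    conn.executemany("INSERT INTO line_changes VALUES (?, ?)", rows)
    placeholders = ", ".join("?" for _ in extensions)
    query = (
        "SELECT author AS k, count(*) AS c FROM line_changes "
        f"WHERE file_extension IN ({placeholders}) "
        "GROUP BY k ORDER BY c DESC, k LIMIT ?"
    )
    result = conn.execute(query, (*extensions, limit)).fetchall()
    conn.close()
    return result


# Made-up sample data: (author, file_extension), one row per changed line.
sample_lines = [
    ("alice", "cpp"),
    ("alice", "h"),
    ("bob", "cpp"),
    ("bob", "py"),
    ("carol", "md"),
]
print(top_authors(sample_lines))
# [('alice', 2), ('bob', 1)]
```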
Example of a non-trivial query - a matrix of authors, showing how much
code of one author is removed by another:
SELECT k, written_code.c, removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM (
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN (
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC LIMIT 500
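To spell out what this query computes, here is a rough pure-Python sketch of the same logic over a list of dicts. The field names come from the query above; the sample rows, the function name `removal_ratios`, and the choice to sort by lines written are my own assumptions (the query's `ORDER BY c` does not say which `c` it means). Note that Python's `round` rounds halves to the nearest even number, which may differ from ClickHouse in edge cases.

```python
from collections import Counter


def removal_ratios(changes, min_written=1000, extensions=("h", "cpp")):
    """Per author: lines written, lines later removed by someone else, and
    the removed/written percentage. Mirrors the two subqueries and the join."""
    skipped_types = {"Punct", "Empty"}
    written = Counter()
    removed = Counter()
    for change in changes:
        if change["file_extension"] not in extensions:
            continue
        if change["line_type"] in skipped_types:
            continue
        if change["sign"] == 1:
            # A line added by `author`.
            written[change["author"]] += 1
        elif change["sign"] == -1 and change["author"] != change["prev_author"]:
            # A line originally by `prev_author`, removed by a different author.
            removed[change["prev_author"]] += 1
    # INNER JOIN: keep only authors that appear on both sides.
    result = [
        (k, written[k], removed[k], round(removed[k] * 100 / written[k]))
        for k in written
        if k in removed and written[k] > min_written
    ]
    # Assumption: sort by lines written, descending, then by author name.
    result.sort(key=lambda row: (-row[1], row[0]))
    return result


def make_change(author, sign, prev_author="", line_type="Code", ext="cpp"):
    """Build one made-up line_changes row for the example."""
    return {
        "author": author,
        "prev_author": prev_author,
        "sign": sign,
        "line_type": line_type,
        "file_extension": ext,
    }


sample_rows = (
    [make_change("alice", 1) for _ in range(4)]
    + [make_change("bob", 1) for _ in range(2)]
    + [
        make_change("alice", 1, line_type="Punct"),   # ignored: punctuation
        make_change("bob", -1, prev_author="alice"),  # bob removes alice's line
        make_change("alice", -1, prev_author="alice"),  # self-removal, ignored
    ]
)
print(removal_ratios(sample_rows, min_written=1))
# [('alice', 4, 1, 25)]
```

Bob does not appear in the output because nobody else removed his lines, which is the effect of the INNER JOIN in the original query.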
I'll work on it.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.
@xgdyp Do we have progress regarding this issue?
Hi, I'm still working on it. I'll sync my progress on this issue to keep it from being closed. I have just finished extracting the git log, but I'm not yet writing the results into lake-repo.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
Hi, I'm still working on it. I'll sync my progress on this issue to keep it from being closed. I have just finished extracting the git log, but I'm not yet writing the results into lake-repo.
Hi @xgdyp, the issue is about to be closed automatically again. How's the progress? :)
Hello, I have added a code-churn table, and I'm currently implementing some metrics.
@xgdyp Nice, would you create a PR for that first? Then we could take a peek at what is happening 😃
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.