incubator-devlake

An interesting git import tool by ClickHouse

Open hezyin opened this issue 2 years ago • 9 comments

Alexey from ClickHouse recently showed me a git import tool they developed. It extracts the line-level data from git into a table called line_changes, which can be used to compute interesting metrics like code churn, line age, etc. The tool also runs very fast. We should consider incorporating line-level data in the future as well.

The source code can be found here: https://github.com/ClickHouse/ClickHouse/tree/master/programs/git-import

According to its documentation, it can answer questions like:
- list files with maximum number of authors;
- show me the oldest lines of code in the repository;
- show me the files with longest history;
- list favorite files for author;
- list largest files with lowest number of authors;
- at what weekday the code has highest chance to stay in repository;
- the distribution of code age across repository;
- files sorted by average code age;
- quickly show file with blame info (rough);
- commits and lines of code distribution by time; by weekday, by author; for specific subdirectories;
- show history for every subdirectory, file, line of file, the number of changes (lines and commits) across time; how the number of contributors was changed across time;
- list files with most modifications;
- list files that were rewritten most number of time or by most of authors;
- what is percentage of code removal by other authors, across authors;
- the matrix of authors that shows what authors tends to rewrite another authors code;
- what is the worst time to write code in sense that the code has highest chance to be rewritten;
- the average time before code will be rewritten and the median (half-life of code decay);
- comments/code percentage change in time / by author / by location;
- who tend to write more tests / cpp code / comments.
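As a rough illustration of the first question in the list (files with the maximum number of authors), here is a minimal sketch that runs against a toy table in SQLite rather than ClickHouse. The `path` column name and all rows are assumptions made for the example; only `author` and `sign` appear in the queries quoted in this issue.

```python
import sqlite3

# Toy stand-in for the line_changes table produced by git-import.
# The `path` column and every row below are illustrative assumptions.
rows = [
    # (path, author, sign)
    ("src/a.cpp", "alice", 1),
    ("src/a.cpp", "bob", 1),
    ("src/a.cpp", "carol", 1),
    ("src/b.cpp", "alice", 1),
    ("src/b.cpp", "alice", -1),
    ("src/c.h", "bob", 1),
    ("src/c.h", "carol", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_changes (path TEXT, author TEXT, sign INTEGER)")
conn.executemany("INSERT INTO line_changes VALUES (?, ?, ?)", rows)

# Count the distinct authors who touched each file, most authors first.
files_by_authors = conn.execute(
    """
    SELECT path, COUNT(DISTINCT author) AS authors
    FROM line_changes
    GROUP BY path
    ORDER BY authors DESC, path
    """
).fetchall()

for path, authors in files_by_authors:
    print(path, authors)
```

The same `GROUP BY path` shape should carry over to the real table, provided it exposes a per-line file path column.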

Below are the instructions for how to use the tool:

You can get it like this:

curl https://clickhouse.com/ | sh
- downloads ClickHouse

./clickhouse git-import --help
- will show the documentation and the usage of the tool.

Then the tool can be run directly inside the git repository.
It will collect data like commits, file changes and changes of every
line in every file for further analysis.
It works well even on largest repositories like Linux or Chromium.

Example of a trivial query:

SELECT author AS k, count() AS c FROM line_changes WHERE
file_extension IN ('h', 'cpp') GROUP BY k ORDER BY c DESC LIMIT 20
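To see what this query returns without a ClickHouse instance, the sketch below replays it against a hypothetical in-memory SQLite table. The rows are invented, and `count()` is written as `COUNT(*)` for SQLite; the column names `author` and `file_extension` come from the query above.

```python
import sqlite3

# Invented sample rows: (author, file_extension).
rows = [
    ("alice", "cpp"),
    ("alice", "cpp"),
    ("alice", "h"),
    ("bob", "cpp"),
    ("bob", "py"),   # filtered out: not a C++ source or header
    ("carol", "md"),  # filtered out as well
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_changes (author TEXT, file_extension TEXT)")
conn.executemany("INSERT INTO line_changes VALUES (?, ?)", rows)

# Number of changed lines per author, restricted to .h and .cpp files.
top_authors = conn.execute(
    """
    SELECT author AS k, COUNT(*) AS c
    FROM line_changes
    WHERE file_extension IN ('h', 'cpp')
    GROUP BY k
    ORDER BY c DESC
    LIMIT 20
    """
).fetchall()

print(top_authors)
```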

Example of some non-trivial query - a matrix of authors, how much code
of one author is removed by another:

SELECT k, written_code.c, removed_code.c,
    round(removed_code.c * 100 / written_code.c) AS remove_ratio
FROM (
    SELECT author AS k, count() AS c
    FROM line_changes
    WHERE sign = 1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
    GROUP BY k
) AS written_code
INNER JOIN (
    SELECT prev_author AS k, count() AS c
    FROM line_changes
    WHERE sign = -1 AND file_extension IN ('h', 'cpp')
        AND line_type NOT IN ('Punct', 'Empty')
        AND author != prev_author
    GROUP BY k
) AS removed_code USING (k)
WHERE written_code.c > 1000
ORDER BY c DESC LIMIT 500
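The following sketch works through the same written/removed join on a handful of invented rows, using SQLite as a stand-in. Three things differ from the query above and are assumptions of the example: the threshold is lowered from 1000 to 1 so the toy data passes it, the division uses `100.0` because SQLite would otherwise do integer division, and the ordering is spelled out as `removed_code.c` to avoid an ambiguous `c`.

```python
import sqlite3

# Invented rows: (author, prev_author, sign, file_extension, line_type).
# sign = 1 is an added line, sign = -1 a removed line whose previous
# author is prev_author. The 'Code' line type is a placeholder value.
rows = (
    [("alice", "", 1, "cpp", "Code")] * 4      # alice writes 4 lines
    + [("bob", "", 1, "cpp", "Code")] * 4      # bob writes 4 lines
    + [("carol", "", 1, "h", "Code")]          # carol writes 1 line
    + [("bob", "alice", -1, "cpp", "Code")] * 2  # bob removes 2 of alice's lines
    + [("alice", "bob", -1, "cpp", "Code")]      # alice removes 1 of bob's lines
    + [("alice", "alice", -1, "cpp", "Code")]    # self-removal, excluded
    + [("bob", "alice", -1, "cpp", "Empty")]     # empty line, excluded
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE line_changes (author TEXT, prev_author TEXT, "
    "sign INTEGER, file_extension TEXT, line_type TEXT)"
)
conn.executemany("INSERT INTO line_changes VALUES (?, ?, ?, ?, ?)", rows)

remove_ratios = conn.execute(
    """
    SELECT written_code.k, written_code.c, removed_code.c,
           round(removed_code.c * 100.0 / written_code.c) AS remove_ratio
    FROM (
        SELECT author AS k, COUNT(*) AS c
        FROM line_changes
        WHERE sign = 1 AND file_extension IN ('h', 'cpp')
            AND line_type NOT IN ('Punct', 'Empty')
        GROUP BY k
    ) AS written_code
    INNER JOIN (
        SELECT prev_author AS k, COUNT(*) AS c
        FROM line_changes
        WHERE sign = -1 AND file_extension IN ('h', 'cpp')
            AND line_type NOT IN ('Punct', 'Empty')
            AND author != prev_author
        GROUP BY k
    ) AS removed_code USING (k)
    WHERE written_code.c > 1
    ORDER BY removed_code.c DESC
    """
).fetchall()

for author, written, removed, ratio in remove_ratios:
    print(author, written, removed, ratio)
```

In this toy data carol drops out of the result because none of her lines were removed by someone else, which is the effect of the `INNER JOIN`.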

hezyin avatar Apr 28 '22 05:04 hezyin

I'll work on it.

xgdyp avatar Jun 17 '22 08:06 xgdyp

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Jul 20 '22 00:07 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Jul 27 '22 00:07 github-actions[bot]

@xgdyp Do we have progress regarding this issue?

klesh avatar Jul 27 '22 03:07 klesh

Hi, I'm still working on it. I'll post my progress here to keep the issue from being closed. So far I've finished extracting the git log, but I'm not yet writing the results into lake-repo.

xgdyp avatar Jul 27 '22 06:07 xgdyp

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Aug 27 '22 00:08 github-actions[bot]

Hi @xgdyp, the issue is about to be closed automatically again. How's the progress? :)

yumengwang03 avatar Aug 29 '22 06:08 yumengwang03

Hello, I've added a code-churn table, and I'm currently implementing some metrics.

xgdyp avatar Aug 29 '22 06:08 xgdyp

@xgdyp Nice, would you create a PR for that first? Then we could take a peek at what is happening 😃

klesh avatar Aug 29 '22 07:08 klesh

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Sep 30 '22 00:09 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Oct 12 '22 00:10 github-actions[bot]