RepoSense Enable more rigorous authorship detection by default

Current: last person modified is identified as the author

Suggested: Use git features to ignore whitespace changes, code movement etc. by default but give a way to switch to the faster (less accurate) method

Jan 07 '19 12:01 damithc

@yamidark what do you think? I remember I asked you to switch off this feature to make the analysis faster. Can we bring it back?

Jan 07 '19 12:01 damithc

Can we bring it back?

Yes, we can bring this back if we want. Currently, the 'git blame' command we use only ignores whitespace changes, we can track code movement/copied using the -M and -C option in git blame.

This could make the analysis more 'accurate', but would take a (much) larger amount of time depending on the 'level' of code movement we want to track (currently there are 3 levels for the -C option), so we need to decide on this as well (iirc, previously it was at 2).

Jan 07 '19 12:01 yamidark

This could make the analysis more 'accurate', but would take a (much) larger amount of time depending on the 'level' of code movement we want to track (currently there are 3 levels for the -C option), so we need to decide on this as well (iirc, previously it was at 2).

Perhaps we can try different settings on cs2103 dataset to see how much of a time difference it makes?

Jan 07 '19 13:01 damithc

Perhaps we can try different settings on cs2103 dataset to see how much of a time difference it makes?

We can use a smaller set (e.g., 10 teams) and extrapolate.

Jan 07 '19 13:01 damithc

Tested using these config files(config.zip)

No move/copy detection: ~2min30s -M (detect within the file): ~3min30s -M -C (detect between other files within same commit): ~5min -M -C -C (detect between other files in the commit which created the file): ~21min -M -C -C -C (detect by searching all commits): ~52min

Seems setting at least -C -C will take up significantly more time, as it has to search other commits if the line was moved/copied, while up to -C is still relatively fast enough (approximately twice the time) since it only searches within a single commit.

Will also try testing the performance on a much larger repo (Teammates)

Jan 10 '19 17:01 yamidark

Thanks for the update @yamidark Can you roughly quantify the difference in output? i.e., how much of the authorship changes for each setting?

Jan 11 '19 03:01 damithc

Can you roughly quantify the difference in output? i.e., how much of the authorship changes for each setting?

Tested with the same config file given: From no settings to -C -M per level, there was a small~moderate amount of author line contribution changes (about 100++ for 1 author), generally all decreased in this case, since most of the code that are moved/copied were from other contributors of AB4.

The largest amount of authorship line changes occurs from -C -M to -C -C -M(can go up to 1000++ for 1 author), while from -C -C -M to -C -C -C -M it has the least amount of authorship line changes (<100 for 1 author).

Also tested on TEAMMATES repo with -C -C -M, and the analysis took about 5hrs :scream: .

Jan 14 '19 15:01 yamidark

I see.

Which one is taking more time? copy detection or move detection?
Can there be multiple levels of move detection?

Jan 15 '19 00:01 damithc

Which one is taking more time? copy detection or move detection?

Can there be multiple levels of move detection?

Copy and move detection is done together (-M already does both move and copy detection, just only within the file with line changes), and I don't think we can configure it to do either 1 separately. But my guess would be that copy detection is longer, since it would have to check all lines (including untouched ones) in the files, rather than just the removed lines for move detection.

Summary on what each 'level' of copy and move detection can be found in my comment above, more details can be found here on the git blame docs.

Jan 15 '19 03:01 yamidark

For our use case, it is reasonable to not credit the author for moving code but it is also reasonable to credit the author for copied code as the copy could be used for a different purpose. If we cannot do move detection without copy detection, perhaps we should not enable either by default, but allow users to enable it using a flag with the caveat of slow performance. What do you guys think?

Jan 15 '19 04:01 damithc

For our use case, it is reasonable to not credit the author for moving code but it is also reasonable to credit the author for copied code as the copy could be used for a different purpose.

Yes, I agree we should credit the original author for code that is moved, but copied code could be credited to the person who copies.

If we cannot do move detection without copy detection, perhaps we should not enable either by default, but allow users to enable it using a flag with the caveat of slow performance.

Yes, that sounds good to me.

Jan 17 '19 08:01 yamidark