veniq
Scripts for Mining Dataset of RefMiner
- Script that runs RefMiner 2.0 on each folder in parallel
- Code Similarity with the tests
The similarity check follows the approach described in https://link.springer.com/article/10.1007/s11219-019-09442-9:
The similarity between two code smells is based on their text, thanks to this SequenceMatcher, which relies on the Ratcliff and Obershelp’s algorithm, published in 1980, named “gestalt pattern matching.” The main idea of the algorithm is to find the longest contiguous matching subsequence between two compared sequences. We consider two smells as the same if they are from the same smell type (among the 12 studied code smells), and if their similarity degree is greater than 0.7. If one smell of C1 gets a similarity degree greater than 0.7 with two smells of C2, we match it with the one with the highest similarity value.
I just added a Hamming distance check. If the Ratcliff and Obershelp similarity is greater than 0.7, then check whether the Hamming distance is larger than 0.4.
Then the number of matched strings divided by the total number of strings must be greater than 0.7.
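A minimal sketch of how this two-stage check could look, using `difflib.SequenceMatcher` from the Python standard library for the Ratcliff and Obershelp ratio. Several things here are assumptions and not taken from the PR: "strings" are read as lines of code, a line counts as matched if any line of the other fragment passes both checks, and the Hamming check is read as a normalized similarity score (1 minus the share of differing positions, with any length difference counted as differing positions). The function names are hypothetical.

```python
from difflib import SequenceMatcher


def hamming_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - (differing positions / longer length)."""
    length = max(len(a), len(b))
    if length == 0:
        return 1.0
    # Positions past the end of the shorter string count as differing.
    differing = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return 1.0 - differing / length


def lines_match(a: str, b: str,
                ratio_threshold: float = 0.7,
                hamming_threshold: float = 0.4) -> bool:
    """Stage 1: Ratcliff/Obershelp ratio; stage 2: Hamming-based check."""
    if SequenceMatcher(None, a, b).ratio() <= ratio_threshold:
        return False
    return hamming_similarity(a, b) > hamming_threshold


def is_similar(first: list, second: list,
               share_threshold: float = 0.7) -> bool:
    """True if the share of matched lines in `first` exceeds the threshold."""
    if not first:
        return False
    # Assumption: a line is matched if any line in `second` passes both checks.
    matched = sum(any(lines_match(x, y) for y in second) for x in first)
    return matched / len(first) > share_threshold
```

For example, `is_similar(["int a = 1;", "int b = 2;"], ["int a = 1;", "int b = 3;"])` compares each line of the first fragment against the lines of the second and then applies the 0.7 share threshold.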
- Combining the JSON files for all repos into one dataset and filtering EM methods (with tests)
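
The combining step could be sketched roughly as follows. This is an illustration under assumptions, not the script from this PR: it reads "EM" as "Extract Method", assumes one JSON report per repository in a single directory, and assumes each report has a top-level `commits` list whose entries carry a `refactorings` list with a `type` field. The function name `combine_extract_method` is hypothetical.

```python
import json
from pathlib import Path


def combine_extract_method(json_dir, refactoring_type="Extract Method"):
    """Merge per-repo JSON reports, keeping only one refactoring type."""
    dataset = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            report = json.load(f)
        # Assumed layout: {"commits": [{"refactorings": [{"type": ...}]}]}
        for commit in report.get("commits", []):
            for refactoring in commit.get("refactorings", []):
                if refactoring.get("type") == refactoring_type:
                    dataset.append({
                        "repository": commit.get("repository"),
                        "sha1": commit.get("sha1"),
                        "refactoring": refactoring,
                    })
    return dataset
```

The result is a flat list that can be written out with `json.dump` as a single dataset file.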
@lyriccoder please give the PR a proper name and add a description.
@lyriccoder which algorithm have you chosen to measure code similarity?