Scripts for Mining Dataset of RefMiner

Open lyriccoder opened this issue 4 years ago • 2 comments

  1. Script that runs RefMiner 2.0 on each folder in parallel
  2. Code similarity (with tests)

Do something similar to this paper: https://link.springer.com/article/10.1007/s11219-019-09442-9

The similarity between two code smells is based on their text, thanks to this SequenceMatcher, which relies on the Ratcliff and Obershelp’s algorithm, published in 1980, named “gestalt pattern matching.” The main idea of the algorithm is to find the longest contiguous matching subsequence between two compared sequences. We consider two smells as the same if they are from the same smell type (among the 12 studied code smells), and if their similarity degree is greater than 0.7. If one smell of C1 gets a similarity degree greater than 0.7 with two smells of C2, we match it with the one with the highest similarity value.
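The matching rule quoted above can be sketched with Python's standard `difflib.SequenceMatcher`. This is only an illustration, not the script from this issue: the helper names `similarity` and `best_match` are hypothetical, and the "same smell type" check from the paper is left out for brevity.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.7  # similarity cut-off taken from the quoted paragraph


def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio in [0, 1] for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(fragment, candidates, threshold=THRESHOLD):
    """Return (candidate, score) for the most similar candidate whose
    score is strictly above the threshold, or None if there is none."""
    best = None
    for cand in candidates:
        score = similarity(fragment, cand)
        if score > threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best


if __name__ == "__main__":
    print(best_match("int x = 1;", ["int x = 2;", "return foo(bar);"]))
```

When several candidates exceed the threshold, the loop keeps the one with the highest ratio, which mirrors the tie-breaking rule in the quote.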

I also added a Hamming distance check. If the Ratcliff/Obershelp similarity is > 0.7, then check whether the Hamming distance is larger than 0.4. Then the number of matched strings divided by the total number of strings must be > 0.7.
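One possible reading of this combined rule is sketched below. Several points are assumptions, since the comment does not spell them out: "strings" is taken to mean the lines of a code fragment; the Hamming measure is treated as a normalized similarity (share of equal positions) rather than a raw distance; and positions beyond the shorter string count as mismatches. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Ratcliff/Obershelp-style similarity from difflib, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def hamming_similarity(a: str, b: str) -> float:
    """Share of positions holding the same character.

    Assumption: strings of unequal length are compared position by
    position and the extra tail of the longer one counts as mismatches.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    same = sum(x == y for x, y in zip(a, b))
    return same / longest


def lines_match(a, b, ratio_min=0.7, hamming_min=0.4):
    """A pair of lines matches only if both checks pass."""
    return ratio(a, b) > ratio_min and hamming_similarity(a, b) > hamming_min


def fragments_similar(lines_a, lines_b, share_min=0.7):
    """True if the share of lines in lines_a that have a match in
    lines_b is above share_min."""
    if not lines_a:
        return False
    matched = sum(any(lines_match(a, b) for b in lines_b) for a in lines_a)
    return matched / len(lines_a) > share_min
```

For example, `fragments_similar(["int x = 1;", "return x;"], ["int x = 2;", "return x;"])` is true under these assumptions because both lines find a match.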

  3. Combining the JSON files for all repos into one dataset and filtering EM methods (with tests)
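The last step could look roughly like the sketch below. It assumes that "EM" stands for Extract Method, that there is one JSON file per repository in a single directory, and that each file has the shape `{"commits": [{"sha1": ..., "refactorings": [{"type": ...}]}]}`. The function names and the output record layout are hypothetical.

```python
import json
from pathlib import Path


def load_extract_methods(json_path):
    """Yield one record per 'Extract Method' refactoring in a file.

    Assumed layout: {"commits": [{"sha1": ..., "refactorings": [...]}]}.
    Files with a different top-level shape are skipped.
    """
    path = Path(json_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return
    for commit in data.get("commits", []):
        for ref in commit.get("refactorings", []):
            if ref.get("type") == "Extract Method":
                yield {
                    "source_file": path.name,
                    "sha1": commit.get("sha1"),
                    "refactoring": ref,
                }


def combine(json_dir, out_path):
    """Merge all *.json files in json_dir into one dataset file and
    return the number of records written."""
    records = []
    for path in sorted(Path(json_dir).glob("*.json")):
        records.extend(load_extract_methods(path))
    Path(out_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```

Writing the combined file outside `json_dir` keeps it from being picked up as an input on a second run.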

lyriccoder, Nov 23 '20 15:11

@lyriccoder, please give the PR a proper name and add a description.

acheshkov, Nov 26 '20 06:11

@lyriccoder, which algorithm have you chosen to measure code similarity?

acheshkov, Nov 26 '20 06:11