
Bugfix pattern mining

Open vmarkovtsev opened this issue 7 years ago • 1 comments

  • [ ] Add Hercules analysis to classify commits as bugfix and other appropriate classes - https://github.com/src-d/hercules/issues/188
  • [ ] Take the UAST difference algorithm implementation from https://github.com/quinor/sdk/tree/master/uast/diff
  • [ ] Study how it performs on the real files
  • [ ] Add Hercules analysis to extract the structured diffs
  • [ ] Add Hercules analysis to mine bugfix patterns

TODO for @Jan21: add here the list of relevant papers

vmarkovtsev avatar Feb 06 '19 13:02 vmarkovtsev

Copy-paste of https://github.com/src-d/hercules/issues/188#issuecomment-461127952

About classification: labels can be collected from PRs in large repositories, for example

  • https://github.com/tensorflow/tensorflow/labels (the bug-related ones)
  • https://github.com/pytorch/pytorch/labels (the bug-related ones)
  • and so on
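
A minimal sketch of turning such labels into training targets, assuming the PR labels have already been fetched as plain strings. The hint words and the label names in the usage lines are illustrative assumptions, not verified against either repository.

```python
import re

# Hypothetical hint words; a real list would come from reading the label pages.
BUG_HINTS = {"bug", "regression", "crash"}


def label_words(label):
    """Split a label such as 'type:bug' into lowercase words."""
    return [w for w in re.split(r"[^a-z0-9]+", label.lower()) if w]


def classify(labels):
    """Return 'bugfix' if any label contains a bug hint as a whole word."""
    for label in labels:
        if BUG_HINTS.intersection(label_words(label)):
            return "bugfix"
    return "other"


print(classify(["type:bug", "comp:ops"]))
print(classify(["documentation"]))
```

Matching whole words rather than substrings keeps a label like "debugging" from being counted as bug related.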

Pattern mining

I like the idea of pattern mining. I would suggest commit deduplication + community detection (or clustering / topic modeling) instead of classification. Why:

  • it's not clear how many classes there are (50+ labels in each of the repositories mentioned above).
  • there can be a lot of variability even within one class.
  • the number of commits is huge.

deduplication + community detection

The approach is the same as in apollo:

  • extract features: textual, structural, etc -> bag-of-something.
  • fuzzy deduplication + hyperopt (it may require manual labeling, or automatically computing a similarity score and selecting a threshold).
  • connected components + community detection.
  • descriptive statistics of each community / labeling.

After that, it will be possible to query communities and receive their descriptions / labels.
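
A toy sketch of the first three steps, to make the data flow concrete. It assumes whitespace tokens as the "bag-of-something", brute-force weighted Jaccard in place of the approximate hashing needed at real scale, and a fixed threshold in place of the hyperopt-tuned one. All function names are hypothetical, and community detection inside each component is left out.

```python
from collections import Counter
from itertools import combinations


def bag_of_tokens(text):
    """Turn a commit message or diff text into a bag of lowercase tokens."""
    return Counter(text.lower().split())


def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two bags (Counters)."""
    keys = set(a) | set(b)
    union = sum(max(a[k], b[k]) for k in keys)
    if union == 0:
        return 0.0
    return sum(min(a[k], b[k]) for k in keys) / union


def connected_components(n, edges):
    """Group node indices 0..n-1 into components with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def group_commits(texts, threshold=0.5):
    """Link commits whose similarity reaches the threshold, then group them."""
    bags = [bag_of_tokens(t) for t in texts]
    edges = [
        (i, j)
        for i, j in combinations(range(len(bags)), 2)
        if weighted_jaccard(bags[i], bags[j]) >= threshold
    ]
    return connected_components(len(bags), edges)


commits = [
    "fix null pointer in parser",
    "fix null pointer in lexer",
    "add docs for cli",
]
print(group_commits(commits))
```

The pairwise loop is quadratic in the number of commits, so it only works for small samples; the threshold is the value the hyperopt step would select.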

topic modeling

How to: the standard topic modeling pipeline:

  • extract features: textual, structural, etc -> bag-of-something.
  • hierarchical or simple topic modeling using bigartm and vis

After that, it will be possible to query the topics for commits.
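
A small sketch of handing the extracted bags to the topic modeling step. BigARTM documents a Vowpal Wabbit-style text input (one document per line, `|name` switching the modality, an optional `:count` after a token); the modality names "text" and "structure" and the helper names below are assumptions made for illustration only.

```python
import re
from collections import Counter

# Whitespace, ':' and '|' have special meaning in the line format.
_UNSAFE = re.compile(r"[\s:|]+")


def sanitize(token):
    """Replace characters that would break the line format."""
    return _UNSAFE.sub("_", token)


def to_vw_line(commit_id, modalities):
    """Serialize one commit as 'id |modality token:count ...'.

    `modalities` maps a modality name to a Counter of tokens.
    """
    parts = [sanitize(commit_id)]
    for name in sorted(modalities):
        bag = modalities[name]
        if not bag:
            continue  # skip empty modalities entirely
        parts.append("|" + sanitize(name))
        for token, count in sorted(bag.items()):
            token = sanitize(token)
            # A count of 1 is the default, so it can be omitted.
            parts.append(token if count == 1 else f"{token}:{count}")
    return " ".join(parts)


line = to_vw_line(
    "abc123",
    {
        "text": Counter(["fix", "fix", "bug"]),
        "structure": Counter(["If>Call"]),  # hypothetical structural feature
    },
)
print(line)
```

Keeping textual and structural features in separate modalities lets the model weight them differently, and the same bags can feed the deduplication approach unchanged.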

One more point: the first step (feature extraction) can be reused across several different approaches.

EgorBu avatar Feb 06 '19 19:02 EgorBu