Bugfix pattern mining
- [ ] Add Hercules analysis to classify commits as bugfix and other appropriate classes - https://github.com/src-d/hercules/issues/188
- [ ] Take the UAST difference algorithm implementation from https://github.com/quinor/sdk/tree/master/uast/diff
- [ ] Study how it performs on the real files
- [ ] Add Hercules analysis to extract the structured diffs
- [ ] Add Hercules analysis to mine bugfix patterns
TODO for @Jan21: add the list of relevant papers here
Copy-paste of https://github.com/src-d/hercules/issues/188#issuecomment-461127952
About classification: it's possible to collect labels from PRs in large repositories, for example
- https://github.com/tensorflow/tensorflow/labels (the bug-related ones)
- https://github.com/pytorch/pytorch/labels (the bug-related ones)
- and so on
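A minimal sketch of how such PR labels could be turned into coarse classes. The keyword list, the function names and the sample label strings are all assumptions made for illustration; they are not verified against either repository and would need tuning per project.

```python
import re

# Hypothetical keyword list: which label names count as "bug related".
# This is an assumption and should be reviewed per repository.
BUG_LABEL_PATTERN = re.compile(r"\b(bug|bugfix|fix|regression|crash)\b", re.IGNORECASE)


def is_bug_label(label):
    """Return True if a label name looks bug related."""
    return bool(BUG_LABEL_PATTERN.search(label))


def classify_pr(labels):
    """Map the labels of one PR to a coarse class: 'bugfix' or 'other'."""
    return "bugfix" if any(is_bug_label(name) for name in labels) else "other"


# Illustrative label strings only.
print(classify_pr(["type:bug", "comp:runtime"]))
print(classify_pr(["enhancement", "docs"]))
```

Labels collected this way are noisy, so they are probably better suited to a weak training signal or to evaluation than to ground truth.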
Pattern mining
I like the idea of pattern mining.
I would suggest commit deduplication + community detection (or clustering / topic modeling) instead of classification.
Why:
- it's not clear how many classes we have (50+ labels in each of the repositories mentioned above).
- there can be a lot of variability even within one class.
- the number of commits is enormous.
deduplication + community detection
The approach is the same as in apollo:
- extract features: textual, structural, etc -> bag-of-something.
- fuzzy deduplication + hyperopt (this may require manual labeling, or automatically computing a similarity score and selecting a threshold).
- connected components + community detection.
- descriptive statistics of each community / labeling.
After that it will be possible to query the communities and get their descriptions / labels.
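A rough sketch of the middle steps, under simplifying assumptions: the features are plain word bags, similarity is weighted Jaccard, and an exhaustive pairwise comparison stands in for a scalable hashing scheme, so this only works for small samples. The threshold of 0.5 is an arbitrary placeholder (it is the thing hyperopt or manual labeling would choose), and community detection inside each component is left out. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
import re


def bag_of_tokens(text):
    """Bag-of-words stand-in for the real textual / structural features."""
    return Counter(re.findall(r"[A-Za-z_]\w*", text.lower()))


def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two bags (Counters)."""
    keys = set(a) | set(b)
    inter = sum(min(a[k], b[k]) for k in keys)
    union = sum(max(a[k], b[k]) for k in keys)
    return inter / union if union else 0.0


def connected_components(n, edges):
    """Group node indices 0..n-1 into components with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def group_commits(texts, threshold=0.5):
    """Link commits whose similarity reaches the threshold, return components."""
    bags = [bag_of_tokens(t) for t in texts]
    edges = [
        (i, j)
        for i, j in combinations(range(len(bags)), 2)
        if weighted_jaccard(bags[i], bags[j]) >= threshold
    ]
    return connected_components(len(bags), edges)


commits = [
    "fix null pointer check in parser",
    "fix null pointer check in lexer",
    "update readme badges",
]
print(group_commits(commits))
```

Each resulting component can then be split further by a community detection algorithm and summarized with its most frequent tokens.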
topic modeling
How to: the standard topic modeling pipeline:
- extract features: textual, structural, etc -> bag-of-something.
- hierarchical or simple topic modeling using bigartm and vis
After that it will be possible to query the topics of commits.
One more point: the first step (feature extraction) can be reused across several different approaches.
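To illustrate the reuse, here is a small sketch of one feature extractor feeding two consumers. It only covers textual features from the commit message and the changed lines; structural (UAST-based) features are omitted. The function names and the `msg.` / `add.` / `del.` key prefixes are hypothetical.

```python
from collections import Counter
import re


def extract_features(message, added_lines, removed_lines):
    """Build one bag-of-something per commit, with namespaced keys."""
    bag = Counter()
    for tok in re.findall(r"[a-z]+", message.lower()):
        bag["msg." + tok] += 1
    for tok in re.findall(r"[A-Za-z_]\w*", " ".join(added_lines)):
        bag["add." + tok] += 1
    for tok in re.findall(r"[A-Za-z_]\w*", " ".join(removed_lines)):
        bag["del." + tok] += 1
    return bag


def to_token_set(bag):
    """View for set-based fuzzy deduplication."""
    return frozenset(bag)


def to_matrix(bags):
    """View for topic modeling: a vocabulary and dense document-term rows."""
    vocab = sorted(set().union(*bags))
    rows = [[bag[term] for term in vocab] for bag in bags]
    return vocab, rows


bag = extract_features(
    "Fix off-by-one",
    added_lines=["for i in range(n):"],
    removed_lines=["for i in range(n + 1):"],
)
vocab, rows = to_matrix([bag])
print(len(to_token_set(bag)), len(vocab))
```

Keeping the bag as the single intermediate format means the deduplication and topic modeling pipelines can be swapped or compared without re-extracting features.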