bugbug
[model:component] Add sampling techniques to address imbalanced dataset
Resolves #4281.
Investigating and adding sampling techniques (e.g. SMOTE, SMOTEENN, random undersampling) to address the imbalanced dataset of bugs.
Still a WIP. Metrics collected from SMOTE can be found here: metrics.log
Takeaways:
- SMOTE increases the training time to around 2 hours (as opposed to the current 30-40 minute training time)
- The precision and accuracy are extremely low
This is most likely due to the large disparity in the number of bugs across products and components (1000+ vs 20). SMOTE oversamples each minority class until it matches the majority class, so the ratio of synthetic data to real data becomes very large.
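To make the synthetic-to-real ratio concrete, here is a rough sketch of the arithmetic. It assumes every minority class is oversampled up to the majority-class size, as described above. The helper `smote_synthetic_counts` and the component names are hypothetical, and the counts only mirror the 1000 vs 20 example, not the real dataset.

```python
from collections import Counter


def smote_synthetic_counts(labels):
    """Return {class: (real, synthetic)} if every class is raised to the majority size."""
    counts = Counter(labels)
    majority = max(counts.values())
    # Each class needs (majority - n) synthetic samples to reach the majority size.
    return {cls: (n, majority - n) for cls, n in counts.items()}


# Hypothetical labels: one large component and one small one.
labels = ["Core::DOM"] * 1000 + ["Toolkit::Rare"] * 20
plan = smote_synthetic_counts(labels)

for cls, (real, synthetic) in plan.items():
    print(f"{cls}: real={real}, synthetic={synthetic}, ratio={synthetic / real:.1f}")
```

With these numbers the small component ends up with 980 synthetic samples against 20 real ones (49 to 1), so the classifier would mostly be learning from interpolated points for that class.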
@suhaibmujahid Since this PR is mainly experimental and the results turned out to be much worse, I think we can close this. WDYT?