Add features based on file paths in the title and description
Resolves #4269.
Introduces a new feature that extracts file paths mentioned in the title and description of a bug and splits them into sub-paths and individual directories/files.
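To illustrate the splitting described above, here is a minimal sketch. The function name `split_path` and the example path are hypothetical and chosen only for illustration; the actual implementation in this PR may tokenize differently, and this sketch only handles POSIX-style (forward slash) paths.

```python
from pathlib import PurePosixPath


def split_path(path: str) -> list[str]:
    """Return cumulative sub-paths followed by individual directories/files.

    Hypothetical helper, not the PR's actual code.
    """
    # Drop the root marker so absolute and relative paths yield the same tokens.
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    # Cumulative sub-paths: "a", "a/b", "a/b/c.cpp", ...
    sub_paths = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(sub_paths + parts))


tokens = split_path("netwerk/protocol/http/nsHttpChannel.cpp")
for token in tokens:
    print(token)
```

Each resulting token could then be fed to the vectorizer as its own feature value, so that bugs mentioning files in the same directory share at least some tokens.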
Metrics of the newly trained model: metrics.log
Do you see significant improvement when adding this feature?
I've previously attached the metrics of the model here:
Metrics of the newly trained model: metrics.log
Here are the metrics of the original/current model: metrics_original.log
There is a slight improvement (~ +1%) in each of the metrics.
I've converted this PR to a draft, as I realized the extraction of file paths still needs some polishing. For example, there are cases where it may mistake a URL or a numbered step (e.g. 1.Step 1, 2.Step2) for a file path. Once that is done, I'll be sure to add a few tests for this feature!
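One possible way to avoid the two false positives mentioned above is sketched here. The patterns and the name `extract_file_paths` are assumptions for illustration, not the code in this PR: URLs are stripped first, and a candidate must contain at least one `/` and end in a short file extension, which excludes strings like `1.Step 1`. The trade-off is that bare file names without a directory are not picked up.

```python
import re

# Assumed pattern: one or more "segment/" parts followed by "name.ext".
PATH_RE = re.compile(r"(?<![\w./:-])(?:[\w.-]+/)+[\w.-]+\.[A-Za-z]{1,5}\b")
# Assumed pattern: anything of the form scheme://... up to the next whitespace.
URL_RE = re.compile(r"\b[a-z][a-z0-9+.-]*://\S+", re.IGNORECASE)


def extract_file_paths(text: str) -> list[str]:
    """Return slash-separated file paths found in text, ignoring URLs."""
    # Remove URLs first so their path portion is not mistaken for a file path.
    without_urls = URL_RE.sub(" ", text)
    return PATH_RE.findall(without_urls)


sample = (
    "1.Step 1: open https://example.com/a/b.html "
    "2.Step2: crash in netwerk/protocol/http/nsHttpChannel.cpp"
)
print(extract_file_paths(sample))
```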
Current metrics: metrics.log
The model seems to perform slightly worse than the current one, and `python3 -m scripts.bug_classifier component --bug-id 1902245` classifies this bug as Core::Widget: Gtk, which is incorrect.
It is worth noting that the first iteration of the file path feature model correctly classified the above bug as Core::Networking, even though it did not retrieve all of the relevant file paths correctly from the bug summary and description. I will continue to look into this.
With the current model, `python3 -m scripts.bug_classifier component --bug-id 1902245` now correctly classifies the bug as Core::Networking. The metrics can be found here: metrics.log.
Looks good in general, but could you add a few tests for the new class?
Added two tests here: a5a9c0f10abf960f37026d3df8214c2fa155ef24
It seems the tests failed; I'll revise them ASAP.
What is the difference in average precision / recall? Is there any component which gets much better or much worse?
Here are the metrics from the model with the FilePaths feature included: new_model.log
Here are the metrics from the currently deployed model (which does not include the FilePaths feature): old_model.log
At the 0.9 confidence threshold, precision increased by 0.02 and recall increased by 0.01.
Overall, most metrics increase for specific product-component pairs; however, feel free to consult the detailed metrics for the few cases where precision or recall dropped with the new model.
Given your latest changes, was there any effect on the metrics?
Training the model with the file path feature included and excluded, I got the following results:
| CT | Feature Inclusion | pre | rec | spe | f1 | geo | iba | sup |
|---|---|---|---|---|---|---|---|---|
| Training Set | With File Path | 0.95 | 0.95 | 1.00 | 0.95 | 0.97 | 0.95 | 73665 |
| Training Set | Without File Path | 0.95 | 0.95 | 1.00 | 0.95 | 0.98 | 0.95 | 73656 |
| No CT | With File Path | 0.64 | 0.63 | 0.99 | 0.62 | 0.78 | 0.60 | 8185 |
| No CT | Without File Path | 0.63 | 0.62 | 0.99 | 0.61 | 0.77 | 0.59 | 8184 |
| 60% CT | With File Path | 0.46 | 0.33 | 1.00 | 0.38 | 0.44 | 0.32 | 8185 |
| 60% CT | Without File Path | 0.44 | 0.32 | 1.00 | 0.36 | 0.42 | 0.31 | 8184 |
| 70% CT | With File Path | 0.47 | 0.32 | 1.00 | 0.37 | 0.42 | 0.30 | 8185 |
| 70% CT | Without File Path | 0.45 | 0.30 | 1.00 | 0.36 | 0.41 | 0.29 | 8184 |
| 80% CT | With File Path | 0.49 | 0.29 | 1.00 | 0.36 | 0.41 | 0.28 | 8185 |
| 80% CT | Without File Path | 0.47 | 0.28 | 1.00 | 0.34 | 0.39 | 0.27 | 8184 |
| 90% CT | With File Path | 0.50 | 0.26 | 1.00 | 0.33 | 0.38 | 0.25 | 8185 |
| 90% CT | Without File Path | 0.48 | 0.25 | 1.00 | 0.32 | 0.36 | 0.24 | 8184 |
Overall, there seems to be a marginal increase in precision and recall when the file path feature is included.
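For reference, the per-threshold differences can be computed directly from the test-set rows of the table above. This is only a small arithmetic sketch; the dictionary layout and variable names are my own, while the numbers are copied from the table.

```python
# (precision, recall) for each test-set row: first with, then without the feature.
results = {
    "No CT": ((0.64, 0.63), (0.63, 0.62)),
    "60% CT": ((0.46, 0.33), (0.44, 0.32)),
    "70% CT": ((0.47, 0.32), (0.45, 0.30)),
    "80% CT": ((0.49, 0.29), (0.47, 0.28)),
    "90% CT": ((0.50, 0.26), (0.48, 0.25)),
}

# Difference (with - without), rounded to two decimals to hide float noise.
deltas = {
    name: (round(with_fp[0] - without_fp[0], 2), round(with_fp[1] - without_fp[1], 2))
    for name, (with_fp, without_fp) in results.items()
}

for name, (d_pre, d_rec) in deltas.items():
    print(f"{name}: precision {d_pre:+.2f}, recall {d_rec:+.2f}")
```

Every difference is between +0.01 and +0.02, which is the basis for calling the gain marginal.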