bugbug Add features based on file paths in the title and description

Resolves #4269.

Introduces a new feature that extracts the file paths mentioned in a bug's title and description and splits each one into its sub-paths and individual directories/files.
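To illustrate the splitting step, here is a minimal sketch. The helper name `expand_path` and the exact shape of its output are assumptions made for illustration, not necessarily how the feature in this PR is implemented.

```python
from pathlib import PurePosixPath


def expand_path(path: str) -> list[str]:
    """Return the sub-paths and individual components of a file path.

    Hypothetical helper: the name and output order are assumptions.
    """
    # Drop the root marker of absolute paths ("/"), keep real components.
    parts = [p for p in PurePosixPath(path).parts if p.strip("/")]
    tokens: list[str] = []
    # Cumulative sub-paths: "netwerk", "netwerk/protocol", ...
    for i in range(1, len(parts) + 1):
        tokens.append("/".join(parts[:i]))
    # Individual directories and the file name, skipping duplicates.
    for part in parts:
        if part not in tokens:
            tokens.append(part)
    return tokens


print(expand_path("netwerk/protocol/http/nsHttpChannel.cpp"))
```

For the example path, this yields the four cumulative sub-paths followed by `protocol`, `http`, and `nsHttpChannel.cpp` as separate tokens, which a text vectorizer could then treat as individual features.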

Jun 20 '24 19:06 benjaminmah

Metrics of the newly trained model: metrics.log

Jun 20 '24 19:06 benjaminmah

Do you see significant improvement when adding this feature?

I've previously attached the metrics of the model here:

Metrics of the newly trained model: metrics.log

Here are the metrics of the original/current model: metrics_original.log

There is a slight improvement (~ +1%) in each of the metrics.

Jul 02 '24 14:07 benjaminmah

I've converted this PR to a draft, as I realized the extraction of file paths still needs some polishing. For example, there are cases where it may mistake a URL or a numbered step (e.g. 1.Step 1, 2.Step2) for a file path. Once that's done, I'll be sure to add a few tests for this feature!
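One possible way to handle the false positives described above is sketched below. This is only an illustration: the function name, the regular expressions, and the extension allow-list are all assumptions, not the code in this PR.

```python
import re

# Assumed, deliberately small allow-list of source-file extensions.
KNOWN_EXTENSIONS = {"c", "cpp", "h", "js", "jsm", "py", "rs", "html", "css", "idl"}

URL_RE = re.compile(r"^(?:[a-z][a-z0-9+.-]*://|www\.)", re.IGNORECASE)
STEP_RE = re.compile(r"^\d+[.)]")  # numbered steps such as "1.Step" or "2)Click"
TOKEN_RE = re.compile(r"^[\w.-]+(?:/[\w.-]+)*$")  # path-like characters only


def extract_file_paths(text: str) -> list[str]:
    """Hypothetical extractor that skips URLs and numbered steps."""
    paths: list[str] = []
    for raw in text.split():
        # Trim surrounding punctuation and a trailing sentence period.
        token = raw.strip("()[]{}<>\"'`,;:!?").rstrip(".")
        if not token or URL_RE.match(token) or STEP_RE.match(token):
            continue
        if not TOKEN_RE.match(token):
            continue
        # Keep the token only if its last segment has a known extension.
        name = token.rsplit("/", 1)[-1]
        _, dot, ext = name.rpartition(".")
        if dot and ext.lower() in KNOWN_EXTENSIONS and token not in paths:
            paths.append(token)
    return paths


text = (
    "1.Step 1: open https://example.com/a/b.html then crash in "
    "netwerk/protocol/http/nsHttpChannel.cpp, see also nsSocketTransport2.cpp."
)
print(extract_file_paths(text))
```

The allow-list trades recall for precision: paths with unlisted extensions are dropped, and a URL written without a scheme (for example `example.com/a/b.html`) would still slip through, so the heuristics would need tuning against real bug reports.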

Jul 18 '24 20:07 benjaminmah

Current metrics: metrics.log

The model seems to perform slightly worse than the current model, and `python3 -m scripts.bug_classifier component --bug-id 1902245` classifies this bug as Core::Widget: Gtk (which is incorrect).

It is worth noting that the first version of the file path feature model correctly classified the above bug as Core::Networking, even though it did not retrieve the relevant file paths from the bug summary and description with complete accuracy. I will continue to look into this.

Jul 19 '24 20:07 benjaminmah

With the current model, `python3 -m scripts.bug_classifier component --bug-id 1902245` now correctly classifies the bug as Core::Networking. The metrics can be found here: metrics.log.

Jul 22 '24 17:07 benjaminmah

Looks good in general, but could you add a few tests for the new class?

Added two tests here: a5a9c0f10abf960f37026d3df8214c2fa155ef24

Jul 24 '24 15:07 benjaminmah

It seems the tests failed; I'll make some revisions to fix these ASAP.

Jul 29 '24 13:07 benjaminmah

What is the difference in average precision / recall? Is there any component which gets much better or much worse?

Aug 01 '24 14:08 marco-c

What is the difference in average precision / recall? Is there any component which gets much better or much worse?

Here are the metrics from the model with the FilePaths feature included: new_model.log

Here are the metrics from the currently deployed model (which does not include the FilePaths feature): old_model.log

At the 0.9 confidence threshold, precision increased by 0.02 and recall increased by 0.01.

Overall, most metrics seem to increase for specific product-component pairs. Feel free to consult the detailed metrics for the few cases where either precision or recall dropped with the new model.

Aug 02 '24 17:08 benjaminmah

Given your latest changes, was there any effect on the metrics?

Training the model with the file path feature included and excluded, I got the following results:

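| Set / confidence threshold (CT) | Feature inclusion | pre | rec | spe | f1 | geo | iba | sup |
|---|---|---|---|---|---|---|---|---|
| Training Set | With File Path | 0.95 | 0.95 | 1.00 | 0.95 | 0.97 | 0.95 | 73665 |
| Training Set | Without File Path | 0.95 | 0.95 | 1.00 | 0.95 | 0.98 | 0.95 | 73656 |
| No CT | With File Path | 0.64 | 0.63 | 0.99 | 0.62 | 0.78 | 0.60 | 8185 |
| No CT | Without File Path | 0.63 | 0.62 | 0.99 | 0.61 | 0.77 | 0.59 | 8184 |
| 60% CT | With File Path | 0.46 | 0.33 | 1.00 | 0.38 | 0.44 | 0.32 | 8185 |
| 60% CT | Without File Path | 0.44 | 0.32 | 1.00 | 0.36 | 0.42 | 0.31 | 8184 |
| 70% CT | With File Path | 0.47 | 0.32 | 1.00 | 0.37 | 0.42 | 0.30 | 8185 |
| 70% CT | Without File Path | 0.45 | 0.30 | 1.00 | 0.36 | 0.41 | 0.29 | 8184 |
| 80% CT | With File Path | 0.49 | 0.29 | 1.00 | 0.36 | 0.41 | 0.28 | 8185 |
| 80% CT | Without File Path | 0.47 | 0.28 | 1.00 | 0.34 | 0.39 | 0.27 | 8184 |
| 90% CT | With File Path | 0.50 | 0.26 | 1.00 | 0.33 | 0.38 | 0.25 | 8185 |
| 90% CT | Without File Path | 0.48 | 0.25 | 1.00 | 0.32 | 0.36 | 0.24 | 8184 |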

Overall, including the file path feature seems to give a marginal increase (0.01 to 0.02) in both precision and recall at every confidence threshold, while the training-set precision and recall are unchanged.
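For reference, the per-threshold differences can be recomputed from the table with a few lines of Python. The precision and recall values are copied from the table above; the `deltas` helper is just an illustrative name.

```python
# (precision, recall) with vs. without the file path feature, per table row.
results = {
    "No CT": ((0.64, 0.63), (0.63, 0.62)),
    "60% CT": ((0.46, 0.33), (0.44, 0.32)),
    "70% CT": ((0.47, 0.32), (0.45, 0.30)),
    "80% CT": ((0.49, 0.29), (0.47, 0.28)),
    "90% CT": ((0.50, 0.26), (0.48, 0.25)),
}


def deltas(rows):
    """Return {label: (precision delta, recall delta)}, rounded to 2 places."""
    out = {}
    for label, (with_fp, without_fp) in rows.items():
        out[label] = tuple(
            round(w - wo, 2) for w, wo in zip(with_fp, without_fp)
        )
    return out


for label, (d_pre, d_rec) in deltas(results).items():
    print(f"{label:7s} precision {d_pre:+.2f}  recall {d_rec:+.2f}")
```

Every delta is positive but no larger than 0.02, so the gain is consistent across thresholds but small enough that it may be within run-to-run training noise.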

Oct 18 '24 18:10 benjaminmah