Research bugbug label prediction feasibility
Filing this issue to track findings from experimenting with label prediction models with bugbug.
I’ve collected additional data for this experiment (using the issues timeline instead of events), as the timeline contains information about issue duplicates via cross-referenced events. For example, this issue has 2 duplicates and a type-override label. These duplicates will contribute to the type-override=1 class, increasing the number of issues in that class.

I have built two models for each label, with and without duplicates, and compared the metrics. For reference, this is the condition I’m using to determine duplicates:
```python
# Count a cross-referenced event as a duplicate if the referencing issue in
# webcompat/web-bugs sits in the "duplicate" milestone.
if (
    event["event"] == "cross-referenced"
    and event["source"]["type"] == "issue"
    and event["source"]["issue"]["repository"]["full_name"] == "webcompat/web-bugs"
    and event["source"]["issue"]["milestone"]["title"] == "duplicate"
):
```
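As context, here is a minimal sketch of how this condition could be applied to the GitHub issue timeline API to count duplicates for a single issue (the count_duplicates helper, token handling, and the missing pagination are illustrative assumptions, not the actual experiment code):

```python
import requests

def count_duplicates(issue_number, token):
    """Count duplicates of a webcompat/web-bugs issue from its timeline events.

    Illustrative sketch only: pagination and error handling are omitted.
    """
    url = f"https://api.github.com/repos/webcompat/web-bugs/issues/{issue_number}/timeline"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {token}",
    }
    events = requests.get(url, headers=headers, params={"per_page": 100}).json()

    duplicates = 0
    for event in events:
        if (
            event.get("event") == "cross-referenced"
            and event["source"]["type"] == "issue"
            and event["source"]["issue"]["repository"]["full_name"] == "webcompat/web-bugs"
            # guard against issues without a milestone
            and (event["source"]["issue"].get("milestone") or {}).get("title") == "duplicate"
        ):
            duplicates += 1
    return duplicates
```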
The list of possible labels to predict: https://github.com/webcompat/web-bugs/labels?page=1&sort=count-desc

I’ve experimented with the labels that are not added by webcompat-bot and that have the most issues.
nsfw:
The goal of this model is to label incoming nsfw issues so that, based on the label, we can delete their screenshots.
For this model, I’ve used only the titles of issues, in the form of:

www.xxx.com - desktop site instead of mobile site

All issues that were ever moved to needsdiagnosis and don’t have the nsfw label are assigned class 0 (not nsfw), and all issues with the nsfw label, regardless of milestone, are assigned class 1 (nsfw).

I didn’t use the issue body, as the model that includes it (i.e. Browser, Description, Steps to reproduce, etc.) had worse performance. This is likely because a lot of nsfw issues are related to problems with audio/video (which are not exclusive to nsfw), so the model was predicting that certain video-related issues were nsfw when in fact they weren’t (a higher ratio of false positives).
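For illustration, here is a rough sketch of how the class assignment described above could look (the nsfw_class helper and the milestones_history field are assumptions for the sketch, not the actual bugbug code):

```python
def nsfw_class(issue):
    """Assign the nsfw training class for an issue, or None to exclude it.

    class 1: any issue carrying the nsfw label, regardless of milestone
    class 0: issues that were moved to needsdiagnosis and never got the nsfw label
    """
    labels = {label["name"] for label in issue["labels"]}
    milestones = issue["milestones_history"]  # assumed: every milestone the issue was ever in

    if "nsfw" in labels:
        return 1
    if "needsdiagnosis" in milestones:
        return 0
    return None  # not used for training

# The model itself is trained on titles only, e.g.
# "www.xxx.com - desktop site instead of mobile site" -> class 1
```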
Without duplicates:

In the tables below, pre = precision, rec = recall, spe = specificity, f1 = F1 score, geo = geometric mean, iba = index balanced accuracy, and sup = support.

Confidence threshold > 0.7 - 813 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.99 | 0.48 | 1.00 | 0.65 | 0.69 | 0.46 | 147 |
| 0 | 0.94 | 0.96 | 0.71 | 0.95 | 0.83 | 0.70 | 730 |

Confidence threshold > 0.8 - 717 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.99 | 0.46 | 1.00 | 0.62 | 0.67 | 0.43 | 147 |
| 0 | 0.96 | 0.85 | 0.82 | 0.90 | 0.83 | 0.70 | 730 |

Confidence threshold > 0.9 - 556 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.98 | 0.39 | 1.00 | 0.56 | 0.62 | 0.36 | 147 |
| 0 | 0.97 | 0.66 | 0.88 | 0.78 | 0.76 | 0.57 | 730 |

Confidence threshold > 0.95 - 363 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.98 | 0.36 | 1.00 | 0.53 | 0.60 | 0.34 | 147 |
| 0 | 0.98 | 0.42 | 0.96 | 0.58 | 0.63 | 0.38 | 730 |

Confidence threshold > 0.97 - 266 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.98 | 0.33 | 1.00 | 0.50 | 0.58 | 0.31 | 147 |
| 0 | 0.98 | 0.29 | 0.97 | 0.45 | 0.53 | 0.26 | 730 |
With duplicates: This is the same model, but with duplicates contributing to class 1 (nsfw). It has slightly worse performance for confidence thresholds below 90%, but improves as the threshold increases.
Confidence threshold > 0.7 - 812 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.94 | 0.46 | 0.99 | 0.61 | 0.67 | 0.43 | 182 |
| 0 | 0.91 | 0.94 | 0.65 | 0.93 | 0.78 | 0.63 | 703 |

Confidence threshold > 0.8 - 703 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.96 | 0.44 | 1.00 | 0.60 | 0.66 | 0.41 | 182 |
| 0 | 0.94 | 0.83 | 0.80 | 0.88 | 0.82 | 0.67 | 703 |

Confidence threshold > 0.9 - 531 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.00 | 0.36 | 1.00 | 0.53 | 0.60 | 0.33 | 182 |
| 0 | 0.96 | 0.64 | 0.91 | 0.77 | 0.76 | 0.56 | 703 |

Confidence threshold > 0.95 - 332 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.00 | 0.29 | 1.00 | 0.45 | 0.54 | 0.27 | 182 |
| 0 | 0.99 | 0.39 | 0.98 | 0.56 | 0.62 | 0.36 | 703 |

Confidence threshold > 0.97 - 218 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.00 | 0.25 | 1.00 | 0.40 | 0.50 | 0.23 | 182 |
| 0 | 0.98 | 0.24 | 0.98 | 0.38 | 0.48 | 0.22 | 703 |
Summary: With a confidence threshold of 80%, this model is able to find 46% of nsfw issues and is wrong about 1% of the time. This result would work for us; however, without the issue body, it’s essentially only the domain name that matters for predicting whether an issue is nsfw. Perhaps a block list of nsfw domains would produce better results than predicting it with ML. It may also be possible to improve the model to increase the number of nsfw issues it finds.
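For reference, a minimal sketch of how a confidence threshold like the ones above could be used when deciding whether to act on a prediction (assuming a scikit-learn-style pipeline that accepts raw titles and exposes predict_proba; the helper name is hypothetical):

```python
def classify_with_threshold(model, titles, threshold=0.8):
    """Return (index, predicted_class, confidence) only for predictions above the threshold.

    Issues below the threshold stay unclassified and fall back to manual triage.
    """
    probabilities = model.predict_proba(titles)   # shape: (n_issues, 2)
    confidences = probabilities.max(axis=1)
    predictions = probabilities.argmax(axis=1)

    results = []
    for i, (pred, conf) in enumerate(zip(predictions, confidences)):
        if conf > threshold:
            results.append((i, int(pred), float(conf)))
    return results
```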
unsupported / ua-override
The goal of this model is to label issues that are related to Firefox being unsupported on certain sites due to UA detection. It would be beneficial to label such issues and move them to needsdiagnosis after moderation, bypassing triage.
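As a sketch of what that action could look like against the GitHub REST API (the label_and_move helper and the milestone number are placeholders, not the actual automation):

```python
import requests

API = "https://api.github.com/repos/webcompat/web-bugs/issues"
NEEDSDIAGNOSIS_MILESTONE = 4  # placeholder: the real milestone number would need to be looked up

def label_and_move(issue_number, token):
    """Add the type-unsupported label and move the issue to the needsdiagnosis milestone."""
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {token}",
    }
    # Add the label (POST .../issues/{number}/labels)
    requests.post(
        f"{API}/{issue_number}/labels",
        headers=headers,
        json={"labels": ["type-unsupported"]},
    )
    # Set the milestone (PATCH .../issues/{number})
    requests.patch(
        f"{API}/{issue_number}",
        headers=headers,
        json={"milestone": NEEDSDIAGNOSIS_MILESTONE},
    )
```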
Without duplicates: I’ve experimented with both labels, and the most successful approach was using the type-unsupported label alone.
Confidence threshold > 0.7 - 705 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.78 | 0.45 | 0.99 | 0.57 | 0.67 | 0.42 | 47 |
| 0 | 0.97 | 0.99 | 0.55 | 0.98 | 0.74 | 0.57 | 667 |

Confidence threshold > 0.8 - 698 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.77 | 0.36 | 0.99 | 0.49 | 0.60 | 0.34 | 47 |
| 0 | 0.97 | 0.98 | 0.57 | 0.98 | 0.75 | 0.59 | 667 |

Confidence threshold > 0.9 - 680 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.80 | 0.26 | 1.00 | 0.39 | 0.50 | 0.24 | 47 |
| 0 | 0.97 | 0.97 | 0.64 | 0.97 | 0.79 | 0.64 | 667 |

Confidence threshold > 0.95 - 669 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.91 | 0.21 | 1.00 | 0.34 | 0.46 | 0.20 | 47 |
| 0 | 0.98 | 0.96 | 0.66 | 0.97 | 0.80 | 0.65 | 667 |

Confidence threshold > 0.97 - 659 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.89 | 0.17 | 1.00 | 0.29 | 0.41 | 0.16 | 47 |
| 0 | 0.98 | 0.96 | 0.72 | 0.97 | 0.83 | 0.71 | 667 |
With duplicates: In the case of this label, the model performed slightly better with duplicates: with a confidence threshold of 90%, it finds 35% of the issues and is wrong 4% of the time.
Confidence threshold > 0.7 - 717 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.91 | 0.46 | 1.00 | 0.61 | 0.68 | 0.43 | 63 |
| 0 | 0.96 | 0.99 | 0.54 | 0.97 | 0.73 | 0.56 | 663 |

Confidence threshold > 0.8 - 706 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.93 | 0.44 | 1.00 | 0.60 | 0.67 | 0.42 | 63 |
| 0 | 0.96 | 0.98 | 0.59 | 0.97 | 0.76 | 0.60 | 663 |

Confidence threshold > 0.9 - 686 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.96 | 0.35 | 1.00 | 0.51 | 0.59 | 0.33 | 63 |
| 0 | 0.97 | 0.97 | 0.63 | 0.97 | 0.78 | 0.63 | 663 |

Confidence threshold > 0.95 - 668 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.95 | 0.32 | 1.00 | 0.48 | 0.56 | 0.30 | 63 |
| 0 | 0.97 | 0.95 | 0.73 | 0.96 | 0.83 | 0.71 | 663 |

Confidence threshold > 0.97 - 660 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.95 | 0.30 | 1.00 | 0.46 | 0.55 | 0.28 | 63 |
| 0 | 0.97 | 0.94 | 0.75 | 0.96 | 0.84 | 0.72 | 663 |
Summary: The best result is for a model with a confidence threshold of 90%, finding 35% of the issues and being wrong 4% of the time. This model could potentially be improved and used; however, I realized that this labelling might not be very efficient in our process. Right now, if an issue has already been reported, it’s closed and marked as a duplicate at the triage step. However, since we don’t have an ML mechanism to find duplicates, this labelling might conflict with the manual process. For example, the model finds an issue that is likely to receive the type-unsupported label and moves it to needsdiagnosis, but there might already be a duplicate for it, and at that point this could create more work compared to the manual process.

There is an option to not move it to the needsdiagnosis milestone and only add the label; however, this label without further action might not be worth it, as it’s not that insightful :) It will be interesting to investigate a model for finding duplicates, as it could be more useful in conjunction with this or any other label, if we’re able to build a model with reasonable metrics.
Tracking protection labels (type-tracking-protection-basic, type-tracking-protection-strict, type-trackingprotection)
The goal of this model is to add a type-trackingprotection label to potentially detect issues caused by basic/strict tracking protection. As we have a lot of issues with these labels, it made sense to try to build a model for prediction. As I later discovered, only issues whose label was added by a human should count towards the type-trackingprotection=1 class, because this label often comes as an extra label from the reporter (it gets added by webcompat-bot), and despite having the tracking protection extra label, those issues might be of a different nature.
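To illustrate the distinction, here is a minimal sketch of how human-added labels could be separated from bot-added ones using the same timeline data (field names follow GitHub's timeline payload; the helper and bot list are assumptions):

```python
BOT_LOGINS = {"webcompat-bot"}  # assumption: the bot account that adds reporter extra-labels

def human_added_labels(timeline_events):
    """Return label names that were added by a human, based on 'labeled' timeline events."""
    labels = set()
    for event in timeline_events:
        if (
            event.get("event") == "labeled"
            and event.get("actor")
            and event["actor"]["login"] not in BOT_LOGINS
        ):
            labels.add(event["label"]["name"])
    return labels
```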
Confidence threshold > 0.8 - 724 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.77 | 0.15 | 1.00 | 0.25 | 0.38 | 0.13 | 68 |
| 0 | 0.94 | 0.97 | 0.40 | 0.95 | 0.62 | 0.41 | 693 |

Confidence threshold > 0.9 - 672 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.89 | 0.12 | 1.00 | 0.21 | 0.34 | 0.11 | 68 |
| 0 | 0.95 | 0.91 | 0.53 | 0.93 | 0.69 | 0.50 | 693 |

Confidence threshold > 0.95 - 604 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.00 | 0.06 | 1.00 | 0.11 | 0.24 | 0.05 | 68 |
| 0 | 0.96 | 0.83 | 0.65 | 0.89 | 0.73 | 0.55 | 693 |

Confidence threshold > 0.97 - 531 classified

| class | pre | rec | spe | f1 | geo | iba | sup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.00 | 0.04 | 1.00 | 0.08 | 0.21 | 0.04 | 68 |
| 0 | 0.98 | 0.74 | 0.82 | 0.85 | 0.78 | 0.61 | 693 |
Summary: Unfortunately, the model doesn’t show results that would work for us, as the number of issues it is able to find with acceptable precision is very low.
Other labels
I’ve tried building a few other models and didn’t get acceptable results.
### type-media
One interesting observation: despite the fact that the type-media label has quite a lot of issues labelled with it (almost 800) and the model metrics were very good, it didn’t do a good job of accurately predicting media issues. It turns out the metrics were inflated by a large number (636) of issues like these:

https://github.com/webcompat/web-bugs/issues/7925
https://github.com/webcompat/web-bugs/issues/7974

These issues contained technical information with error codes, and the type-media label was added by the bot. We don’t receive issues with this content anymore; however, the model was biased towards them. After removing most of these issues, the model unfortunately didn’t show good results.