
Research bugbug label prediction feasibility

ksy36 opened this issue 3 years ago · 4 comments

Filing this issue to track findings on experimenting with models for label prediction with bugbug.

I’ve collected additional data for this experiment (using the issues timeline instead of events), as the timeline contains information about issue duplicates via cross-referenced events. For example, this issue has 2 duplicates and a type-override label. These duplicates will contribute to the type-override=1 class, increasing the number of issues in that class.

I have built two models for each label, with and without duplicates, and compared the metrics. For reference, this is the condition I’m using to determine duplicates:

# A cross-referenced timeline event counts as a duplicate when the
# referencing issue lives in webcompat/web-bugs and sits in the
# "duplicate" milestone.
if (
    event["event"] == "cross-referenced"
    and event["source"]["type"] == "issue"
    and event["source"]["issue"]["repository"]["full_name"] == "webcompat/web-bugs"
    and event["source"]["issue"]["milestone"]["title"] == "duplicate"
):
    ...
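
For context, here’s a minimal sketch of how such duplicates could be collected via GitHub’s issue timeline API (https://docs.github.com/en/rest/issues/timeline). The function name is hypothetical and pagination is omitted for brevity; this is not the actual bugbug collection code.

import requests

def duplicate_numbers(issue_number, token):
    # Fetch the timeline for one issue in webcompat/web-bugs.
    url = (
        "https://api.github.com/repos/webcompat/web-bugs"
        f"/issues/{issue_number}/timeline"
    )
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    response = requests.get(url, headers=headers, params={"per_page": 100})
    response.raise_for_status()

    duplicates = []
    for event in response.json():
        # Guard against missing keys: not every cross-reference comes from
        # an issue, and not every issue has a milestone.
        source = event.get("source") or {}
        issue = source.get("issue") or {}
        repo = issue.get("repository") or {}
        milestone = issue.get("milestone") or {}
        if (
            event.get("event") == "cross-referenced"
            and source.get("type") == "issue"
            and repo.get("full_name") == "webcompat/web-bugs"
            and milestone.get("title") == "duplicate"
        ):
            duplicates.append(issue["number"])
    return duplicates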

The list of possible labels to predict: https://github.com/webcompat/web-bugs/labels?page=1&sort=count-desc. I’ve experimented with the labels that are not added by webcompat-bot and have the most issues.

ksy36 commented Jan 07 '22 17:01

nsfw:

The goal of this model is to label incoming nsfw issues so that, based on the label, we can delete their screenshots.

For this model, I’ve used only issue titles, which have the form: www.xxx.com - desktop site instead of mobile site

All issues that were ever moved to needsdiagnosis and don’t have the nsfw label are assigned class 0 (not nsfw), and all issues with the nsfw label, regardless of milestone, are assigned class 1 (nsfw).

I didn’t use issue bodies, as the model that included them (i.e. Browser, Description, Steps to reproduce, etc.) performed worse. This is likely because many nsfw issues involve audio/video problems, which are not exclusive to nsfw sites, so the model predicted that certain video-related issues were nsfw when in fact they weren’t (a higher ratio of false positives).
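
For reference on the tables below: the pre / rec / spe / f1 / geo / iba / sup columns match the output of imbalanced-learn’s classification_report_imbalanced (precision, recall, specificity, F1, geometric mean, index balanced accuracy, support). A minimal sketch of how the per-threshold numbers can be produced, assuming a scikit-learn-style classifier with predict_proba; clf, X_test and y_test are hypothetical names, not taken from the actual evaluation code:

import numpy as np
from imblearn.metrics import classification_report_imbalanced

def report_at_threshold(clf, X_test, y_test, threshold):
    proba = clf.predict_proba(X_test)
    y_pred = proba.argmax(axis=1)
    # Predictions below the confidence threshold go into a separate
    # "unclassified" bucket (-1), so per-class recall is measured against
    # the full test support, consistent with the constant sup column.
    y_pred = np.where(proba.max(axis=1) > threshold, y_pred, -1)
    print(f"Confidence threshold > {threshold} - {(y_pred != -1).sum()} classified")
    print(classification_report_imbalanced(y_test, y_pred, labels=[1, 0]))

for t in (0.7, 0.8, 0.9, 0.95, 0.97):
    report_at_threshold(clf, X_test, y_test, t)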

Without duplicates: Confidence threshold > 0.7 - 813 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       0.99      0.48      1.00      0.65      0.69      0.46       147
             0       0.94      0.96      0.71      0.95      0.83      0.70       730

Confidence threshold > 0.8 - 717 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.99      0.46      1.00      0.62      0.67      0.43       147
             0       0.96      0.85      0.82      0.90      0.83      0.70       730

Confidence threshold > 0.9 - 556 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.98      0.39      1.00      0.56      0.62      0.36       147
             0       0.97      0.66      0.88      0.78      0.76      0.57       730

Confidence threshold > 0.95 - 363 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       0.98      0.36      1.00      0.53      0.60      0.34       147
             0       0.98      0.42      0.96      0.58      0.63      0.38       730

Confidence threshold > 0.97 - 266 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.98      0.33      1.00      0.50      0.58      0.31       147
             0       0.98      0.29      0.97      0.45      0.53      0.26       730

With duplicates: This is the same model, but with duplicates contributing to class 1 (nsfw). It performs slightly worse at confidence thresholds below 90%, but improves relative to the no-duplicates model as the threshold increases.

Confidence threshold > 0.7 - 812 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.94      0.46      0.99      0.61      0.67      0.43       182
             0       0.91      0.94      0.65      0.93      0.78      0.63       703

Confidence threshold > 0.8 - 703 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.96      0.44      1.00      0.60      0.66      0.41       182
             0       0.94      0.83      0.80      0.88      0.82      0.67       703

Confidence threshold > 0.9 - 531 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       1.00      0.36      1.00      0.53      0.60      0.33       182
             0       0.96      0.64      0.91      0.77      0.76      0.56       703

Confidence threshold > 0.95 - 332 classified

                        pre       rec       spe        f1       geo       iba       sup

             1       1.00      0.29      1.00      0.45      0.54      0.27       182
             0       0.99      0.39      0.98      0.56      0.62      0.36       703

Confidence threshold > 0.97 - 218 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       1.00      0.25      1.00      0.40      0.50      0.23       182
             0       0.98      0.24      0.98      0.38      0.48      0.22       703

Summary: With a confidence threshold of 80%, this model is able to find 46% of nsfw issues and is wrong 1% of the time. This result would work for us; however, without an issue body, only the domain name matters for predicting whether the issue is nsfw. Perhaps a block list of nsfw domains would produce better results than predicting this with ML. It may also be possible to improve the model to increase the number of nsfw issues it finds.
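
As a rough illustration of the block-list idea (a sketch, not existing webcompat tooling): since titles have the form www.xxx.com - description, the domain can be parsed straight out of the title and checked against a curated set. NSFW_DOMAINS below is a placeholder, not a real list.

NSFW_DOMAINS = {"example-nsfw-site.com"}  # placeholder entries

def is_nsfw(title: str) -> bool:
    # Titles look like "www.example.com - desktop site instead of mobile site".
    domain = title.split(" - ", 1)[0].strip().lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return domain in NSFW_DOMAINS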

ksy36 commented Jan 07 '22 18:01

unsupported / ua-override

The goal of this model is to label issues where Firefox is unsupported on certain sites due to UA detection. It would be beneficial to label such issues and move them to needsdiagnosis after moderation, bypassing triage.

Without duplicates: I’ve experimented with both labels, and the most successful model used the type-unsupported label alone.

Confidence threshold > 0.7 - 705 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.78      0.45      0.99      0.57      0.67      0.42        47
             0       0.97      0.99      0.55      0.98      0.74      0.57       667

Confidence threshold > 0.8 - 698 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       0.77      0.36      0.99      0.49      0.60      0.34        47
             0       0.97      0.98      0.57      0.98      0.75      0.59       667

Confidence threshold > 0.9 - 680 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       0.80      0.26      1.00      0.39      0.50      0.24        47
             0       0.97      0.97      0.64      0.97      0.79      0.64       667

Confidence threshold > 0.95 - 669 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.91      0.21      1.00      0.34      0.46      0.20        47
             0       0.98      0.96      0.66      0.97      0.80      0.65       667

Confidence threshold > 0.97 - 659 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.89      0.17      1.00      0.29      0.41      0.16        47
             0       0.98      0.96      0.72      0.97      0.83      0.71       667

With duplicates: For this label, the model performed slightly better with duplicates: at a confidence threshold of 90% it finds 35% of the issues and is wrong 4% of the time.

Confidence threshold > 0.7 - 717 classified

                     pre       rec       spe        f1       geo       iba       sup

             1       0.91      0.46      1.00      0.61      0.68      0.43        63
             0       0.96      0.99      0.54      0.97      0.73      0.56       663

Confidence threshold > 0.8 - 706 classified

                         pre       rec       spe        f1       geo       iba       sup

             1       0.93      0.44      1.00      0.60      0.67      0.42        63
             0       0.96      0.98      0.59      0.97      0.76      0.60       663

Confidence threshold > 0.9 - 686 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.96      0.35      1.00      0.51      0.59      0.33        63
             0       0.97      0.97      0.63      0.97      0.78      0.63       663

Confidence threshold > 0.95 - 668 classified

                      pre       rec       spe        f1       geo       iba       sup

             1       0.95      0.32      1.00      0.48      0.56      0.30        63
             0       0.97      0.95      0.73      0.96      0.83      0.71       663

Confidence threshold > 0.97 - 660 classified

                     pre       rec       spe        f1       geo       iba       sup

             1       0.95      0.30      1.00      0.46      0.55      0.28        63
             0       0.97      0.94      0.75      0.96      0.84      0.72       663

Summary: The best result is a model with a confidence threshold of 90%, finding 35% of the issues and wrong 4% of the time. This model could potentially be improved and used; however, I realized that this labelling might not fit our process well. Right now, if an issue has already been reported, it is closed and marked as a duplicate at the triage step. Since we don’t have an ML mechanism for finding duplicates, this labelling might conflict with the manual process: the model finds an issue that is likely to receive the type-unsupported label and moves it to needsdiagnosis, but a duplicate might already exist, at which point this creates more work than the manual process. There is an option to only add the label without moving the issue to the needsdiagnosis milestone, but a label without further action might not be worth it, as it’s not that insightful :) It will be interesting to investigate a model for finding duplicates, as it could be more useful in conjunction with this or any other label, if we’re able to build a model with reasonable metrics.
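
As a starting point for that investigation, one cheap duplicate pre-check (a sketch under assumptions, not an existing bugbug model) would be comparing a new issue’s title against previously reported titles with TF-IDF cosine similarity; all names below are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def likely_duplicates(new_title, existing_titles, threshold=0.8):
    # Fit on all titles so the new title shares the same vocabulary.
    vectorizer = TfidfVectorizer().fit(existing_titles + [new_title])
    existing = vectorizer.transform(existing_titles)
    new = vectorizer.transform([new_title])
    scores = cosine_similarity(new, existing)[0]
    return [t for t, s in zip(existing_titles, scores) if s >= threshold]

Since titles start with the reported domain, even this naive approach would cluster reports for the same site; a real model would likely need the issue body and smarter features.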

ksy36 commented Jan 07 '22 18:01

Tracking protection labels

(type-tracking-protection-basic, type-tracking-protection-strict, type-trackingprotection)

The goal of this model is to add the type-trackingprotection label to potentially detect issues caused by basic/strict tracking protection. Since we have a lot of issues with these labels, it made sense to try to build a prediction model. As I later discovered, only issues whose labels were added by a human should count towards type-trackingprotection=1: this label often comes as an extra label from the reporter (it gets added by webcompat-bot), and despite carrying the tracking-protection extra label, those issues might be of a different nature.
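
A minimal sketch of that filter, assuming the same GitHub timeline events collected above; the helper name is hypothetical:

def has_human_added_label(timeline, label_name):
    # Count the label only if a "labeled" event was performed by a human,
    # i.e. any actor other than webcompat-bot.
    for event in timeline:
        if (
            event.get("event") == "labeled"
            and (event.get("label") or {}).get("name") == label_name
            and (event.get("actor") or {}).get("login") != "webcompat-bot"
        ):
            return True
    return False

# e.g. has_human_added_label(timeline, "type-trackingprotection")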

Confidence threshold > 0.8 - 724 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       0.77      0.15      1.00      0.25      0.38      0.13        68
             0       0.94      0.97      0.40      0.95      0.62      0.41       693

Confidence threshold > 0.9 - 672 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       0.89      0.12      1.00      0.21      0.34      0.11        68
             0       0.95      0.91      0.53      0.93      0.69      0.50       693

Confidence threshold > 0.95 - 604 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       1.00      0.06      1.00      0.11      0.24      0.05        68
             0       0.96      0.83      0.65      0.89      0.73      0.55       693

Confidence threshold > 0.97 - 531 classified

                       pre       rec       spe        f1       geo       iba       sup

             1       1.00      0.04      1.00      0.08      0.21      0.04        68
             0       0.98      0.74      0.82      0.85      0.78      0.61       693

Summary: Unfortunately, the model doesn’t show results that would work for us, as the number of issues it is able to find with acceptable precision is very low.

ksy36 commented Jan 07 '22 19:01

Other labels

I’ve tried building a few other models but didn’t get acceptable results.

type-media

One interesting observation: despite the type-media label being applied to quite a lot of issues (almost 800) and the model metrics being very good, the model didn’t accurately predict media issues. It turns out the metrics were inflated by a large number (636) of issues like these:

https://github.com/webcompat/web-bugs/issues/7925 https://github.com/webcompat/web-bugs/issues/7974

These issues contained technical information with error codes, and the type-media label was added by the bot. We don’t receive issues with this content anymore; however, the model was biased towards them. After removing most of these issues, the model unfortunately didn’t show good results.
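
One way that cleanup could be expressed (a sketch; the marker string is an assumption based on the linked issues, which carry machine-generated error details in the body, not the exact filter used):

def drop_bot_media_reports(issues):
    # Drop bot-filed media reports whose bodies contain error-code details;
    # "Error code" is an assumed marker string.
    return [
        issue for issue in issues
        if "Error code" not in (issue.get("body") or "")
    ]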

ksy36 commented Jan 07 '22 19:01