Only evaluate spambug on bugs filed by people without "editbugs" permissions, then check if it's better to train on all or just non-editbugs
The spambug model is only applied to bugs filed by people without "editbugs" permissions, so it makes sense to only evaluate it on these kinds of bugs and not all bugs.
For training, we can keep using all bugs, but we should check if the performance improves or worsens in case we only use bugs filed by non-editbugs people.
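As a rough sketch of the comparison described above, the two training variants and the shared evaluation set could be built as follows. The helper names, the `creator` key, and the `users_with_editbugs` set are assumptions for illustration, not the actual bugbug code.

```python
def is_privileged(bug, users_with_editbugs):
    """Return True if the bug's creator is assumed to have editbugs permissions."""
    # Assumption: each bug is a dict whose "creator" key holds the reporter's login.
    return bug["creator"] in users_with_editbugs


def build_experiment_sets(train_bugs, test_bugs, users_with_editbugs):
    """Build both training variants plus the single evaluation set."""
    # Variant A: train on every bug.
    train_all = list(train_bugs)
    # Variant B: train only on bugs filed by people without editbugs.
    train_filtered = [
        bug for bug in train_bugs if not is_privileged(bug, users_with_editbugs)
    ]
    # Evaluation: always restricted to non-editbugs reporters, in both variants.
    eval_set = [
        bug for bug in test_bugs if not is_privileged(bug, users_with_editbugs)
    ]
    return train_all, train_filtered, eval_set
```

Evaluating both variants on the same filtered set keeps the comparison fair: only the training data changes between the two runs.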
I'm trying to wrap my head around this issue, so correct me if I'm wrong:
Is the issue meant to compare the performance of spambug when trained on all bugs vs. when trained only on bugs filed by non-editbugs people?
Currently, we train on all bugs. Is there a field that tracks the "editbugs" permissions? @suhaibmujahid
The goal of this issue is twofold:
- Only evaluate the model on bugs filed by people without editbugs permissions;
- Compare the performance of the model when we train on all bugs and when we train only on bugs filed by people with editbugs permissions.
@jpangas unfortunately this issue might be problematic for you to fix, as you'd need special permissions to see which users have editbugs permissions.
A workaround could be checking if the user's email belongs to a Mozilla employee or not (e.g., ends with @mozilla.com).
This will not catch all cases, but it could perform better in the context of the training dataset (item 2), since it will catch cases such as a bug filed by a user who had editbugs permissions at the time but no longer has them.
In the context of item 1, relying on the actual editbugs permissions will give more realistic results.
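A minimal sketch of the email-based workaround, assuming the check is a plain suffix match on the creator's address. The function name and the `MOZILLA_DOMAINS` tuple are hypothetical; only the `@mozilla.com` suffix comes from the discussion.

```python
# Assumption: only the suffix mentioned in this thread; more could be added.
MOZILLA_DOMAINS = ("@mozilla.com",)


def looks_like_mozilla_employee(email):
    """Heuristic: True if the address ends with a known Mozilla domain."""
    # str.endswith accepts a tuple of suffixes; lowercase first so the
    # comparison is case-insensitive.
    return email.lower().endswith(MOZILLA_DOMAINS)
```

Matching on the suffix rather than a substring avoids treating an address such as `someone@mozilla.com.example` as a Mozilla one.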
@marco-c wdyt?
> 2. Compare the performance of the model when we train on all bugs and when we train only on bugs filed by people with editbugs permissions.
In 2), did you mean training on all bugs vs. training on bugs filed by people with editbugs permissions, or did you actually mean training only on bugs filed by people without editbugs permissions (which is what we do currently)?
https://github.com/mozilla/bugbug/blob/f9906057a5281b8913fb5a92edbe73440953581b/bugbug/models/spambug.py#L87-L89 Currently we train only on bugs filed by non-Mozillians (I assume these people do not have editbugs permissions). One way to test this would be to check how performance changes when we include bugs filed by Mozillians (in line with what @suhaibmujahid has suggested).
Yes, sorry, I meant train on bugs filed by people without editbugs permissions. Currently we skip @mozilla.com only; the goal of this issue would be to check what changes if, for training, we also skip people with editbugs permissions (since we are sure they are not filing spam bugs).
For evaluation we should always skip them, since that is what we do in production and we want to measure exactly what happens in production.
You can retrieve the list of users with editbugs by calling bugzilla.get_groups_users(["editbugs", "editbugs-team"]). The problem is that you can't test it yourself; you would need us to test it.
In the model, we could do something like (pseudocode):
    try:
        userswitheditbugs = ...
    except PermissionDenied:
        userswitheditbugs = set()
    ...
    if "@mozilla" in creator or creator in userswitheditbugs:
        skip
P.S.: as part of this, we should also skip people with "@softvision" in their email address.
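The pseudocode above could be fleshed out roughly as follows. This is only a sketch: `PermissionDenied` is a hypothetical stand-in for whatever error the lookup raises without access, and `fetch_users` is an injected callable (for example, a wrapper around the `get_groups_users` call mentioned above) so the logic can be exercised without special permissions.

```python
class PermissionDenied(Exception):
    """Hypothetical error raised when the caller may not list group members."""


# Substrings from the discussion: Mozilla and Softvision addresses are skipped.
SKIPPED_EMAIL_MARKERS = ("@mozilla", "@softvision")


def load_users_with_editbugs(fetch_users):
    """Return the set of editbugs users, or an empty set without access."""
    try:
        return set(fetch_users())
    except PermissionDenied:
        # Fall back to the email-based checks only.
        return set()


def should_skip(creator, users_with_editbugs):
    """True if bugs from this creator should be left out."""
    has_skipped_email = any(marker in creator for marker in SKIPPED_EMAIL_MARKERS)
    return has_skipped_email or creator in users_with_editbugs
```

With the empty-set fallback, contributors without access still get the email-based filtering, while maintainers with access get the full editbugs filter.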
Great. Thanks, I'm on it and I will open a PR once everything is ready.
@jpangas we already have a feature to check if the user is a mozillian, but it is not used in the spambug model:
https://github.com/mozilla/bugbug/blob/f9906057a5281b8913fb5a92edbe73440953581b/bugbug/bug_features.py#L177-L184
Thanks @suhaibmujahid