Open-Assistant
filtering troll data
In order to reach more representative labeling results, we need to filter out purposely inappropriate labeling. Potential solutions and implementations should be discussed here.
All labels belonging to users identified as most probably being saboteurs or trolls should be removed and replaced by the results of additional labeling rounds.
As this might have an impact on the already existing trees, there is some time pressure on this discussion.
But even if the filters cannot be applied to the already collected data, they should at least be applied to upcoming data.
Suggestion 1)
To avoid erasing the appreciated heterogeneity in the data, it's important to calculate a threshold that gives a statistically significant signal indicating that a user is not just unconventional but, with 99% probability, systematically and purposely harming the data.
If a statistician is around, it would be great if you posted the proper calculation. It's important not to recalculate the distributions after the filtering has been applied.
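A minimal sketch of how such a threshold might be computed, assuming a one-sided binomial test of a user's disagreement rate against the community baseline (the function name, the inputs, and the 1% level are illustrative assumptions, not project code):

```python
# Sketch only: flag a user when their rate of disagreement with the
# majority label is significantly above the population baseline at the
# 1% level (reading "99% probability" as a one-sided binomial test).

from scipy.stats import binomtest

def is_probable_saboteur(user_disagreements: int,
                         user_labels: int,
                         baseline_disagreement_rate: float,
                         alpha: float = 0.01) -> bool:
    """True if the user's disagreement rate is significantly higher
    than the community baseline (one-sided binomial test)."""
    if user_labels == 0:
        return False
    result = binomtest(user_disagreements, user_labels,
                       baseline_disagreement_rate, alternative="greater")
    return result.pvalue < alpha

# Example: 40 of 50 labels disagree with the majority, while the
# average user disagrees on 15% of items.
print(is_probable_saboteur(40, 50, 0.15))  # True
```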
Suggestion 2)
Labels that were given in too short a time were most likely not made after thoughtful consideration. This can either be a standalone indicator or be used in conjunction with the signal from suggestion 1. As saboteurs or trolls might adapt to this monitoring, it should in my opinion be treated as sufficient but not necessary evidence of intent. The minimum time a user should need for a labeling decision should be proportional to the amount of text the labeling refers to. This screening method would also need a certain percentage threshold in order to signal purposeful misuse.
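A rough sketch of this idea, assuming a minimum time that grows linearly with text length and a fixed percentage threshold (all constants and names are assumptions for illustration):

```python
# Sketch only: the minimum plausible labeling time grows with the amount
# of text, and a user is flagged only when a large share of their labels
# fall below that minimum.

SECONDS_PER_CHAR = 0.01   # assumed minimal reading speed (~100 chars/sec)
BASE_SECONDS = 2.0        # assumed fixed overhead per labeling task
FLAG_FRACTION = 0.95      # share of too-fast labels required to flag a user

def min_expected_seconds(text_length: int) -> float:
    return BASE_SECONDS + SECONDS_PER_CHAR * text_length

def is_suspiciously_fast(label_times: list[tuple[int, float]]) -> bool:
    """label_times: (text_length_in_chars, seconds_spent) per labeling."""
    if not label_times:
        return False
    too_fast = sum(1 for length, seconds in label_times
                   if seconds < min_expected_seconds(length))
    return too_fast / len(label_times) >= FLAG_FRACTION
```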
I encourage everyone to discuss this issue and to implement solutions agreed on by consensus.
A few thoughts:
- Time spent looking at text, even for thoughtful reviewers, is not necessarily proportional to how much text there is. Say the task is formatting or grammatical, like "change the last sentence of this email...", or an assistant response is completely wrong and about the wrong subject. In those cases, you can tell whether the format is correct by reading part of the message rather than the whole thing. It's not uncommon for me to know almost instantly after seeing a message what the correct rating is.
- Adversaries could easily set up more accounts to "slow down" submission times without slowing down their labeling bandwidth, thwarting any timing-based defenses.
- If you are an expert on something and consistently label counter to everyone else because no one else labeled it correctly, should you be punished for that?
- We should be extra careful not to exclude the work of people we don't agree with from the dataset.
I agree with your first objection. There is no tight correlation; I just thought it would be better to consider this relation than to use a single time value like half a second. I imagine there is a clearly observable statistical gap between the roughly normally distributed reaction times of regular users and those belonging to trolls, so maybe it's sufficient to use just one minimum time. It's important to mention that undercutting that value would have to apply to, e.g., 95% of all labelings the user does. This still allows almost all cases in which a user by chance answers very fast.
But your second objection is pretty strong. Using this "time gate" could not prevent such efforts, but it would at least filter out the lazy trolls.
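A small sketch of this single-minimum-time variant, assuming the gate is taken as a low percentile of all observed reaction times and a user is flagged only if 95% of their labelings undercut it (names and parameters are illustrative, not project code):

```python
# Sketch only: derive one minimum time from the observed distribution of
# reaction times, then flag users who undercut it almost all the time.

import numpy as np

def minimum_time_gate(all_reaction_times: list[float],
                      percentile: float = 1.0) -> float:
    """Use a low percentile of all observed reaction times as the gate."""
    return float(np.percentile(all_reaction_times, percentile))

def flag_user(user_times: list[float], gate: float,
              required_fraction: float = 0.95) -> bool:
    if not user_times:
        return False
    below = sum(t < gate for t in user_times)
    return below / len(user_times) >= required_fraction
```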
Regarding the last two objections, I think it is very unlikely that someone by chance (or in more than 95% of cases) only has to rate messages that fall within their expertise and therefore gives extreme, divergent judgements. This does happen, but since the average labeling behavior is considered, it will not exceed the calculated significance threshold reflecting 95%.
I'd like to stress that I do not want to establish censorship. I just think there are statistical methods to efficiently get rid of at least some purposely biasing troll data without affecting merely volatile labeling behavior.
I think releasing the data in full, with only legally required removals like PII, and with users and votes identified by a hash or something, would help. Then we'd have something that the statistically minded can investigate and explore. For example, it may be possible to group people into clusters and find "bad neighbourhoods", similar to how Google's PageRank identified link spam.
Also, this would be a great area of investigation for data science students looking for some real-world experience. I'm guessing there are a lot of people from the academic world in the community.
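A purely illustrative sketch of what such an exploration could look like, assuming hashed user IDs and a simple k-means clustering of a user-by-item vote matrix (the column names and the clustering choice are assumptions, not project code):

```python
# Sketch only: anonymize user ids by hashing, build a user-by-item vote
# matrix, and look for clusters of users who vote alike, in the spirit of
# finding "bad neighbourhoods" of mutually reinforcing accounts.

import hashlib
import pandas as pd
from sklearn.cluster import KMeans

def anonymize(user_id: str) -> str:
    return hashlib.sha256(user_id.encode()).hexdigest()[:16]

def cluster_voters(votes: pd.DataFrame, n_clusters: int = 5) -> pd.Series:
    """votes: rows = labelings with columns 'user_id', 'item_id', 'rating'."""
    votes = votes.assign(user_id=votes["user_id"].map(anonymize))
    matrix = votes.pivot_table(index="user_id", columns="item_id",
                               values="rating", fill_value=0)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(matrix)
    return pd.Series(labels, index=matrix.index, name="cluster")
```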
Making the data accessible to everyone for statistical investigation sounds interesting. But I can't remember a consent form covering this, so I'm afraid that might not be possible.