wikiloop-doublecheck
Help migrating away from ORES
Hi! I am part of the Wikimedia ML team. We are starting the migration of ORES clients to another infrastructure, since we are planning to deprecate ORES. More info at https://wikitech.wikimedia.org/wiki/ORES
TL;DR:
The ORES infrastructure is going to be replaced by Lift Wing, a more modern, Kubernetes-based service. All the ORES models (damaging, goodfaith, etc.) are already running on Lift Wing; more on how to use them at https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage. We also have new models called Revert Risk, meant to replace goodfaith and damaging, for example. They are available on Lift Wing, and we'd like to offer them as a valid and more precise/performant alternative to the ORES models. If you'd like to try them, we'd be happy to help with the migration process! Thanks in advance,
ML team
Hi @isaranto, that would be awesome!
Hello! We have noticed that Wikiloop might be using the mediawiki.revision-score stream. However, the mediawiki.revision-score stream will also be deprecated with ORES. For users of the stream, the Wikimedia ML team plans to offer several streams, each associated with a single model score, such as:
- mediawiki.revision-score-goodfaith
- mediawiki.revision-score-damaging
Alternatively, we have new models called Revert Risk to replace goodfaith and damaging, and we could provide a stream for the revert-risk score.
If Wikiloop is currently ingesting events from the mediawiki.revision-score stream, please let us know your preference.
You can find more information in our thread: https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/X5KUTNHW646KYGE7V6SDSHVGVOL5DFDX/
@xinbenlv Hi! Is what @AikoChou wrote good in your opinion? We are trying to figure out remaining users of the revision-score stream :)
I will take a look. thank you!
It would be great if we could get a score of "borderline-ness", because we want to let humans prioritize reviewing those edits that are borderline between damaging and goodfaith.
@xinbenlv could you clarify the above point? More specifically, we'd need to understand whether you'd need streams or whether you'd be happy to query the new API (https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage).
We also offer a new model called Revert Risk Language Agnostic (specs, API), which should be a replacement for both damaging and goodfaith (the latter are still available via Lift Wing though, if needed).
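For reference, querying the Revert Risk Language Agnostic model on Lift Wing is a single POST per revision. The sketch below follows the endpoint and payload shape described in the Lift Wing usage docs linked above; the exact model path and response fields should be verified against the current documentation, and the response-parsing line is an assumption about the output schema.

```python
import json
from urllib import request

# Public Lift Wing endpoint for the language-agnostic Revert Risk model,
# per https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage
# (verify the model name and payload fields against the current docs).
LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "revertrisk-language-agnostic:predict"
)

def build_revert_risk_request(lang: str, rev_id: int) -> request.Request:
    """Build a POST request asking Lift Wing to score one revision."""
    body = json.dumps({"lang": lang, "rev_id": rev_id}).encode("utf-8")
    return request.Request(
        LIFTWING_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_revert_risk_request("en", 12345)
    print(req.full_url)
    # Actually sending it requires network access; the response-field path
    # below is an assumption about the output schema:
    # with request.urlopen(req) as resp:
    #     score = json.load(resp)["output"]["probabilities"]["true"]
```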
Let me give a bit of context about why we use ORES in WikiLoop DoubleCheck in the first place: WikiLoop DoubleCheck intends to "put the human in the loop" for fact checking with "AI support", so we use ORES to find "borderline suspicious edits".
"Borderline" means:
- when an edit is obviously bad, it's an easy revert, so it's less valuable to spend a human's time on it.
- when an edit is obviously good, it's an easy OK, so it is deprioritized for review too.
- when an edit is neither obviously good nor obviously bad, reviewing it is the best use of a human's time.
With such context, what's your suggested API?
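The prioritization above can be made concrete with a simple transform: given a model probability p that an edit is damaging, a "borderline-ness" score can peak when p is near 0.5 and vanish at the extremes. This is an illustrative sketch, not anything from ORES or Lift Wing:

```python
def borderline_score(p_damaging: float) -> float:
    """Score in [0, 1]: 1.0 when the model is maximally unsure (p = 0.5),
    0.0 when it is certain either way (p = 0.0 or 1.0)."""
    if not 0.0 <= p_damaging <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return 1.0 - abs(2.0 * p_damaging - 1.0)

# Review queue: most borderline edits first (rev ids are made up).
edits = {"rev_a": 0.05, "rev_b": 0.52, "rev_c": 0.97}
queue = sorted(edits, key=lambda r: borderline_score(edits[r]), reverse=True)
```

With these made-up scores, `rev_b` (0.52, highly uncertain) sorts ahead of the near-certain `rev_a` and `rev_c`.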
@xinbenlv thanks for the explanation! I'd go for Revert Risk for two reasons:
- It is a brand-new model, trained on recent data, and fully supported by the WMF Research team. The goodfaith/damaging models are still supported, but they will not be improved any further, since they are old and difficult to maintain (so we'd prefer to simply deprecate them in the future).
- It gives a single score for a specific rev-id, assigning it a value that tells how confident the model is that a revert needs to happen. Based on this score you can decide whether an edit fits your obviously good/bad use cases or not. The score is basically a probability, so something like 1-10% or 95-99% could be ranges where you don't want a human involved, while for the rest you do (I am writing numbers without much thought, just to give an idea :)).
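That triage could look like the following sketch; the thresholds are the illustrative numbers from the comment above, not recommendations, and the function name is made up:

```python
def needs_human_review(revert_risk: float,
                       low: float = 0.10, high: float = 0.95) -> bool:
    """Return True when the revert-risk probability is neither
    'obviously good' (below `low`) nor 'obviously bad' (above `high`).

    The defaults mirror the 1-10% / 95-99% ranges sketched in the
    discussion; real thresholds should be tuned against review capacity.
    """
    return low <= revert_risk <= high
```

So an edit scored 0.03 or 0.97 would skip the human queue, while 0.50 would land in it.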
On the implementation side, we (as ML WMF) are trying to deprecate the revision-score stream from https://stream.wikimedia.org, since we'd like to break it down into multiple streams. Basically, instead of having a lot of scores from different models for every revision-id (like in revision-score), we will have one stream per model (rev-id -> model score). We still don't have a stream for Revert Risk, but we are planning to add one soon-ish.
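For what it's worth, consuming one of those per-model streams stays a small client-side job: the EventStreams service speaks Server-Sent Events, where each event's JSON rides on a `data:` line. A sketch follows; the stream name is one of the candidates mentioned above and does not exist yet for Revert Risk, and the event field names are assumptions.

```python
import json
from urllib import request

# Hypothetical per-model stream name; the Revert Risk stream is not live yet.
STREAM_URL = (
    "https://stream.wikimedia.org/v2/stream/mediawiki.revision-score-goodfaith"
)

def parse_sse_data(lines):
    """Yield the JSON payload of each Server-Sent Events 'data:' line.

    Wikimedia events typically fit on a single data line; multi-line
    SSE data fields would need concatenation, which this sketch skips.
    """
    for line in lines:
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

if __name__ == "__main__":
    # Needs network access; field names ('rev_id', 'scores') are assumed.
    with request.urlopen(STREAM_URL) as resp:
        text_lines = (raw.decode("utf-8").rstrip("\n") for raw in resp)
        for event in parse_sse_data(text_lines):
            print(event.get("rev_id"), event.get("scores"))
```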
We checked your code and found references to revision-score, so what we are wondering is:
- Are you still actively consuming data from it? Or do you get your scores directly from the ORES API on demand?
- If you use the stream, would it be OK to move to another stream (like Revert Risk, if you decide to migrate to that model) during the next couple of months (waiting for us to make it available)? In that case it would be without any data from revision-score, since we'd deprecate it for good.
We don't want to break users, so we are trying to follow up as best as we can to support all of you :) Lemme know!
To be more precise: https://github.com/google/wikiloop-doublecheck/blob/master/server/ingest/ores-stream.ts#L26
The above is the snippet of code that we are referring to, but since I don't see any trace of traffic from you related to it, I am wondering if it is running or not :)
@xinbenlv thoughts? :)
Sorry for the late response. Let me take a look.
Thanks! We have already stopped the stream (https://phabricator.wikimedia.org/T342116), lemme know if it impacts your project.