wikiloop-doublecheck
Help migrating away from ORES
Hi! I am part of the Wikimedia ML team. We are starting the migration of ORES clients to another infrastructure, since we are planning to deprecate ORES. More info at https://wikitech.wikimedia.org/wiki/ORES
TL;DR:
The ORES infrastructure is going to be replaced by Lift Wing, a more modern, Kubernetes-based service. All the ORES models (damaging, goodfaith, etc.) are already running on Lift Wing; more on how to use them at https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage. We also have new models called Revert Risk, meant to replace goodfaith and damaging, for example. They are available on Lift Wing, and we'd like to offer them as a valid and more precise/performant alternative to the ORES models. If you'd like to try them, we'd be happy to help with the migration process! Thanks in advance,
ML team
Hi @isaranto, that would be awesome!
Hello! We have noticed that Wikiloop might be using the mediawiki.revision-score stream. However, the mediawiki.revision-score stream will also be deprecated with ORES. For users of the stream, the Wikimedia ML team plans to offer several streams, each associated with a single model score, such as:
- mediawiki.revision-score-goodfaith
- mediawiki.revision-score-damaging
Alternatively, we have new models called Revert Risk to replace goodfaith and damaging, and we could provide a stream for the revert-risk score.
If Wikiloop is currently ingesting events from the mediawiki.revision-score stream, please let us know your preference.
You can find more information in our thread: https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/X5KUTNHW646KYGE7V6SDSHVGVOL5DFDX/
@xinbenlv Hi! Is what @AikoChou wrote good in your opinion? We are trying to figure out remaining users of the revision-score stream :)
I will take a look. thank you!
It would be great if we could get a score of "borderline-ness", because we want to let humans prioritize reviewing those edits that are borderline between damaging and goodfaith.
@xinbenlv could you clarify the above point? More specifically, we'd need to understand whether you'd need streams or whether you'd be happy to query the new API (https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage).
We also offer a new model called Revert Risk Language Agnostic (specs, API), which should be a replacement for both damaging and goodfaith (the latter are still available via Lift Wing though, if needed).
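For reference, querying the Revert Risk Language Agnostic model on Lift Wing is a single POST per revision. The sketch below follows the endpoint and payload shape described in the Lift Wing usage docs linked above; the exact model path and response fields should be verified against the current documentation, and the response-parsing line is an assumption about the output schema.

```python
import json
from urllib import request

# Public Lift Wing endpoint for the language-agnostic Revert Risk model,
# per https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage
# (verify the model name and payload fields against the current docs).
LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "revertrisk-language-agnostic:predict"
)

def build_revert_risk_request(lang: str, rev_id: int) -> request.Request:
    """Build a POST request asking Lift Wing to score one revision."""
    body = json.dumps({"lang": lang, "rev_id": rev_id}).encode("utf-8")
    return request.Request(
        LIFTWING_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_revert_risk_request("en", 12345)
    print(req.full_url)
    # Actually sending it requires network access; the response-field path
    # below is an assumption about the output schema:
    # with request.urlopen(req) as resp:
    #     score = json.load(resp)["output"]["probabilities"]["true"]
```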
Let me give a bit of context about why we use ORES in WikiLoop DoubleCheck in the first place: WikiLoop DoubleCheck intends to "put the human in the loop" for fact checking with "AI support", so we use ORES to find "borderline suspicious edits".
"Borderline" means:
- when an edit is obviously bad, it's an easy revert, so it's less valuable to spend a human's time on it.
- when an edit is obviously good, it's an easy OK, so it is deprioritized for review too.
- when an edit is neither obviously good nor obviously bad, reviewing it is the best use of a human's time.
With such context, what's your suggested API?
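The prioritization above can be made concrete with a simple transform: given a model probability p that an edit is damaging, a "borderline-ness" score can peak when p is near 0.5 and vanish at the extremes. This is an illustrative sketch, not anything from ORES or Lift Wing:

```python
def borderline_score(p_damaging: float) -> float:
    """Score in [0, 1]: 1.0 when the model is maximally unsure (p = 0.5),
    0.0 when it is certain either way (p = 0.0 or 1.0)."""
    if not 0.0 <= p_damaging <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return 1.0 - abs(2.0 * p_damaging - 1.0)

# Review queue: most borderline edits first (rev ids are made up).
edits = {"rev_a": 0.05, "rev_b": 0.52, "rev_c": 0.97}
queue = sorted(edits, key=lambda r: borderline_score(edits[r]), reverse=True)
```

With these made-up scores, `rev_b` (0.52, highly uncertain) sorts ahead of the near-certain `rev_a` and `rev_c`.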
@xinbenlv thanks for the explanation! I'd go for Revert Risk for two reasons:
- It is a brand-new model, trained on recent data, and fully supported by the WMF Research team. The goodfaith/damaging models are still supported, but they will not be improved any further, since they are old and difficult to maintain (so we'd prefer to simply deprecate them in the future).
- It gives a single score for a specific rev-id, assigning it a value that tells how confident the model is that a revert needs to happen. Based on this score you can decide whether an edit fits your obviously good/bad use cases or not. The score is basically a probability, so something like 1-10% or 95-99% could be ranges where you don't want a human involved, while for the rest you do (I am writing numbers without much thought, just to give an idea :)).
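That triage could look like the following sketch; the thresholds are the illustrative numbers from the comment above, not recommendations, and the function name is made up:

```python
def needs_human_review(revert_risk: float,
                       low: float = 0.10, high: float = 0.95) -> bool:
    """Return True when the revert-risk probability is neither
    'obviously good' (below `low`) nor 'obviously bad' (above `high`).

    The defaults mirror the 1-10% / 95-99% ranges sketched in the
    discussion; real thresholds should be tuned against review capacity.
    """
    return low <= revert_risk <= high
```

So an edit scored 0.03 or 0.97 would skip the human queue, while 0.50 would land in it.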
On the implementation side, we (as ML WMF) are trying to deprecate the revision-score stream from https://stream.wikimedia.org, since we'd like to break it down into multiple streams. Basically, instead of having a lot of scores from different models for every revision-id (like in revision-score), we will have one stream per model (rev-id -> model score). We still don't have a stream for Revert Risk, but we are planning to add one soon-ish.
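For what it's worth, consuming one of those per-model streams stays a small client-side job: the EventStreams service speaks Server-Sent Events, where each event's JSON rides on a `data:` line. A sketch follows; the stream name is one of the candidates mentioned above and does not exist yet for Revert Risk, and the event field names are assumptions.

```python
import json
from urllib import request

# Hypothetical per-model stream name; the Revert Risk stream is not live yet.
STREAM_URL = (
    "https://stream.wikimedia.org/v2/stream/mediawiki.revision-score-goodfaith"
)

def parse_sse_data(lines):
    """Yield the JSON payload of each Server-Sent Events 'data:' line.

    Wikimedia events typically fit on a single data line; multi-line
    SSE data fields would need concatenation, which this sketch skips.
    """
    for line in lines:
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

if __name__ == "__main__":
    # Needs network access; field names ('rev_id', 'scores') are assumed.
    with request.urlopen(STREAM_URL) as resp:
        text_lines = (raw.decode("utf-8").rstrip("\n") for raw in resp)
        for event in parse_sse_data(text_lines):
            print(event.get("rev_id"), event.get("scores"))
```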
We checked your code and found references to revision-score, so what we are wondering is:
- Are you still actively consuming data from it? Or do you get your scores directly from the ORES API on demand?
- If you use the stream, would it be OK to move to another stream (like Revert Risk, if you decide to migrate to that model) during the next couple of months (waiting for us to make it available)? In that case it would be without any data from revision-score, since we'd deprecate it for good.
We don't want to break users, so we are trying to follow up as best as we can to support all of you :) Lemme know!
To be more precise: https://github.com/google/wikiloop-doublecheck/blob/master/server/ingest/ores-stream.ts#L26
The above is the snippet of code that we are referring to, but since I don't see any trace of traffic from you related to it, I am wondering if it is running or not :)
@xinbenlv thoughts? :)
Sorry for the late response. Let me take a look.
Thanks! We have already stopped the stream (https://phabricator.wikimedia.org/T342116), lemme know if it impacts your project.