
Project Proposal: Enhance Cross-Edit Suspicious Pattern Detection for WikiLoop Battlefield

Open xinbenlv opened this issue 4 years ago • 0 comments

Author: [email protected], TODO: add more

Project advisors: wenjies (to invite), ahalfaker (to invite)

Background

Many kinds of cross-edit pattern identification are not well supported by existing tools, e.g.

Main references: WP:DISRUPT, WP:VANDAL

This kind of larger-scale problematic or damaging behavior usually involves multiple edits and is hard to detect by reviewing individual edits alone. See WP:RUNAWAY.

This project intends to make detecting such behaviors easier for Wikipedians.

Engineering milestones

Milestone 1: Prototype a Trivial Detection and run a headroom analysis:

  • Propose hypotheses about which kinds of patterns can be identified in which data.
  • Write early data-analysis programs to identify existing cases.
  • Publish the findings as a paper that describes the detection as "Trivial Detection" and reports how many behavioral issues the program identifies, in number of articles and number of users.
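As a minimal sketch of what one "Trivial Detection" rule could look like: group revisions by user and flag any user with several likely-damaging edits inside a short time window. The record fields (`user`, `timestamp`, `damaging`), the 0.8 threshold, the minimum of 3 edits and the 24-hour window are all assumptions made for illustration; the proposal does not fix any of them.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_users(revisions, threshold=0.8, min_edits=3, window=timedelta(hours=24)):
    """Return users with at least `min_edits` likely-damaging edits in any `window`.

    `revisions` is an iterable of dicts with hypothetical keys:
    "user", "timestamp" (ISO 8601, e.g. "2020-05-18T10:00:00Z") and
    "damaging" (a probability-like score between 0 and 1).
    """
    by_user = defaultdict(list)
    for rev in revisions:
        if rev["damaging"] >= threshold:
            # Normalize the trailing "Z" so fromisoformat works on older Pythons.
            ts = datetime.fromisoformat(rev["timestamp"].replace("Z", "+00:00"))
            by_user[rev["user"]].append(ts)

    flagged = set()
    for user, times in by_user.items():
        times.sort()
        start = 0
        # Sliding window over the sorted timestamps of this user's risky edits.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= min_edits:
                flagged.add(user)
                break
    return flagged
```

Counting how many users and articles such a rule flags on historical data would give the kind of headroom numbers the milestone asks for.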

Datasets: we will prepare a large-scale dataset with all the revisions we have from Wikipedia, plus their ORES and WikiTrust scores.

Small datasets can be fetched from API:Revisions, for example:

Sample query for fetching the latest revision of multiple Wikipedia articles:

curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Whitehouse%7CCoronavirus&formatversion=2&rvprop=timestamp%7Cuser%7Coresscores%7Ccomment%7Cids&rvslots=main'

Sample query for fetching revisions with ORES scores by article:

curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Coronavirus&rvprop=oresscores%7Ctimestamp%7Cuser%7Ccomment%7Cids&rvslots=main&rvlimit=5&rvdir=older'

See https://www.mediawiki.org/wiki/API:Revisions for more examples.
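The curl queries above can also be issued from a script. The following is a sketch only: the helper names (`build_query`, `parse_revisions`, `fetch`) and the User-Agent string are made up for this example, and the `oresscores` value is passed through untouched because its exact shape is not specified in this proposal.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"


def build_query(titles, limit=None):
    """Build an API:Revisions URL using the same parameters as the curl samples."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": "|".join(titles),
        "rvprop": "ids|timestamp|user|comment|oresscores",
        "rvslots": "main",
    }
    if limit is not None:
        # rvlimit may only be used when a single title is requested.
        params["rvlimit"] = str(limit)
    return API + "?" + urllib.parse.urlencode(params)


def parse_revisions(response):
    """Flatten a formatversion=2 response into one dict per revision."""
    rows = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            rows.append({
                "title": page["title"],
                "revid": rev["revid"],
                "user": rev.get("user"),
                "timestamp": rev.get("timestamp"),
                "oresscores": rev.get("oresscores", {}),  # kept opaque here
            })
    return rows


def fetch(titles, limit=None):
    """Run the query over the network and return the flattened revisions."""
    request = urllib.request.Request(
        build_query(titles, limit),
        headers={"User-Agent": "cross-edit-sketch/0.1 (example)"},  # hypothetical
    )
    with urllib.request.urlopen(request) as resp:
        return parse_revisions(json.load(resp))
```

For example, `fetch(["Coronavirus"], limit=5)` corresponds to the second curl query, and `fetch(["Whitehouse", "Coronavirus"])` to the first.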

Milestone 2: production

Sub-project 1: Integrate Trivial Detection with WikiLoop Battlefield

  • Write an engineering design for integrating such a pipeline into WikiLoop Battlefield.
  • Identify the backend challenges and design data structures that perform well enough to work in a real-time review system like WikiLoop Battlefield.
  • Identify the frontend challenges and design for a better user experience, such as prefetching and caching; structure the code layout to ease future expansion.
  • Implement backend and frontend features that identify the suspicious behavior and prompt the user to submit WP:AIV, WP:BLOCK or WP:PROTECT; see #223.
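One possible backend data structure for the real-time case, sketched under assumptions: an in-memory index that keeps each user's recent revisions in a time-ordered queue, so that "what else did this user just edit?" can be answered without rescanning history. The class name, the one-hour window and the use of plain numeric timestamps are illustrative choices, and the sketch assumes revisions arrive in non-decreasing time order per user.

```python
from collections import defaultdict, deque


class RecentEditIndex:
    """Per-user sliding window of recent revision IDs (hypothetical helper)."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self._by_user = defaultdict(deque)  # user -> deque of (timestamp, revid)

    def _evict(self, queue, now):
        # Drop entries that have fallen out of the window.
        while queue and now - queue[0][0] > self.window:
            queue.popleft()

    def add(self, user, revid, timestamp):
        """Record a new revision; assumes timestamps do not go backwards per user."""
        queue = self._by_user[user]
        queue.append((timestamp, revid))
        self._evict(queue, timestamp)

    def recent(self, user, now):
        """Return revision IDs by `user` that are still inside the window at `now`."""
        queue = self._by_user.get(user)
        if queue is None:
            return []
        self._evict(queue, now)
        return [revid for _, revid in queue]
```

Each `add` and `recent` call does amortized constant work per revision, which is the property a real-time review queue would need; persistence and eviction of idle users are left out of this sketch.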

Sub-project 2: Machine learning to improve detection, compared with the Trivial Detection baseline

  • Design, build and tune machine learning models to improve precision and recall in identifying pages under active damage and suspicious users. Use WP:BLOCK, with examples of blocked users, to assess how likely a user is to be blocked, and recommend blocking. Use WP:PROTECT, with examples of protected pages, to assess how likely a page is to be protected.
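To compare a model against the Trivial Detection baseline, both can be scored against the same ground truth, for instance the set of users who were actually blocked. A minimal sketch follows; the function name and the user sets in the usage example are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) of a predicted set against the ground-truth set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: users later blocked vs. users each detector flagged.
blocked = {"A", "B", "C", "D"}
baseline_flags = {"A", "E"}
model_flags = {"A", "B", "C", "F"}

print(precision_recall(baseline_flags, blocked))  # (0.5, 0.25)
print(precision_recall(model_flags, blocked))     # (0.75, 0.75)
```

The same function applies to pages by substituting the set of pages that were protected for the set of blocked users.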

xinbenlv, May 18 '20 23:05