wikiloop-doublecheck Project Proposal: Enhance Cross-Edit Suspicious Pattern Detection for WikiLoop Battlefield

Open xinbenlv opened this issue 4 years ago • 0 comments

Author: [email protected], TODO: add more

Project advisors: wenjies (to invite), ahalfaker (to invite)

Background

Many cross-edit patterns are not well supported by existing tools, e.g.

repeated vandalism by the same user
repeated vandalism on the same article
edit war
canvassing
possible sock puppetry

Such larger-scale problematic or damaging behavior usually involves multiple edits and is hard to detect by reviewing individual edits in isolation. See WP:RUNAWAY.

This project intends to make detecting such behaviors easier for Wikipedians.

Engineering milestones

Milestone 1: prototype a Trivial Detection and run a headroom analysis:

Propose a hypothesis on what kinds of patterns could be identified in which data.
Write some early data analysis programs to identify existing cases.
Publish the findings as a paper that describes the approach as "Trivial Detection" and reports how many behavioral issues the program identifies, measured in number of articles and number of users.
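
As a rough illustration of what a "Trivial Detection" pass could look like, the sketch below counts likely-damaging edits per user and per article and flags anything that repeats. It is only a sketch under stated assumptions: the record fields (`user`, `title`, `damaging`), the score threshold, and the repeat count are all hypothetical placeholders, not values chosen by this proposal.

```python
from collections import Counter


def trivial_detection(revisions, score_threshold=0.8, min_repeats=3):
    """Flag users and articles with repeated likely-damaging edits.

    `revisions` is an iterable of dicts with the hypothetical keys
    "user", "title" and "damaging" (a probability in [0, 1]).
    Both thresholds are illustrative placeholders.
    """
    by_user = Counter()
    by_title = Counter()
    for rev in revisions:
        # Only edits scored as likely damaging contribute to the counts.
        if rev["damaging"] >= score_threshold:
            by_user[rev["user"]] += 1
            by_title[rev["title"]] += 1
    users = sorted(u for u, n in by_user.items() if n >= min_repeats)
    titles = sorted(t for t, n in by_title.items() if n >= min_repeats)
    return users, titles
```

Counting the flagged users and articles over a full dataset would give the kind of headline numbers the paper is meant to report.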

Datasets: we will prepare a large-scale dataset of all revisions available on Wikipedia, together with their ORES and WikiTrust scores.

Small datasets can be fetched from API:Revisions, for example:

Sample query for fetching the latest revision of multiple Wikipedia articles:

curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Whitehouse%7CCoronavirus&formatversion=2&rvprop=timestamp%7Cuser%7Coresscores%7Ccomment%7Cids&rvslots=main'

Sample query for fetching the five most recent revisions of an article with ORES scores:

curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Coronavirus&rvprop=oresscores%7Ctimestamp%7Cuser%7Ccomment%7Cids&rvslots=main&rvlimit=5&rvdir=older'

See https://www.mediawiki.org/wiki/API:Revisions for more examples.
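
For analysis scripts, the same queries can be issued from Python. The sketch below uses only the standard library; it rebuilds the query string from the curl examples above and flattens a `formatversion=2` response into one row per revision. The function names and the User-Agent string are made up for illustration, and the nested shape of the `oresscores` field is not interpreted here, since it is passed through as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
DEFAULT_RVPROP = ("timestamp", "user", "oresscores", "comment", "ids")


def build_revisions_url(titles, rvprop=DEFAULT_RVPROP, rvlimit=None):
    """Build an API:Revisions query URL for one or more page titles."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": "|".join(titles),
        "rvprop": "|".join(rvprop),
        "rvslots": "main",
    }
    if rvlimit is not None:
        # To my knowledge the API accepts rvlimit only for a single page.
        params["rvlimit"] = str(rvlimit)
    return API_URL + "?" + urlencode(params)


def flatten_revisions(response):
    """Turn a formatversion=2 response into a list of per-revision dicts."""
    rows = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            rows.append({"title": page.get("title"), **rev})
    return rows


def fetch_revisions(titles, **kwargs):
    """Fetch and flatten revisions (performs a network request)."""
    request = Request(
        build_revisions_url(titles, **kwargs),
        headers={"User-Agent": "cross-edit-sketch/0.1 (illustrative example)"},
    )
    with urlopen(request) as resp:
        return flatten_revisions(json.load(resp))
```

For example, `fetch_revisions(["Coronavirus"], rvlimit=5)` would correspond roughly to the second curl query, while `fetch_revisions(["Whitehouse", "Coronavirus"])` mirrors the first.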

Milestone 2: production

Sub-project 1: Integrate Trivial Detection with WikiLoop Battlefield

Design how to integrate such a pipeline into WikiLoop Battlefield.
Identify the backend challenges and design data structures that improve performance so that it works in a real-time review system like WikiLoop Battlefield.
Identify the frontend challenges and design for a better user experience, such as prefetching and caching; structure the code layout to make future expansion easier.
Implement backend and frontend features that identify the suspicious behavior and prompt the user to submit a WP:AIV, WP:BLOCK or WP:PROTECT request, see #223.
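
One possible backend data structure for the real-time case is a per-key sliding window, so that each incoming edit updates a counter in amortized constant time instead of rescanning history. The sketch below is an assumption about how this might be done, not the WikiLoop Battlefield design; the class name, window length, and threshold are hypothetical, and it assumes timestamps for a given key arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Track recent suspicious edits per key (for example a user or a page)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)

    def add(self, key, timestamp):
        """Record one suspicious edit at `timestamp` (seconds).

        Returns True when `key` has at least `threshold` recorded edits
        inside the trailing window, which could trigger a prompt to the
        reviewer.
        """
        events = self._events[key]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the trailing window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) >= self.threshold
```

Keying one instance by user name and another by page title would cover both the "same user" and "same article" patterns from the background section.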

Sub-project 2: Machine learning to improve detection, compared with the Trivial Detection baseline

Design, build, and tune machine learning models to improve precision and recall in identifying pages under active damage and suspicious users. Use WP:BLOCK, with examples of blocked users, to assess how likely a user is to be blocked and to recommend blocking. Use WP:PROTECT, with examples of protected pages, to assess how likely a page is to be protected.
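
To make the comparison against the baseline concrete, precision and recall can be computed by treating the set of users who were actually blocked (or pages actually protected) as ground truth. The helper below is a minimal sketch; the function name and the convention of returning 0.0 for empty sets are illustrative choices.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) of `predicted` against ground truth `actual`.

    Both arguments are iterables of identifiers, such as user names
    flagged for blocking versus users who were in fact blocked.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Running it once on the Trivial Detection output and once on the model output, against the same ground truth, shows whether the model improves on the baseline in both measures.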

May 18 '20 23:05 xinbenlv
