wikiloop-doublecheck
Project Proposal: Enhance Cross-Edit Suspicious Pattern Detection for WikiLoop Battlefield
Author: [email protected], TODO: add more
Project advisors: wenjies (to invite), ahalfaker (to invite)
Background
Many cross-edit patterns are not well supported by existing detection tools, e.g.
- repeated vandalism by the same user
- repeated vandalism on the same article
- edit war
- canvassing
- possible sock puppetry
Main references: WP:DISRUPT, WP:VANDAL
This kind of larger-scale problematic or damaging behavior usually involves multiple edits and is hard to detect by reviewing individual edits alone (see WP:RUNAWAY).
The project intends to make detecting such behaviors easier for Wikipedians.
Engineering milestones
Milestone 1: prototype a Trivial Detection and headroom analysis
- Propose a hypothesis on what kinds of patterns can be identified in which data.
- Write early data analysis programs to identify existing cases.
- Publish the findings as a paper describing the detection as "Trivial Detection" and reporting how many behavioral issues the program identifies, measured in number of articles and number of users.
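As an illustration of what a Trivial Detection could look like, here is a minimal sketch that groups revisions by user and by article and flags any group with several likely-damaging edits. The `Revision` record, its `damaging` probability field, and both thresholds are assumptions made for this example; they are not part of the proposal and would need tuning in the headroom analysis.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """Hypothetical minimal revision record used only for this sketch."""
    rev_id: int
    title: str
    user: str
    damaging: float  # assumed probability that the edit is damaging, in [0, 1]


def flag_repeated_damage(revisions, score_threshold=0.8, min_count=3):
    """Return (users, titles) that have at least `min_count` revisions whose
    damaging score is >= `score_threshold`. Both values are placeholders."""
    by_user = defaultdict(list)
    by_title = defaultdict(list)
    for rev in revisions:
        if rev.damaging >= score_threshold:
            by_user[rev.user].append(rev.rev_id)
            by_title[rev.title].append(rev.rev_id)
    users = {u: ids for u, ids in by_user.items() if len(ids) >= min_count}
    titles = {t: ids for t, ids in by_title.items() if len(ids) >= min_count}
    return users, titles


# Made-up sample data: UserA makes three likely-damaging edits, and the
# article "Coronavirus" receives three likely-damaging edits overall.
sample = [
    Revision(1, "Coronavirus", "UserA", 0.95),
    Revision(2, "Whitehouse", "UserA", 0.90),
    Revision(3, "Coronavirus", "UserB", 0.10),
    Revision(4, "Coronavirus", "UserA", 0.85),
    Revision(5, "Coronavirus", "UserC", 0.92),
]
flagged_users, flagged_titles = flag_repeated_damage(sample)
```

Counting per user and per article in the same pass keeps the two patterns (repeated vandalism by one user, repeated vandalism on one article) comparable under the same thresholds.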
Datasets: we will prepare a large-scale dataset containing all revisions available on Wikipedia, together with their ORES and WikiTrust scores.
For small datasets, the data can be fetched from API:Revisions, for example:
Sample query for fetching the latest revision of multiple Wikipedia articles:
curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Whitehouse%7CCoronavirus&formatversion=2&rvprop=timestamp%7Cuser%7Coresscores%7Ccomment%7Cids&rvslots=main'
Sample query for fetching revisions with ORES scores by article:
curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Coronavirus&rvprop=oresscores%7Ctimestamp%7Cuser%7Ccomment%7Cids&rvslots=main&rvlimit=5&rvdir=older'
See https://www.mediawiki.org/wiki/API:Revisions for more examples.
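For analysis scripts, the same queries can be built and parsed in Python. The sketch below only constructs the URL and parses a response, without making a network call. The shape of `SAMPLE_RESPONSE` (a list of pages under `formatversion=2`, with an `oresscores` object per revision) is an assumption to verify against the live API, and its values are made up.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_revisions_url(titles, limit=None):
    """Build an API:Revisions query URL for the given article titles."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": "|".join(titles),
        "rvprop": "ids|timestamp|user|comment|oresscores",
        "rvslots": "main",
    }
    if limit is not None:
        # rvlimit may only be used when a single page is queried.
        params["rvlimit"] = str(limit)
    return API_URL + "?" + urlencode(params)


def extract_rows(response):
    """Flatten a parsed response into (title, revid, user, damaging) tuples.
    `damaging` is None when no score is present for a revision."""
    rows = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            scores = rev.get("oresscores") or {}
            damaging = scores.get("damaging", {}).get("true")
            rows.append((page["title"], rev["revid"], rev.get("user"), damaging))
    return rows


# Hypothetical response with the assumed shape; all values are invented.
SAMPLE_RESPONSE = json.loads("""
{"query": {"pages": [{"title": "Coronavirus", "revisions": [
  {"revid": 1001, "user": "ExampleUser",
   "oresscores": {"damaging": {"true": 0.91, "false": 0.09}}},
  {"revid": 1000, "user": "OtherUser", "oresscores": []}
]}]}}
""")

url = build_revisions_url(["Whitehouse", "Coronavirus"])
rows = extract_rows(SAMPLE_RESPONSE)
```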
Milestone 2: production
Sub-project 1: Integrate Trivial Detection with WikiLoop Battlefield
- Produce an engineering design for integrating the detection pipeline into WikiLoop Battlefield.
- Identify the backend challenges and design data structures that improve performance so that the pipeline works in a real-time review system like WikiLoop Battlefield.
- Identify the frontend challenges and design for a better user experience, such as prefetching and caching; structure the code layout for easier future expansion.
- Implement backend and frontend features that identify suspicious behavior and prompt the user to submit a WP:AIV, WP:BLOCK or WP:PROTECT request; see #223.
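One possible backend data structure for real-time use is a per-key sliding-window counter, so that each incoming flagged revision updates the state in amortized constant time instead of rescanning history. The sketch below is only an illustration: the class name, the one-hour window, and the assumption that timestamps arrive in non-decreasing order per key are hypothetical and not taken from WikiLoop Battlefield.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key (a user or an article) within the most recent
    `window_seconds`. Hypothetical helper for illustration only."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def add(self, key, timestamp):
        """Record one event and return how many events for `key` are still
        inside the window. Assumes non-decreasing timestamps per key."""
        events = self._events[key]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


counter = SlidingWindowCounter(window_seconds=3600)
# Four flagged edits by the same user at these times (in seconds).
counts = [counter.add("UserA", t) for t in (0, 600, 1200, 4500)]
```

A reviewer-facing prompt could then be triggered when the returned count crosses a chosen threshold; the threshold itself would come out of the Milestone 1 analysis.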
Sub-project 2: Machine learning to improve detection, compared with the Trivial Detection baseline
- Design, build and tune machine learning models to improve precision and recall in identifying pages under active damage and suspicious users. Use WP:BLOCK, with examples of blocked users, to assess how likely a user is to be blocked, and recommend blocking accordingly. Use WP:PROTECT, with examples of protected pages, to assess how likely a page is to be protected.
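To compare a model against the Trivial Detection baseline, both can be scored against the same labeled set, for example users who were later blocked. The following minimal sketch computes precision and recall from sets of flagged and actually blocked users; the function name and all sample data are hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) of `predicted` against `actual`.
    Both arguments are collections of identifiers such as user names."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Invented example data, for illustration only.
blocked_users = {"UserA", "UserC", "UserD", "UserE"}  # ground truth
baseline_flags = {"UserA", "UserB"}                   # Trivial Detection output
model_flags = {"UserA", "UserB", "UserC", "UserD"}    # model output

baseline_scores = precision_recall(baseline_flags, blocked_users)
model_scores = precision_recall(model_flags, blocked_users)
```

The same function applies to page protection by passing sets of page titles instead of user names.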