Euphemism
Self-Supervised Euphemism Detection and Identification for Content Moderation, IEEE S&P (Oakland) 2021
This repo is the Python 3 implementation of Self-Supervised Euphemism Detection and Identification for Content Moderation (42nd IEEE Symposium on Security and Privacy 2021).
Table of Contents
- Introduction
- Requirements
- Data
- Code
- Acknowledgement
- Citation
Introduction
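This project addresses two tasks: euphemism detection (finding words that are used as euphemisms for a set of target keywords) and euphemism identification (determining which target keyword each detected euphemism refers to).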
Requirements
The code is based on Python 3.7. Install the dependencies as follows:
pip install -r requirements.txt
Data
Due to licensing restrictions, we do not distribute the datasets ourselves; instead, we direct readers to their respective sources.
Drug:
- Raw Text Corpus: Please request the raw text corpus, reddit.csv, from Wanzheng Zhu ([email protected]) or Professor Nicolas Christin.
- Ground Truth: We summarize the drug euphemism ground truth list (provided by the DEA Intelligence Report -- Slang Terms and Code Words: A Reference for Law Enforcement Personnel) in data/euphemism_answer_drug.txt and data/target_keywords_drug.txt.
Weapon:
- Raw Text Corpus: Please request the datasets from the authors of "What is Gab: A Bastion of Free Speech or an Alt-Right Echo Chamber" (Zannettou et al. 2018), "Identifying Products in Online Cybercrime Marketplaces: A Dataset for Fine-grained Domain Adaptation" (Durrett et al. 2017), and "Tools for Automated Analysis of Cybercriminal Markets" (Portnoff et al. 2017), and see the examples on Slangpedia.
- Ground Truth: Please refer to The Online Slang Dictionary, Slangpedia, and The Urban Thesaurus.
Sexuality:
- Raw Text Corpus: We use 2,894,869 processed Gab posts, collected from Jan 2018 to Oct 2018 by PushShift.
- Ground Truth: Please refer to The Online Slang Dictionary.
Sample:
- Raw Text Corpus: We provide a sample dataset, data/sample.txt, so that readers can run the code.
- Ground Truth: Same as the Drug dataset (see data/euphemism_answer_drug.txt and data/target_keywords_drug.txt).
- The Sample dataset is intended only for trying out the code; results obtained on it are not reliable.
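Before running the code on the Sample dataset, it can help to confirm that the three files named above are in place. The following is a minimal sketch; the helper `missing_data_files` is hypothetical and not part of this repository, and only the file paths come from this README.

```python
from pathlib import Path

# File paths as listed in this README. The helper below is illustrative
# only and is not shipped with the repository.
SAMPLE_FILES = (
    "data/sample.txt",
    "data/euphemism_answer_drug.txt",
    "data/target_keywords_drug.txt",
)


def missing_data_files(repo_root=".", required=SAMPLE_FILES):
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files(".")
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All sample data files are present.")
```

Run it from the repository root; an empty result means the sample run has the inputs it expects.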
Code
1. Fine-tune the BERT model.
Please refer to this link from Hugging Face to fine-tune a BERT model on a raw text corpus.
You may download our pre-trained BERT model, trained on the reddit text corpus (from the Drug dataset), here. Please unzip it and put it under data/.
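For readers unfamiliar with what fine-tuning BERT on raw text involves, the sketch below illustrates the masked language modeling objective, assuming (as is standard for adapting BERT to unlabeled text) that this is the objective used. It is a plain-Python illustration of the masking rule from the original BERT recipe, not code from this repository: roughly 15% of tokens are selected for prediction, and of those 80% are replaced with a mask token, 10% with a random token, and 10% are left unchanged. The function name, the toy vocabulary, and the example sentence are all hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style masking to a list of tokens.

    Returns (inputs, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


if __name__ == "__main__":
    sentence = "we walked to the market early this morning".split()
    toy_vocab = ["garden", "river", "window", "coffee"]
    masked, targets = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(masked)
    print(targets)
```

In practice this step is handled by the training tooling (for example, the `DataCollatorForLanguageModeling` class in the Hugging Face `transformers` library), so the sketch is only meant to show what the model is trained to predict.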
2. Euphemism Detection and Euphemism Identification
python ./Main.py --dataset sample --target drug
Other tunable arguments (c1, c2, and coarse) specify different classifiers for euphemism identification. Please see Main.py for their meanings.
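If you want to launch the run from another Python script rather than from the shell, a thin wrapper along the following lines may be convenient. This is a sketch only: `build_command` and `run` are hypothetical helpers, and the only argument values confirmed by this README are `--dataset sample` and `--target drug`. Check Main.py for other accepted values before passing anything else.

```python
import subprocess
import sys


def build_command(dataset="sample", target="drug", script="./Main.py"):
    """Build the argument list equivalent to the command shown above."""
    # sys.executable is the interpreter running this script, so the child
    # process uses the same Python environment and installed dependencies.
    return [sys.executable, script, "--dataset", dataset, "--target", target]


def run(dataset="sample", target="drug"):
    """Run Main.py from the repository root and wait for it to finish."""
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    return subprocess.run(build_command(dataset, target), check=True)
```

Calling `run()` from the repository root is then equivalent to the shell command above.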
Baselines:
Please refer to baselines/README.md.
Acknowledgement
We use the code here for text classification in PyTorch.
Citation
@inproceedings{zhu2021selfsupervised,
title = {Self-Supervised Euphemism Detection and Identification for Content Moderation},
author = {Zhu, Wanzheng and Gong, Hongyu and Bansal, Rohan and Weinberg, Zachary and Christin, Nicolas and Fanti, Giulia and Bhat, Suma},
booktitle = {42nd IEEE Symposium on Security and Privacy},
year = {2021}
}