Self-Supervised Euphemism Detection and Identification for Content Moderation, IEEE S&P (Oakland) 2021

Python 3.7 License: MIT

This repo is the Python 3 implementation of Self-Supervised Euphemism Detection and Identification for Content Moderation (42nd IEEE Symposium on Security and Privacy 2021).

Table of Contents

  • Introduction
  • Requirements
  • Data
  • Code
  • Acknowledgement
  • Citation

Introduction

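This project addresses two tasks: euphemism detection, which finds words that are used as euphemisms for a given set of target keywords, and euphemism identification, which determines what each detected euphemism refers to.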

Requirements

The code is based on Python 3.7. Install the dependencies as follows:

pip install -r requirements.txt

Data

Due to licensing restrictions, we do not distribute the datasets ourselves; instead, we direct readers to their respective sources.

Drug:

Weapon:

Sexuality:

Sample:

  • Raw Text Corpus: we provide a sample dataset data/sample.txt for the readers to run the code.
  • Ground Truth: same as the Drug dataset (see data/euphemism_answer_drug.txt and data/target_keywords_drug.txt).
  • The sample dataset is only meant for trying out the code; results obtained on it are not reliable.
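Before running the code, it can help to confirm that the sample files are where the README says they are. The snippet below is a minimal sketch that prints the first few lines of each file without assuming anything about the file format; the helper name preview and the line count are illustrative choices, not part of this repository.

```python
from itertools import islice
from pathlib import Path

# File paths as listed in the Data section of this README.
SAMPLE_FILES = [
    "data/sample.txt",
    "data/euphemism_answer_drug.txt",
    "data/target_keywords_drug.txt",
]


def preview(path, n=3):
    """Return the first n lines of a text file, or None if the file is missing.

    Hypothetical helper: it makes no assumption about the file format.
    """
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]


if __name__ == "__main__":
    for name in SAMPLE_FILES:
        lines = preview(name)
        if lines is None:
            print(f"{name}: missing")
        else:
            print(f"{name}: first {len(lines)} line(s)")
            for line in lines:
                print("   ", line)
```

Run it from the repository root so that the relative data/ paths resolve.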

Code

1. Fine-tune the BERT model.

Please refer to this link from Hugging Face to fine-tune a BERT model on a raw text corpus.

You may download our pre-trained BERT model, trained on the Reddit text corpus (from the Drug dataset), here. Unzip it and place it under data/.
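Once a fine-tuned model is unpacked under data/, one quick way to sanity-check it is a fill-mask query: mask a known keyword in a sentence and look at which words the model proposes for that position. The sketch below is not taken from Main.py. The helper names mask_keyword and top_candidates are hypothetical, the model directory is left as an argument because its name depends on the archive you unpacked, and the example assumes the transformers package is installed.

```python
import re

MASK_TOKEN = "[MASK]"  # mask token used by BERT tokenizers


def mask_keyword(sentence, keyword, mask_token=MASK_TOKEN):
    """Replace the first whole-word occurrence of keyword with the mask token.

    Matching is case-insensitive; the sentence is returned unchanged
    if the keyword does not occur.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(mask_token, sentence, count=1)


def top_candidates(model_dir, masked_sentence, top_k=10):
    """Return (token, score) pairs the model proposes for the masked position.

    model_dir is assumed to be a directory containing a Hugging Face
    masked-language model. The sentence must contain exactly one mask token.
    """
    # Imported lazily so that mask_keyword works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_dir)
    predictions = fill_mask(masked_sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For example, top_candidates(model_dir, mask_keyword("They sell marijuana online.", "marijuana")) should return plausible fillers for the masked slot if the model loaded correctly.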

2. Euphemism Detection and Euphemism Identification

python ./Main.py --dataset sample --target drug  

Other tunable arguments (c1, c2, and coarse) select different classifiers for euphemism identification. See Main.py for their meanings.
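If you want to launch the command above from a script, for instance to repeat a run with different settings, a thin wrapper is enough. This is a minimal sketch: build_command and run are hypothetical helpers, only the --dataset and --target flags shown above are used, and any extra arguments are passed through verbatim, so check Main.py for the flags and values it actually accepts.

```python
import subprocess
import sys


def build_command(dataset, target, extra_args=None):
    """Build the argument list for Main.py using the current Python interpreter."""
    cmd = [sys.executable, "./Main.py", "--dataset", dataset, "--target", target]
    if extra_args:
        # Passed through unchanged; validity is up to Main.py.
        cmd.extend(extra_args)
    return cmd


def run(dataset, target, extra_args=None):
    """Run Main.py and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(dataset, target, extra_args), check=True)


if __name__ == "__main__":
    # Print the command equivalent to the one shown in this README.
    print(" ".join(build_command("sample", "drug")))
```

Call run("sample", "drug") from the repository root to execute the same run as the command line above.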

Baselines:

Please refer to baselines/README.md.

Acknowledgement

We use the code here for text classification in PyTorch.

Citation

@inproceedings{zhu2021selfsupervised,
    title = {Self-Supervised Euphemism Detection and Identification for Content Moderation},
    author = {Zhu, Wanzheng and Gong, Hongyu and Bansal, Rohan and Weinberg, Zachary and Christin, Nicolas and Fanti, Giulia and Bhat, Suma},
    booktitle = {42nd IEEE Symposium on Security and Privacy},
    year = {2021}
}