
Detoxify doesn't work well on Emojis

laurahanu opened this issue 4 years ago · 1 comment

Currently, none of the Detoxify models seem to recognize emojis that are meant to be toxic/hateful, whether in context or on their own (#26). While the BERT tokenizer returns the same output for different emojis, RoBERTa-based tokenizers do seem to differentiate between different emoji inputs.

Some potential solutions:

  • replacement method (fast): use an emoji library (e.g. demoji) and replace emojis with their text descriptions (e.g. 🖕 -> 'middle finger'). While this would work in some cases (when emojis are used with their literal meaning), there will be cases where the description doesn't make the intended meaning any clearer, e.g. drug- or sexually-related emojis. We would also need to be careful about how/when emojis are used as keywords (we could check for key emojis first and then replace).
  • training method (slow): train models to recognise various emojis in different contexts; this might also emerge naturally from training on lots of data containing emojis. It might work for common use cases, but less well for lesser-used emojis. It would not work with the BERT tokenizer.
  • hybrid method: train on emoji descriptions directly and replace emojis with their descriptions at inference time.
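The replacement method can be sketched with the standard library alone. The helper name and the use of full Unicode character names are assumptions here; a dedicated library like demoji (mentioned above, via `demoji.replace_with_desc`) would give shorter, friendlier descriptions:

```python
import unicodedata


def replace_emojis_with_desc(text: str) -> str:
    """Replace emoji characters with their lowercase Unicode names.

    A stdlib stand-in for a dedicated emoji library: Unicode names are
    wordier than demoji's descriptions, but the idea is the same.
    """
    parts = []
    for ch in text:
        # The "Symbol, other" (So) category covers most emoji code points.
        if unicodedata.category(ch) == "So":
            try:
                parts.append(f" {unicodedata.name(ch).lower()} ")
            except ValueError:
                parts.append(ch)  # unnamed code point: keep it unchanged
        else:
            parts.append(ch)
    # Collapse the extra spaces introduced around replacements
    # (note this also normalizes any other runs of whitespace).
    return " ".join("".join(parts).split())
```

The rewritten text could then be scored as usual, e.g. `Detoxify('original').predict(replace_emojis_with_desc(text))`, so the model sees "middle finger" where it would otherwise see an opaque emoji token.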

To dos:

  • [x] investigate how well the replacement method works on a dataset like Hatemoji
  • [ ] finetune Detoxify with Hatemoji train set and compare

laurahanu avatar Aug 23 '21 10:08 laurahanu

Detoxify results on the Hatemoji test set. Random-chance accuracy: 0.561473; majority-class accuracy (always guessing 1 in this case): 0.675318.

  • without emoji replacement

|                   | original | unbiased | multilingual |
|-------------------|----------|----------|--------------|
| f1                | 0.365546 | 0.391728 | 0.643069     |
| accuracy          | 0.462087 | 0.476081 | 0.60229      |
| precision         | 0.89823  | 0.906977 | 0.816232     |
| recall            | 0.229465 | 0.249812 | 0.53052      |
| average_precision | 0.726469 | 0.733189 | 0.750076     |
  • with emoji replacement

|                   | original | unbiased | multilingual |
|-------------------|----------|----------|--------------|
| f1                | 0.432323 | 0.48832  | 0.727654     |
| accuracy          | 0.499491 | 0.531807 | 0.66972      |
| precision         | 0.923551 | 0.932059 | 0.821023     |
| recall            | 0.282216 | 0.330821 | 0.653353     |
| average_precision | 0.745373 | 0.760254 | 0.770515     |

*Used the identity_hate scores for original and unbiased, and the toxicity scores for multilingual, since toxicity is the only label the multilingual model was trained on.
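One consistent reading of the two baselines above (an assumption, but it checks out numerically): the majority-class figure is the positive-class rate p itself, and "random chance" is the accuracy of predicting by sampling the label distribution, p² + (1 − p)². A quick check:

```python
# Positive-class rate on the Hatemoji test set, read off the
# majority-class baseline above (always guessing 1).
p = 0.675318

# Always predicting the majority label scores its prevalence.
majority_class = max(p, 1 - p)

# Guessing a label drawn from the class distribution matches the
# true label with probability p^2 + (1 - p)^2.
random_chance = p**2 + (1 - p)**2  # ~0.561473, matching the quoted figure
```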

laurahanu avatar Aug 23 '21 13:08 laurahanu