
false positive

Open ghost opened this issue 2 years ago • 5 comments

wtf? For some reason this message is flagged as toxic: "who selling lup pots". Can you fix this? I'm using the original data set.

ghost · Jul 09 '22 16:07

Thanks for reporting this example. If you notice any pattern in the examples the models flag falsely as toxic, it would be very useful if you could share it. In order for us to improve the models, the following information would be useful:

  • the type of model you ran
  • the data you ran it on
  • false positive / false negative examples and patterns (grammar, topic, etc.) you noticed (a minimal way to collect this is sketched below the list)
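
For instance, something along these lines collects all three pieces of information at once. This is only a sketch against the public detoxify API; the model name and input texts below are placeholders to swap for your own case.

from detoxify import Detoxify

# Placeholder report: which model variant was used, which inputs were
# mis-flagged, and the raw toxicity scores for those inputs.
model_type = 'original'           # or 'unbiased' / 'multilingual'
texts = ["who selling lup pots"]  # the inputs that were flagged falsely

scores = Detoxify(model_type).predict(texts)
for text, toxicity in zip(texts, scores['toxicity']):
    print(f"{model_type} | {text!r} | toxicity={toxicity:.2f}")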

anitavero · Jul 12 '22 17:07

Hey, I have also come across this false positive issue. I was working with a model to detect offensive text in a given dataset. For example, I had a few records containing the string 'Shital', which is a name, not an offensive word. Some of those records were being classified as toxic while the rest were non-toxic. The same was the case with records containing the word 'Nishit', which is also a name.

I tried to find a pattern explaining why some records were classified as toxic and the rest as non-toxic, but there was nothing to be noticed.

Let me know if there is any workaround you have come up with, or whether you are working on one.

smasterparth · Jul 14 '22 20:07

It matters a lot which version of the model you use: "original", "unbiased" or "multilingual".

  • For me, "unbiased" solves the problem with "Nishit" and mildly mitigates the one with "Shital" (although its score is still high).
  • "who selling lup pots" isn't flagged as toxic for me by any of the models.
import pandas as pd

from detoxify import Detoxify

input_text = ['Shital', 'Nishit', "who selling lup pots"]

# Load each model variant once
model_u = Detoxify('unbiased')
model_o = Detoxify('original')
model_m = Detoxify('multilingual')

results_u = model_u.predict(input_text)
results_o = model_o.predict(input_text)
results_m = model_m.predict(input_text)

# Print per-label scores for each input, rounded to two decimals
print("Original", pd.DataFrame(results_o, index=input_text).round(2))
print("Multilingual", pd.DataFrame(results_m, index=input_text).round(2))
print("Unbiased", pd.DataFrame(results_u, index=input_text).round(2))

This outputs:

Original
                      toxicity  severe_toxicity  obscene  threat  insult  identity_attack
Shital                    0.82             0.01     0.57    0.00    0.05             0.00
Nishit                    0.71             0.04     0.52    0.01    0.39             0.24
who selling lup pots      0.00             0.00     0.00    0.00    0.00             0.00

Multilingual
                      toxicity  severe_toxicity  obscene  identity_attack  insult  threat  sexual_explicit
Shital                    0.82             0.00     0.54             0.00    0.41    0.00             0.01
Nishit                    0.87             0.01     0.82             0.00    0.14    0.00             0.02
who selling lup pots      0.01             0.00     0.00             0.00    0.00    0.00             0.00

Unbiased
                      toxicity  severe_toxicity  obscene  identity_attack  insult  threat  sexual_explicit
Shital                    0.67             0.00     0.21             0.00    0.03    0.00             0.52
Nishit                    0.06             0.00     0.01             0.00    0.01    0.00             0.00
who selling lup pots      0.01             0.00     0.00             0.00    0.00    0.00             0.00

Let us know if you find any other issues! If you could attach model outputs similar to the one above, that would be really helpful!

anitavero · Jul 15 '22 13:07

@anitavero the original model also outputs a very high false-positive toxicity score for the following text: "They had great sex!"

{'toxicity': 0.88951826, 'severe_toxicity': 0.0110040745, 'obscene': 0.4631456, 'threat': 0.0027411387, 'insult': 0.021174002, 'identity_attack': 0.0034398066}
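
A check like this can be reproduced in a couple of lines; this is only a sketch, and the exact scores may differ slightly across detoxify and transformers versions. Comparing against the "unbiased" checkpoint also shows whether the variant choice matters here as well.

from detoxify import Detoxify

# Score the reported sentence with the 'original' and 'unbiased' variants.
sentence = "They had great sex!"
for variant in ('original', 'unbiased'):
    scores = Detoxify(variant).predict([sentence])
    print(variant, {label: round(values[0], 3) for label, values in scores.items()})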

ogencoglu · Oct 09 '23 11:10

Also for this one: "Sucking power of this vacuum cleaner is great!"

ogencoglu · Oct 22 '23 15:10