SmokeDetector
Add reason post is likely nonsense
This PR attempts to detect vandalism and gibberish by calculating the informational entropy of a given text. The constants used in this PR are conservative.
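For reference, here is a minimal sketch of a per-character Shannon entropy check. The function name and exact normalization are assumptions for illustration; the PR's actual constants and formula may differ.

```python
from collections import Counter
from math import log2

def entropy_per_char(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character.
    Hypothetical sketch; the PR's exact normalization may differ."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in Counter(text).values())
```

Highly repetitive text scores low under this definition, while varied text scores higher.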
What's the standard deviation of entropy-per-char? 3.0 seems quite close to 2.6 to me, but it depends - that might be super unlikely or relatively likely; the standard deviation will reveal which it is.
Well, entropy per char for English + space IIRC is ~21...
That's... not what I'm asking. You have a comment in here that says "Average entropy per char in English is 2.6". If that's the average, what's the stddev?
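To make the question concrete: how significant the gap between the 3.0 threshold and the claimed 2.6 mean is depends entirely on the spread. A quick illustration (the stddev values below are made up, not measured):

```python
# How many standard deviations above the mean the threshold sits,
# for a few illustrative (made-up) stddev values.
mean, threshold = 2.6, 3.0
for stddev in (0.05, 0.2, 0.4):
    z = (threshold - mean) / stddev
    print(f"stddev={stddev}: threshold is {z:.1f} sigma above the mean")
```

With a stddev of 0.05 the threshold is 8 sigma out (almost nothing is caught); with 0.4 it is only 1 sigma out (lots of legitimate posts would be caught).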
The entropy values here are broken; every post gets caught.
Legit posts:
- “I have seen the discussion about the Turkish Airlines COVID Cabin policy which makes little sense. Regardless though, does anyone know if they are enforcing it? I, like many, will be transferring at a European Airport on two tickets issued separately. I can't check my luggage all the way through from Istanbul to Malaga and can't exit customs to collect the luggage in Brussels (without a forced quarantine or denied entry)” - entropy per char of 0.2332
- “Why all Indian rupee notes are accepted in Nepal and Bhutan, except 500 Rs and 1000 Rs? Why spare those two notes?” - entropy per char of 0.2485
Gibberish posts:
- “this this this this spamd dshdshdshdshhds” - entropy per char of 0.4045
- “test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test” - entropy per char of 0.4898
- “gibberish dshdshdsaasdlaf,afdasfkkdafkafdkdfkfdskdsfksdkfksd.fk.sdfksfdkfk.fsdk.sdfksfdkfsdkfsdk” - entropy per char of 0.3817
- “dfahdfhdsfsfkdjjsldfksdflfsdlkjfldskjlkfdsjklsfd/jldsfjlsdfjsdfakjsdafjkfsjkldfsaklfdklsalkfdsaklsdadsaklasfdkldfsaljkdfslkaldsfkdslflfsddfskjsdfllsfdaladsflfdsalsdalfksdklafkdsafdsafdsaklkldfsakfdkslaflklfdsakldfsalfdldsaflkasfldjkdfslsdklaflkdsfakjlfsadkljfsakljafsdlkjsdfjjfsdljasdladsfljkfdsjldfsjldsfjsdf” - entropy per char of 0.4025
I think the entropy values need to be adjusted according to these results.
A stat with 12405 fp posts on MS
>>> statistics.mean(result)
0.20483261275004847
>>> statistics.median(result)
0.20223865427238322
>>> statistics.stdev(result)
0.031230117152319384
So yes, I managed to mess up the decimal point.
Note: fp is defined as:
>>> def is_fp(post):
...     fp_count = 0
...     tp_count = 0
...     for fb in post['feedback']:
...         if fb[1].startswith("f"):
...             fp_count += 1
...         elif fb[1].startswith("t"):
...             tp_count += 1
...     return (fp_count - tp_count > 1) or ((fp_count > 0) and (tp_count == 0))
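Restating that definition as a self-contained script with a quick sanity check (the shape of the feedback tuples is assumed from the snippet above):

```python
def is_fp(post):
    # A post counts as fp if f-feedback outnumbers t-feedback by more than one,
    # or if it has any f-feedback and no t-feedback at all.
    fp_count = 0
    tp_count = 0
    for fb in post['feedback']:
        if fb[1].startswith("f"):
            fp_count += 1
        elif fb[1].startswith("t"):
            tp_count += 1
    return (fp_count - tp_count > 1) or ((fp_count > 0) and (tp_count == 0))

# e.g. two fps against one tp is NOT counted as fp under this rule:
print(is_fp({'feedback': [(1, 'fp'), (2, 'fp'), (3, 'tp')]}))  # False
```

Note the rule is deliberately strict: a single fp vote plus any tp vote keeps the post out of the fp set.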
This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.
A stat with 12405 fp posts on MS
That's a lot... unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.
Well, what I did was take many fp posts out of the MS records and analyze them; it's not that this reason will result in those fps.
@user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?
W.r.t. results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.
However, I believe that some test sessions have been run and the fp rate is low.
W.r.t. results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.
Not necessarily in this case — most false-positives on MS are normal English posts, which are what we want to avoid catching.
In that case, I think this is ready for merge (cc @ArtofCode-). If we run into problems we can always revert it.
I ran some more tests today. It's looking a lot better, but we still have problems with:
- Code. We should probably strip code blocks, but even then we'll still have a lot of fps from posts with unformatted code.
- Posts with lots of un-rendered whitespace. IMO we really should collapse repeated whitespace characters.
- Those constants don't seem to be conservative enough; e.g. https://english.stackexchange.com/a/408724/106362 and https://hermeneutics.stackexchange.com/a/51104 are both caught, with entropies of 4.0233 and 5.6742 respectively.
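The first two items could be handled by a preprocessing pass before the entropy is computed. A hypothetical sketch (the function name and regexes are assumptions, not what the PR currently does):

```python
import re

def preprocess(markdown: str) -> str:
    # Strip fenced code blocks so code doesn't skew the entropy statistics.
    text = re.sub(r"```.*?```", " ", markdown, flags=re.DOTALL)
    # Strip indented code blocks (lines starting with 4+ spaces).
    text = re.sub(r"(?m)^ {4}.*$", " ", text)
    # Collapse runs of whitespace so padded posts don't distort the stats.
    return re.sub(r"\s+", " ", text).strip()
```

Unformatted code pasted as plain text would still slip through this, as noted above.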