SmokeDetector
Add reason post is likely nonsense
This PR attempts to detect vandalism and gibberish by calculating the informational entropy of a given text. The constants used in this PR are conservative.
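For reference, here is a minimal sketch of a per-character Shannon entropy check. The function name and exact normalization are assumptions for illustration; the PR's actual constants and formula may differ.

```python
from collections import Counter
from math import log2

def entropy_per_char(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character.
    Hypothetical sketch; the PR's exact normalization may differ."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in Counter(text).values())
```

Highly repetitive text scores low under this definition, while varied text scores higher.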
What's the standard deviation of entropy-per-char? 3.0 seems quite close to 2.6 to me, but it depends - that might be super unlikely or relatively likely; the standard deviation will reveal which it is.
Well, entropy per char for English + space IIRC is ~21...
That's... not what I'm asking. You have a comment in here that says "Average entropy per char in English is 2.6". If that's the average, what's the stddev?
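To make the question concrete: how significant the gap between the 3.0 threshold and the claimed 2.6 mean is depends entirely on the spread. A quick illustration (the stddev values below are made up, not measured):

```python
# How many standard deviations above the mean the threshold sits,
# for a few illustrative (made-up) stddev values.
mean, threshold = 2.6, 3.0
for stddev in (0.05, 0.2, 0.4):
    z = (threshold - mean) / stddev
    print(f"stddev={stddev}: threshold is {z:.1f} sigma above the mean")
```

With a stddev of 0.05 the threshold is 8 sigma out (almost nothing is caught); with 0.4 it is only 1 sigma out (lots of legitimate posts would be caught).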
The entropy values here are broken; every post gets caught.
Legit posts:
- “I have seen the discussion about the Turkish Airlines COVID Cabin policy which makes little sense. Regardless though, does anyone know if they are enforcing it? I, like many, will be transferring at a European Airport on two tickets issued separately. I can't check my luggage all the way through from Istanbul to Malaga and can't exit customs to collect the luggage in Brussels (without a forced quarantine or denied entry)” - entropy per char of 0.2332
- “Why all Indian rupee notes are accepted in Nepal and Bhutan, except 500 Rs and 1000 Rs? Why spare those two notes?” - entropy per char of 0.2485
Gibberish posts:
- “this this this this spamd dshdshdshdshhds” - entropy per char of 0.4045
- “test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test” - entropy per char of 0.4898
- “gibberish dshdshdsaasdlaf,afdasfkkdafkafdkdfkfdskdsfksdkfksd.fk.sdfksfdkfk.fsdk.sdfksfdkfsdkfsdk” - entropy per char of 0.3817
- “dfahdfhdsfsfkdjjsldfksdflfsdlkjfldskjlkfdsjklsfd/jldsfjlsdfjsdfakjsdafjkfsjkldfsaklfdklsalkfdsaklsdadsaklasfdkldfsaljkdfslkaldsfkdslflfsddfskjsdfllsfdaladsflfdsalsdalfksdklafkdsafdsafdsaklkldfsakfdkslaflklfdsakldfsalfdldsaflkasfldjkdfslsdklaflkdsfakjlfsadkljfsakljafsdlkjsdfjjfsdljasdladsfljkfdsjldfsjldsfjsdf” - entropy per char of 0.4025
I think the entropy values need to be adjusted according to these results.
A stat with 12405 fp posts on MS
>>> statistics.mean(result)
0.20483261275004847
>>> statistics.median(result)
0.20223865427238322
>>> statistics.stdev(result)
0.031230117152319384
So yes, I managed to mess up the decimal point.
Note: fp is defined as:
>>> def is_fp(post):
...     fp_count = 0
...     tp_count = 0
...     for fb in post['feedback']:
...         if fb[1].startswith("f"):
...             fp_count += 1
...         elif fb[1].startswith("t"):
...             tp_count += 1
...     return (fp_count - tp_count > 1) or ((fp_count > 0) and (tp_count == 0))
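Restating that definition as a self-contained script with a quick sanity check (the shape of the feedback tuples is assumed from the snippet above):

```python
def is_fp(post):
    # A post counts as fp if f-feedback outnumbers t-feedback by more than one,
    # or if it has any f-feedback and no t-feedback at all.
    fp_count = 0
    tp_count = 0
    for fb in post['feedback']:
        if fb[1].startswith("f"):
            fp_count += 1
        elif fb[1].startswith("t"):
            tp_count += 1
    return (fp_count - tp_count > 1) or ((fp_count > 0) and (tp_count == 0))

# e.g. two fps against one tp is NOT counted as fp under this rule:
print(is_fp({'feedback': [(1, 'fp'), (2, 'fp'), (3, 'tp')]}))  # False
```

Note the rule is deliberately strict: a single fp vote plus any tp vote keeps the post out of the fp set.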
This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.
A stat with 12405 fp posts on MS
That's a lot... unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.
Well, what I did was take many fp posts out of the MS records and analyze them; it's not that this reason will result in those fps.
@user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?
W.r.t. results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.
However, I believe that some test sessions have been run and the fp rate is low.
W.r.t. results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.
Not necessarily in this case — most false-positives on MS are normal English posts, which are what we want to avoid catching.
In that case, I think this is ready for merge (cc @ArtofCode-). If we run into problems we can always revert it.
I ran some more tests today. It's looking a lot better, but we still have problems with:
- Code. We should probably strip code blocks, but even then we'll still have a lot of fps from posts with unformatted code.
- Posts with lots of un-rendered whitespace. IMO we really should collapse repeated whitespace characters.
- Those constants don't seem to be conservative enough; e.g. https://english.stackexchange.com/a/408724/106362 and https://hermeneutics.stackexchange.com/a/51104 are both caught, with entropies of 4.0233 and 5.6742 respectively.
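The first two items could be handled by a preprocessing pass before the entropy is computed. A hypothetical sketch (the function name and regexes are assumptions, not what the PR currently does):

```python
import re

def preprocess(markdown: str) -> str:
    # Strip fenced code blocks so code doesn't skew the entropy statistics.
    text = re.sub(r"```.*?```", " ", markdown, flags=re.DOTALL)
    # Strip indented code blocks (lines starting with 4+ spaces).
    text = re.sub(r"(?m)^ {4}.*$", " ", text)
    # Collapse runs of whitespace so padded posts don't distort the stats.
    return re.sub(r"\s+", " ", text).strip()
```

Unformatted code pasted as plain text would still slip through this, as noted above.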