spam filtering

Open posativ opened this issue 10 years ago • 21 comments

It's surprisingly difficult to find usable general-purpose spam filter software. Here's what is currently available:

  • DSPAM – requires a daemon, dafuq?
  • CRM114 – bloated; there's a stripped-down, abandoned ANSI C implementation with Python support, but it doesn't compile on my system.
  • Bayes classifiers – most are more than 4 years old, don't have a test suite, etc.

Probably DIY: http://crm114.sourceforge.net/docs/classify_details.txt
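For a sense of how small a DIY classifier could be, here is a minimal sketch of a plain multinomial naive Bayes filter with add-one smoothing. It is much simpler than what the CRM114 document describes and is not taken from it; the class name, tokenizer, and training sentences are all made up for illustration.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over two labels, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Log prior plus the sum of smoothed log likelihoods.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / (total + vocab))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


# Toy training data (hypothetical); both labels need at least one example.
nb = NaiveBayes()
nb.train("buy cheap pills online now", "spam")
nb.train("cheap watches buy now", "spam")
nb.train("thanks for the great article", "ham")
nb.train("I disagree with the second paragraph", "ham")
```

A real filter would need persistent storage for the counts and some protection against poisoned training data, which this sketch ignores.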

posativ avatar Sep 10 '13 14:09 posativ

  • spam – BAYESIAN SPAM DETECTOR, but warn about spam -> ham poisoning

posativ avatar Oct 09 '13 16:10 posativ

You could give bogofilter a try as well.

noqqe avatar Oct 19 '13 08:10 noqqe

What about using an external service for this (like akismet)?

srijan avatar Feb 10 '14 21:02 srijan

External services can be implemented if they are not required for Isso (Akismet is a US service). Nevertheless, spam filtering should by default not rely on any third-party provider (as it defeats the purpose of self-hosting).

posativ avatar Feb 10 '14 23:02 posativ

Yeah. I meant as an option, not as default (of course).

srijan avatar Feb 10 '14 23:02 srijan

Perhaps something along the lines of this WordPress plugin: http://web-profile.com.ua/wordpress/plugins/anti-spam-pro/

It simply asks a question that you have to answer if JavaScript is disabled... otherwise you don't even notice it

Lux-Delux avatar Feb 27 '14 23:02 Lux-Delux

This plugin is quite useless as it fails when bots begin to interpret JS (which is not that hard, but requires some computation power I guess). But similar to the plugin, Isso is currently not affected by spam (my demo site, for example, does not receive spam) because most bots are not capable of evaluating JavaScript and if they do, they hopefully abort the computation because PBKDF2 takes too long.

However, a targeted attack which uses the public API might be an issue someday.
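To illustrate why PBKDF2 can slow a bot down, here is a rough sketch using Python's `hashlib.pbkdf2_hmac`. The salt, iteration count, hash function, and digest length are assumptions chosen for the example, not Isso's actual configuration; the point is only that the cost grows linearly with the iteration count.

```python
import hashlib
import time


def slow_hash(value, salt=b"hypothetical-salt", iterations=100_000):
    """Derive a short hex digest with PBKDF2-HMAC-SHA1.

    The salt and iteration count are illustrative assumptions.
    """
    raw = hashlib.pbkdf2_hmac("sha1", value.encode("utf-8"), salt, iterations, dklen=6)
    return raw.hex()


start = time.perf_counter()
digest = slow_hash("user@example.org")
elapsed = time.perf_counter() - start
print(f"{digest} computed in {elapsed * 1000:.1f} ms")
```

A single legitimate commenter barely notices this delay, while a bot posting thousands of comments pays it every time.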

posativ avatar Feb 28 '14 10:02 posativ

I’m considering Isso (or Discourse) as an option to enable comments again on my Pelican blog.

I agree with @posativ that because of the matter of trust, this should be self-hosted. What I’m still worried about is how much a spam filter would hammer my poor little ARM server.

The reason why I disabled comments (and moved from the otherwise very nice Habari) is that spammers would effectively DoS my server, since the spam filter (Bayesian + honey pot) would just consume way too much CPU time.

If spam filtering can be done in a not too expensive local way or in a distributed way that can be trusted, I would be very happy to have comments (very likely with Isso) enabled again.
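One way to keep local filtering cheap is to run the near-free checks first and only hand the survivors to the expensive classifier. The sketch below shows that ordering with a honeypot field; the field name `website_url`, the function names, and the stand-in classifier are hypothetical.

```python
def looks_like_bot(form, honeypot_field="website_url"):
    """Return True when the hidden honeypot field was filled in."""
    return bool(form.get(honeypot_field, "").strip())


def handle_comment(form, classify):
    # Cheap check first: the costly classifier only runs on submissions that pass it.
    if looks_like_bot(form):
        return "rejected"
    return "spam" if classify(form.get("text", "")) else "accepted"


calls = []  # records how often the expensive classifier actually ran


def expensive_classifier(text):
    """Stand-in for a CPU-heavy filter; here just a keyword match."""
    calls.append(text)
    return "cheap pills" in text.lower()


bot = {"text": "cheap pills", "website_url": "http://spam.example"}
human = {"text": "Nice write-up!", "website_url": ""}
```

With this ordering, `handle_comment(bot, expensive_classifier)` returns without ever invoking the classifier.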

silverhook avatar Oct 18 '14 08:10 silverhook

I am not aware of any (real) spam, neither in my personal blog nor in the demo. @noqqe reported that he didn't receive a single spam comment in over a year, too.

That's probably because Isso is still quite unknown and is rendered by JavaScript instead of being a pre-rendered HTML snippet. Or the JS interpreter of typical spam robots is broken.

posativ avatar Oct 20 '14 13:10 posativ

Hi, I'm interested in contributing spam filtering. CAPTCHA systems aren't enough since there are paid people doing manual spam, not only bots (personally, in my old blog I received a lot of spam, sometimes coming from real humans).

As for the spam filtering system, we could use "support vector machines" instead of Bayesian filters. They are relatively efficient after the training phase, so DoS attacks are unlikely to succeed.
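To make the efficiency point concrete: once a linear SVM is trained, classifying a comment is just a sparse dot product plus a bias. The sketch below assumes binary token features, and the weights and bias are invented numbers, not the output of any real training run.

```python
import re

# Hypothetical weights, as a linear SVM might produce after training.
# Positive weights push toward "spam"; the numbers are made up for illustration.
WEIGHTS = {"viagra": 2.1, "casino": 1.7, "cheap": 0.9, "thanks": -1.2, "article": -0.8}
BIAS = -0.5


def decision_value(text):
    """Linear decision function w . x + b over binary token features."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(WEIGHTS.get(tok, 0.0) for tok in tokens) + BIAS


def is_spam(text):
    return decision_value(text) > 0
```

The work per comment is proportional to the number of distinct tokens, regardless of how much data went into training.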

castarco avatar Oct 15 '15 13:10 castarco

Lately my site (running isso) has started to receive quite the onslaught of spam. I don't know how much is human-posted and how much is from JavaScript-aware automation stuff but either way, some of them are even gloating about "easiest captcha ever" in their comments, as if their comment is going to be seen by the public or would be indexed by a search engine. (Also why the hell haven't bots figured out that everyone has used rel="nofollow" for like 20 years?)

Anyway. Yeah. A plugin system would be great. I'm willing to spend some time working on one.

fluffy-critter avatar Dec 19 '19 07:12 fluffy-critter

> External services can be implemented if they are not required for Isso (Akismet is a US service). Nevertheless, spam filtering should by default not rely on any third-party provider (as it defeats the purpose of self-hosting).

Would be great to have a plugin system to integrate with third parties. As an alternative to Akismet, there is OOPSpam, which is GDPR compliant.

onaralili avatar Jul 06 '21 06:07 onaralili

> Anyway. Yeah. A plugin system would be great. I'm willing to spend some time working on one.

@fluffy-critter did you eventually come up with something?

ix5 avatar Feb 10 '22 21:02 ix5

I haven't had time/energy to work on anything, unfortunately.

fluffy-critter avatar Feb 10 '22 21:02 fluffy-critter

Hey guys: consider PoW as a simpler means of spam filtering:

  • https://git.sequentialread.com/forest/pow-captcha
  • https://mcaptcha.org/
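As a rough illustration of the proof-of-work idea, here is a generic hashcash-style sketch: the client searches for a nonce whose SHA-256 hash has a given number of leading zero bits, and the server verifies it with a single hash. This is not how pow-captcha or mCaptcha are actually implemented; the function names and the difficulty of 12 bits are assumptions.

```python
import hashlib
import itertools
import os


def leading_zero_bits(digest):
    """Count the leading zero bits of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty):
    """Server side: one hash is enough to check a submitted nonce."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge, difficulty):
    """Client side: brute-force nonces until one meets the difficulty."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


challenge = os.urandom(16)  # issued by the server per form load
nonce = solve(challenge, difficulty=12)  # about 2**12 hashes on average
```

Each extra bit of difficulty doubles the expected work for the client while verification stays at one hash.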

taoeffect avatar Feb 10 '22 21:02 taoeffect

> I haven't had time/energy to work on anything, unfortunately.

No worries, I was just curious. This is not too important anyway.


In general, a plugin API would be neat to have. Extending the signals system to trigger spam detection upon a new comment would be my idea.
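A sketch of what hooking spam detection into a signal could look like is shown here. The `Signal` class, the event name `comment.before-save`, and the keyword filter are hypothetical stand-ins and do not mirror Isso's real signal names or API.

```python
from collections import defaultdict


class Signal:
    """Tiny publish/subscribe registry; a stand-in, not Isso's actual class."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event, func):
        self.subscribers[event].append(func)

    def emit(self, event, *args):
        for func in self.subscribers[event]:
            func(*args)


class SpamDetected(Exception):
    """Raised by a subscriber to veto a comment."""


def keyword_filter(comment):
    # Placeholder plugin; a real one could call any classifier or service.
    if "cheap pills" in comment["text"].lower():
        raise SpamDetected("blocked by keyword filter")


signals = Signal()
signals.subscribe("comment.before-save", keyword_filter)


def save_comment(comment, store):
    """Emit the signal before saving; any subscriber may veto the comment."""
    try:
        signals.emit("comment.before-save", comment)
    except SpamDetected:
        return False
    store.append(comment)
    return True
```

Plugins would then only need to register a subscriber, and the core would not have to know which filters are installed.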

ix5 avatar Feb 10 '22 21:02 ix5

> Hey guys: consider PoW as a simpler means of spam filtering:
>
>   • https://git.sequentialread.com/forest/pow-captcha
>   • https://mcaptcha.org/

Those two look interesting, but heads up, they require modern browsers and wasm support.

ix5 avatar Feb 10 '22 21:02 ix5

Also I'm not sure what problem that actually solves, beyond making sure someone's idle on a page for a certain amount of time before they submit. Most of the spam I get appears to be submitted by humans who are paid money to defeat CAPTCHAs, as has been the case with most comment spam for at least the past decade.

fluffy-critter avatar Feb 10 '22 21:02 fluffy-critter

It would cut down on some spam, probably not all. You could most likely increase the difficulty in the settings, etc. It's simple to set up. There are tradeoffs with everything. If you really wanted to prevent all spam (and also some legitimate comments), you could charge micropayments over the lightning network. :P

taoeffect avatar Feb 10 '22 21:02 taoeffect

Hi, I just heard about this project and it looks nice! In the past I made my own project that was similar and I created the git.sequentialread.com/forest/pow-captcha for it. If you would like to try it out, it's hosted here:

https://sequentialread.com/now-with-comments/#sqr-comment-container

I have also seen spam from humans. There were some SEO spammers trying to register accounts on our Gitea server and post links to their clients' businesses. We were not able to stop them until we implemented a required invite token for registration :(

To be honest I'm not sure what to do about that kind of spam besides putting the comments into a moderation queue and having someone look at them.

My "pow-captcha" is not actually a captcha at all, I think it's just a bot deterrent. Unfortunately it's also a deterrent for people who run customized browsers with anti-fingerprinting or new features disabled. I use it as a bot deterrent for other things and I have seen my friends blocked by it because the privacy browser they use on their phone would not allow WebWorkers etc :(

I think unfortunately spam is always going to be impossible to stop automatically with high accuracy. Maybe for the next version of my site I will try out isso for comments but make it skip the moderation queue if the browser was able to solve the pow challenge.

> making sure someone's idle on a page for a certain amount of time before they submit.

The animated gif on the ReadMe is sort of intentionally slowed down / it was recorded from a higher difficulty setting than the one I use on my site. I think you would have to pull out a cell phone from 8-10 years ago to see it go that slow on my site. On a new computer it's so fast that it can barely even render the progress bar before it's done.

ForestJohnson avatar Nov 30 '23 08:11 ForestJohnson

Yeah there's no way to automatically get rid of all spam, but it's still nice to have a means of being able to classify things to apply different moderation policies to them, and possibly be able to specifically whitelist known-good posters so their comments go up immediately.
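The idea of classifying comments and applying different moderation policies could be sketched roughly as follows. The thresholds, the `solved_pow` flag, and the allowlist are illustrative assumptions, not an existing Isso feature.

```python
from enum import Enum


class Action(Enum):
    PUBLISH = "publish"
    QUEUE = "queue"
    REJECT = "reject"


def moderation_action(author, spam_score, solved_pow, allowlist,
                      reject_above=0.9, queue_above=0.5):
    """Map classifier output and other signals to a moderation policy.

    The thresholds are hypothetical and would need tuning per site.
    """
    if author in allowlist:
        return Action.PUBLISH  # known-good posters go up immediately
    if spam_score >= reject_above:
        return Action.REJECT
    if spam_score >= queue_above or not solved_pow:
        return Action.QUEUE  # uncertain cases wait for a human
    return Action.PUBLISH


trusted = {"alice@example.org"}
```

For example, `moderation_action("bob@example.org", 0.1, False, trusted)` sends the comment to the queue because the proof-of-work challenge was not solved, even though the spam score is low.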

fluffy-critter avatar Nov 30 '23 08:11 fluffy-critter