Collect spam data in a smart way

Open schlessera opened this issue 7 years ago • 15 comments

Anonymously collect non-detected spam comments.

What data to collect:

Comments that were not detected as spam and for which the site user manually clicked the "Spam" button.

When to collect:

When the site user first clicks this "Spam" button, we should ask the permission to anonymously send the comment data to a centralized database, in order to improve Antispam Bee.

How to collect:

At first, send to an HTTPS endpoint that stores everything in a simple database (probably NoSQL). We may need to evaluate a more scalable solution in the future. The collected data must not contain any mention of the sender (the reporting site) or any information about their user or system. It should contain as much information as possible about the actual spam content and where it originated.
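The payload rule above could be sketched with a plain whitelist: only fields that describe the spam comment and its origin are kept, and nothing about the reporting site or its users is ever added. This is a minimal sketch, not Antispam Bee code. The function name `asb_sketch_build_report` is hypothetical, and the field names assume the standard WordPress comment columns.

```php
<?php
/**
 * Hypothetical sketch: reduce a comment (as an array) to reportable fields.
 *
 * @param array $comment Comment data keyed by WordPress column name.
 * @return array Only the whitelisted fields that were present.
 */
function asb_sketch_build_report( array $comment ): array {
	// Fields describing the spam itself and where it came from.
	$allowed = array(
		'comment_author',
		'comment_author_email',
		'comment_author_url',
		'comment_author_IP',
		'comment_agent',
		'comment_content',
	);

	// Everything not whitelisted is dropped, e.g. the post ID or the
	// ID of a logged-in user on the reporting site.
	return array_intersect_key( $comment, array_flip( $allowed ) );
}

$report = asb_sketch_build_report(
	array(
		'comment_ID'        => '42',
		'comment_post_ID'   => '7',
		'user_id'           => '1',
		'comment_author'    => 'Cheap Pills',
		'comment_author_IP' => '203.0.113.5',
		'comment_content'   => 'Buy now!',
	)
);

// JSON body that could be POSTed to the (not yet existing) HTTPS endpoint.
echo json_encode( $report ), "\n";
```

A whitelist fails closed: if WordPress or another plugin adds new comment fields later, they are not sent unless someone adds them to the list on purpose.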

schlessera avatar Jan 29 '17 11:01 schlessera

Are we, or the user, allowed to do that? Are there copyright or other legal issues? And if we do it every time, there could be some false positives: someone might mark a user as spam simply because they dislike them, or a comment might have been posted several times through misclicks, caching issues, etc.

But I agree: if we are allowed to (no legal issues), then we should make it simpler to submit spam.

timse201 avatar Jan 29 '17 15:01 timse201

I think it's a great idea. We should also include false positives.

I do not really see legal issues. In my understanding, if someone posts a comment, they give the website owner the right to publish it. But honestly, I do not know how far this right can be stretched.

there could be some false positives

Yes, but right now we have the same issue with our Google document. I think it's worth a shot.

There should be an option in the settings (always send, never send), perhaps not instead of but in addition to the question "Do you want to send this specific comment?", to guarantee a quicker workflow.

websupporter avatar Jan 29 '17 15:01 websupporter

An alternative would be to add a separate button beside the Spam and Trash buttons, something like "Send for Analysis". If they just want to get rid of their uninteresting newsletters, they will probably not click on "Send for Analysis" for these...

schlessera avatar Jan 30 '17 08:01 schlessera

And, yes, the original idea was to ask for permission once on clicking Spam and then have this be the new default.

schlessera avatar Jan 30 '17 08:01 schlessera

We could use the comment status transition hooks comment_unapproved_to_spam and comment_approved_to_spam, or we could provide a button / action link for this.

Possible problems: privacy concerns, because personal data from comments (IP, email, content, etc.) is submitted to us (or to a third-party service like Google Forms).

This feature needs consent from the user: https://developer.wordpress.org/plugins/wordpress-org/detailed-plugin-guidelines/#7-plugins-may-not-track-users-without-their-consent
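A rough sketch of the hook-based variant mentioned above. WordPress fires `comment_{$old_status}_to_{$new_status}` with the comment object when a comment changes status, so both hooks can share one callback. The callback name and the `asb_sketch_pending` list are made up for illustration. In line with the consent requirement, the sketch only remembers comment IDs and sends nothing.

```php
<?php
// Hypothetical in-memory list of comment IDs waiting for user consent.
$GLOBALS['asb_sketch_pending'] = array();

/**
 * Hypothetical callback: remember a comment that was just marked as spam.
 *
 * @param object $comment Comment object (WP_Comment inside WordPress).
 */
function asb_sketch_on_marked_spam( $comment ) {
	// Collect the ID only; nothing leaves the site at this point.
	$GLOBALS['asb_sketch_pending'][] = (int) $comment->comment_ID;
}

// The guard lets this file load outside WordPress, where add_action()
// does not exist.
if ( function_exists( 'add_action' ) ) {
	add_action( 'comment_unapproved_to_spam', 'asb_sketch_on_marked_spam' );
	add_action( 'comment_approved_to_spam', 'asb_sketch_on_marked_spam' );
}
```

A real implementation would have to persist the pending IDs, since the consent prompt appears on a later page load than the status change.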

Zodiac1978 avatar Apr 12 '20 10:04 Zodiac1978

In my opinion, the best way to collect non-detected spam would be to add a link alongside "Mark as spam", something like "Report to Antispam Bee". When a user clicks that link, they'll have to confirm that they are about to disclose the comment and its metadata to the ASB team for further investigation and to improve ASB's filters before it's sent.

[Screenshot, 2020-04-12 12:44:32]

krafit avatar Apr 12 '20 10:04 krafit

To get even more data, we could use the action hooks when someone marks a comment as spam and then ask whether the data may be sent (like PoEdit does): [Screenshot, 2020-04-12 12:49:22]

With an opportunity to opt-in to have this as the default.

Zodiac1978 avatar Apr 12 '20 10:04 Zodiac1978

I thought about an opt-in, but I didn't like the privacy implications of having this as a default for everyone after someone opted in. But we could handle the opt-in the way PoEdit does, by handling it on a per-user basis. This way every user has the opportunity to give informed consent before sharing data (for the first time).
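The per-user opt-in could look roughly like this: each user has their own stored preference, and a user who has never been asked always gets the prompt first. This is only a sketch. The meta key `asb_sketch_report_consent` and both function names are invented for illustration. `get_user_meta()` and `get_current_user_id()` are regular WordPress functions.

```php
<?php
/**
 * Hypothetical sketch: map one user's stored preference to an action.
 *
 * @param string $stored 'always', 'never', or '' if the user was never asked.
 * @return string 'send', 'skip', or 'ask'.
 */
function asb_sketch_consent_action( string $stored ): string {
	switch ( $stored ) {
		case 'always':
			return 'send';
		case 'never':
			return 'skip';
		default:
			// No (or unknown) preference: ask for informed consent first.
			return 'ask';
	}
}

/**
 * WordPress-only wrapper: look up the preference of the current user.
 * Must be called after WordPress has set up the current user.
 */
function asb_sketch_current_user_action(): string {
	$stored = (string) get_user_meta( get_current_user_id(), 'asb_sketch_report_consent', true );

	return asb_sketch_consent_action( $stored );
}
```

Storing the preference as user meta rather than as a site-wide option is what keeps one user's opt-in from becoming the default for everyone else.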

krafit avatar Apr 12 '20 11:04 krafit

If we stay with our workflow (using the Google Form) we could pre-fill the form like this:

https://docs.google.com/forms/d/e/1FAIpQLSeQlKVZZYsF1qkKz7U78B2wy_6s6I7aNSdQc-DGpjeqWx70-A/viewform?c=0&w=1&entry.437446945=name%20of%20the%20commenter&entry.462884433=IP&entry.1346967038=Host&entry.121560485=email%20of%20the%20commenter&entry.1210529682=website%20of%20the%20commenter&entry.1837399577=content%20of%20the%20comment

The data in the URL is URL-encoded.

The user just needs to hit the "Send" button at the end of the page.
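A small sketch of how such a pre-filled URL could be assembled in plain PHP. The form URL and the `entry.*` IDs are taken from the link above. The function name and the short field keys (`name`, `ip`, ...) are made up for this example.

```php
<?php
/**
 * Hypothetical sketch: build the pre-filled Google Form URL.
 *
 * @param array $fields Values keyed by name, ip, host, email, website, content.
 * @return string URL with all values URL-encoded.
 */
function asb_sketch_prefilled_form_url( array $fields ): string {
	$base = 'https://docs.google.com/forms/d/e/1FAIpQLSeQlKVZZYsF1qkKz7U78B2wy_6s6I7aNSdQc-DGpjeqWx70-A/viewform';

	// Form entry IDs as they appear in the example link.
	$entries = array(
		'name'    => 'entry.437446945',
		'ip'      => 'entry.462884433',
		'host'    => 'entry.1346967038',
		'email'   => 'entry.121560485',
		'website' => 'entry.1210529682',
		'content' => 'entry.1837399577',
	);

	$query = array(
		'c' => 0,
		'w' => 1,
	);
	foreach ( $entries as $field => $entry ) {
		// Missing values become empty form fields.
		$query[ $entry ] = $fields[ $field ] ?? '';
	}

	// PHP_QUERY_RFC3986 encodes spaces as %20, like rawurlencode().
	return $base . '?' . http_build_query( $query, '', '&', PHP_QUERY_RFC3986 );
}

$url = asb_sketch_prefilled_form_url(
	array(
		'name'    => 'Jane Doe',
		'content' => 'Buy now & save',
	)
);
echo $url, "\n";
```

Letting `http_build_query()` do the encoding avoids encoding each value by hand and forgetting one.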

Zodiac1978 avatar Jul 21 '20 12:07 Zodiac1978

If someone wants to test this feature: Here is a working addon plugin:

<?php
/**
 * Plugin Name: Report Spam
 * Description: Addon for Antispam Bee to report spam.
 * Plugin URI:  https://torstenlandsiedel.de
 * Version:     1.0
 * Author:      Torsten Landsiedel
 * Author URI:  http://torstenlandsiedel.de
 * License:     GPL 2
 * License URI: http://opensource.org/licenses/GPL-2.0
 */

if ( ! defined( 'ABSPATH' ) ) {
	exit; // Exit if accessed directly.
}

/**
 * Add a comment action link to report spam to ASB.
 *
 * @param array      $actions Array of comment row actions.
 * @param WP_Comment $comment Comment object.
 * @return array Actions including the report link.
 */
function add_report_comment_action_link( $actions, $comment ) {

	// URL-encode comment data.
	$name    = rawurlencode( $comment->comment_author );
	$email   = rawurlencode( $comment->comment_author_email );
	$ip      = rawurlencode( $comment->comment_author_IP );
	// Resolve the host from the raw IP: an already encoded IPv6 address
	// (colons become %3A) is not valid input for gethostbyaddr().
	$host    = rawurlencode( (string) gethostbyaddr( $comment->comment_author_IP ) );
	$url     = rawurlencode( $comment->comment_author_url );
	$content = rawurlencode( $comment->comment_content );
	$agent   = rawurlencode( $comment->comment_agent );

	// Build action link.
	$target = ' target="_blank" ';
	$rel    = ' rel="noopener noreferrer" ';
	$href   = 'href="https://docs.google.com/forms/d/e/1FAIpQLSeQlKVZZYsF1qkKz7U78B2wy_6s6I7aNSdQc-DGpjeqWx70-A/viewform?c=0&w=1&entry.437446945=' . $name . '&entry.462884433=' . $ip . '&entry.1346967038=' . $host . '&entry.121560485=' . $email . '&entry.1210529682=' . $url . '&entry.1837399577=' . $content . '&entry.372858475=' . $agent . '" ';

	$action  = '';
	$action .= "<a $target $href $rel>";
	$action .= __( 'Report to Antispam Bee', 'antispam-bee' );
	$action .= '</a>';

	$actions['report_spam trash'] = $action;

	return $actions;
}
add_filter( 'comment_row_actions', 'add_report_comment_action_link', 10, 2 );

Zodiac1978 avatar Jul 22 '20 21:07 Zodiac1978

[Screenshot, 2020-07-22 23:21:17]

Zodiac1978 avatar Jul 22 '20 21:07 Zodiac1978

It includes the comment's user agent as a new item (the form has already been extended for this), and it resolves the host from the IP.

Zodiac1978 avatar Jul 22 '20 21:07 Zodiac1978

there could be some false positives

We could add a checkbox at the end of the form, "This is a false positive and not spam", which could be checked before sending the form. Although I don't think many people would use it ...

Zodiac1978 avatar Jul 23 '20 09:07 Zodiac1978

With regard to https://torstenlandsiedel.de/2021/01/31/antispam-bee-braucht-eure-juristische-hilfe/:

a) Self-hosted instead of Google, for sure (or at least a SaaS based within the EU, with a proper data processing agreement).

b) If consent is given by the submitter, everything is fine. Can the consent be withdrawn? Legally yes, factually no: once the data has been worked with, we could of course remove it from the list of submissions, yet the findings from the case remain. At least as long as the submission is taken care of in a timely manner ;-).

c) Regarding the receiving entity: this is indeed the biggest flaw, as we are acting as a GbR, which means that any random member of the GbR could be sued, fined, … This is the point where a discussion about changing the legal framework for the entity should take place. To stay focused on the matter, I'd suggest separating this from this issue. Happy to start this (indeed internal) discussion on our Slack channel.

To get hands-on: the link "Report to Antispam Bee" should ideally open a modal with all necessary information*, e.g. which data is submitted, where it will be stored and for how long, who will have access to it, and how it will be purged, as well as a note that the data is provided on a consensual basis. Finally, it needs a confirm and a decline button, the former of which then submits the data to a GDPR-compliant server for further processing.

*let me draft something later this week

stkjj avatar Feb 01 '21 07:02 stkjj

For further discussion a text for the modal (de/en):

Vielen Dank, dass Du uns hilfst, Antispam Bee besser zu machen.

Du bist gerade dabei, den Kommentar von [Name des Kommentators] mit dem Inhalt [Inhalt des Kommentars] an uns zu melden, da Du ihn für nicht erkannten Spam hältst. Folgende Daten haben wir außerdem in dem Kommentar gefunden, die wir für die Auswertung und die Heuristik von Antispam Bee verwerten werden:

  • [IP Adresse]
  • [Host]
  • [UserAgent]
  • [E-Mail-Adresse des Kommentators]
  • [Webseite des Kommentators]

Wir werten diese Daten [automatisiert|manuell] aus, um damit die Spamerkennung von Antispam Bee zu verbessern. Sofern wir mehrfach gleichlautende Meldungen über einen Spammer bekommen, nutzen wir diese Daten auch, um damit den Blacklist Updater zu aktualisieren. Die Daten werden von uns in den nächsten x [Stunden|Tagen] verarbeitet und danach automatisch gelöscht. Für den Zeitraum der Verarbeitung werden die Daten ausschließlich auf Servern mit Standort Deutschland gespeichert. Lediglich das Entwicklerteam von Antispam Bee hat darauf Zugriff. Um den Prozess schlank zu halten, bekommst Du von uns keine weitere Rückmeldung über die Verarbeitung, Speicherung oder Löschung, aber unser Dank wird Dir gewiss sein.

Wenn Du mit der Übermittlung dieser Daten einverstanden bist, kannst Du sie mit dem Button unten absenden. Button: Verwerfen / Button: Absenden


Thank you for helping us to improve Antispam Bee.

You are about to report the comment by [commenter name] with the content [content of the comment] to us, because you believe it is unrecognized spam. We also found the following data in the comment, which we will use for Antispam Bee's evaluation and heuristics:

  • [IP address]
  • [Host]
  • [UserAgent]
  • [email address of the commenter]
  • [website of the commenter]

We evaluate this data [automatically|manually] to improve the spam detection of Antispam Bee. If we receive multiple identical reports about a spammer, we also use this data to update Blacklist Updater. The data will be processed by us within the next x [hours|days] and then automatically deleted. For the period of processing, the data is stored exclusively on servers located in Germany. Only the Antispam Bee developer team has access to it. To keep the process lean, you will not receive any further feedback from us about the processing, storage or deletion, but please accept our thanks for your help.

If you agree to submit this data, you can send it using the button below. Button: Discard / Button: Submit

stkjj avatar Feb 01 '21 13:02 stkjj