Open-Assistant
Precheck user input before processing it
To ensure no abuse or spam, the Discord bot should pre-check all input coming in from the user. I wrote a function to solve this problem, but it can't be implemented yet because the bot is being rewritten. It should be implemented as soon as the rewrite is done. Please suggest changes, such as adding checks or removing unnecessary ones; the code can probably be written much more simply, so if you know libraries that do this better, please write them down. :)
from math import log2
import re


def check_user_input(
    text: str,
    max_length: int = 256,
    min_length: int = 1,
    search_repetitions: bool = False,
    max_randomness: float = 1.0,
    search_forbidden_words: bool = False,
    search_forbidden_characters: bool = False,
    forbidden_words: list | None = None,
    forbidden_characters: list | None = None,
) -> bool:
    """Validate user input to ensure no errors, spam, or abuse."""
    # avoid mutable default arguments
    forbidden_words = forbidden_words or []
    forbidden_characters = forbidden_characters or []
    if len(text) > max_length:
        return False
    if len(text) < min_length:
        return False
    if search_repetitions:
        max_counts = 5
        # adjacent repeated words, e.g. "spam spam"
        repetitions = re.findall(r"\b(\w+) \1\b", text)
        if len(repetitions) > max_counts:
            return False
    if max_randomness < 1.0:
        character_counts = {}
        for character in text:
            character_counts[character] = character_counts.get(character, 0) + 1
        total_characters = sum(character_counts.values())
        # Shannon entropy of the text (note the minus sign: entropy is non-negative)
        entropy = 0.0
        for count in character_counts.values():
            probability = count / total_characters
            entropy -= probability * log2(probability)
        if entropy > max_randomness:
            return False
    if search_forbidden_words and any(word in text for word in forbidden_words):
        return False
    if search_forbidden_characters and any(character in text for character in forbidden_characters):
        return False
    return True
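For intuition on the randomness check: Shannon entropy over character frequencies is 0 bits for a string of one repeated character and log2(n) bits when n distinct characters appear equally often. The same calculation can be sketched standalone (any thresholds you derive from it are illustrative):

```python
from collections import Counter
from math import log2

def shannon_entropy(text: str) -> float:
    """Bits per character: -sum(p * log2(p)) over character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# A single repeated character carries no information (0.0 bits),
# while four distinct equally frequent characters give 2.0 bits per character.
print(shannon_entropy("aaaaaaaa"))  # 0.0
print(shannon_entropy("abcd"))     # 2.0
```

Note that short, legitimate messages can easily land on either side of any fixed threshold, which is part of the false-positive concern raised below.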
I think a good place for spam/abuse detection is the backend, not the bot, probably best in the API. We could implement those checks as pydantic validators directly in the payload definitions.
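A minimal sketch of that idea, assuming a hypothetical `MessagePayload` model and pydantic's `validator` decorator (the field name and length bounds are illustrative, not the project's actual payload definitions):

```python
from pydantic import BaseModel, validator

class MessagePayload(BaseModel):
    """Hypothetical inbound message payload; the real models live in the API."""
    text: str

    @validator("text")
    def text_within_bounds(cls, value: str) -> str:
        # reject empty or over-long messages at the API boundary
        if not 1 <= len(value) <= 256:
            raise ValueError("message length out of bounds")
        return value
```

Validation then runs automatically whenever the payload is parsed, so the bot itself would need no separate pre-check function.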
That said, some restrictions make sense, such as max length, but IMO spam is so multi-faceted that simply putting a limit on entropy or something like that seems quite arbitrary and would probably cause false positives. I think we need a different method of detecting spam. We already have users voting and assigning labels, which seems a much more promising approach.
Can this be closed if we are definitely going with user voting/labels to remove spam for now?
yes