Create script for admins to manage "spammy phrases"
Problem
Following #6510, the only way to manage the list of "spammy phrases" will be by directly editing the DB. This requires special access (which limits who can do it), and is potentially dangerous (you don't want to mess up that DELETE query).
Description
As an alternative to https://github.com/openstreetmap/openstreetmap-website/issues/6511 (which proposes building a UI), we could have a simple Rake task that imports phrases from a text file. Admins could then automate it using chef.
This was proposed by @Firefishy at https://github.com/openstreetmap/openstreetmap-website/pull/6549#issuecomment-3580444354
@Firefishy @tomhughes - Any preferences as to the following?
- Location of the file: should we make it a standard location as opposed to a path given as a parameter? Thinking that if the file is provided via chef, a standard location should work. If so, what should this location be?
- Format of the file: I'm thinking text, one phrase per line. Any reason to do differently?
- "Destructiveness" of the update: on running the task, should the table `spammy_phrases` be truncated, to be fully replaced by the contents of the file?
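To make the file-format question more concrete, here is a minimal sketch of how a one-phrase-per-line file could be parsed. The helper name `parse_phrases`, the skipping of blank lines, and the `#` comment syntax are all assumptions for illustration; none of them has been agreed in this thread.

```ruby
# Hypothetical helper: turn the contents of a phrase file into a clean list.
# Assumes one phrase per line. Blank lines and lines starting with "#" are
# skipped (the comment syntax is an assumption, not an agreed format).
def parse_phrases(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#") }
      .uniq
end

sample = <<~TEXT
  # example phrases
  buy cheap followers

  best seo agency
  buy cheap followers
TEXT

phrases = parse_phrases(sample)
```

With this input, duplicates and the comment line are dropped, so `phrases` holds two entries in file order.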
As I said the problem with truncating is that every time you run it you'll wind up allocating new IDs to the new records, and new creation times, which isn't ideal.
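One way to avoid truncating would be to compute only the differences between what is stored and what is in the file. The following is a sketch in plain Ruby with no database involved; `plan_sync` is a hypothetical name, and in a real task the two resulting lists would presumably map onto inserts and deletes against the table.

```ruby
# Hypothetical helper: compare the phrases already stored with the phrases
# in the file and report only what needs to change. Phrases present in both
# lists are left alone, so their records would keep their IDs and creation
# times.
def plan_sync(existing, desired)
  {
    add: (desired - existing).uniq,    # in the file but not yet stored
    remove: (existing - desired).uniq  # stored but no longer in the file
  }
end

plan = plan_sync(["alpha", "beta"], ["beta", "gamma"])
```

Here "beta" appears in both lists, so it is absent from both `plan[:add]` and `plan[:remove]`.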
I don't understand the overall strategy here.
- If the list of words is always going to come from a static data file, and the spam scorer doesn't do any joins (it's just doing the equivalent of a `SELECT * from spammy_phrases` at the moment) then perhaps we should be using frozen_records? This will load them on app boot. It's what we do for the communities.
- If admins are going to be able to add/remove phrases individually, through some sort of UI, or if they are somehow being used in joins or other queries, then it makes sense to have a database table.
But from what I understand, we're trying to make the half-way house - the phrases are in the db, but then we want to just load them read-only from a data file. Is that right?
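For comparison, the first option above (static data loaded once at boot) can be illustrated without any database at all. This sketch uses only Ruby's standard YAML library rather than frozen_records itself; the function name `load_phrases`, the YAML layout, and the lowercasing step are assumptions for illustration.

```ruby
require "yaml"

# Hypothetical loader: read a YAML array of phrases once and freeze the
# result, so the list is read-only for the lifetime of the process.
# The YAML layout (a flat array of strings) is an assumption.
def load_phrases(yaml_text)
  phrases = YAML.safe_load(yaml_text) || []
  phrases.map { |phrase| phrase.to_s.strip.downcase.freeze }.uniq.freeze
end

phrases = load_phrases(<<~YAML)
  - Buy Cheap Followers
  - best seo agency
YAML
```

Because the array is frozen, any later attempt to append to it raises an error, which matches the "read-only from a data file" behaviour described above.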
I did question the whole logic in the other PR, but other people seemed keen and it didn't really seem worth spending time fighting over. I'm assuming there's no plan to add a significant number of entries, as otherwise performance would definitely become untenable.
I can't remember what the history of this whole feature was but in general I consider the whole approach of trying to build a set of "bad words" to be hopeless tilting at windmills.
> we're trying to make the half-way house - the phrases are in the db, but then we want to just load them read-only from a data file
Ugh, you are right. I should have stepped back for a moment when considering those proposals. That's on me.
> I consider the whole approach of trying to build a set of "bad words" to be hopeless tilting at windmills.
I agree that it's not great. However my understanding is that it's still helpful. At https://github.com/openstreetmap/openstreetmap-website/pull/6044#issuecomment-3385621582, @firefishy said:
> Yes, this will help to reduce the 100s of spam accounts that are created each day. This isn't the ultimate solution, but an intermediate measure. It should catch around 15% of the spam accounts.
I understand that the "15%" figure is an estimate and I don't intend to take it literally. What I do take from the comment is that the feature can lead to a noticeable reduction in the manual anti-spam work, while other solutions are studied.
I don't have access to the data or the admin/moderation interfaces, so I'm flying blind here and can only go with what those with access say. If an admin says that this can land us a noticeable reduction, that's a relatively concrete data point that I can work with. It's also a relatively simple thing to implement, so it scores just ok under "High Impact" and very well under "Low Effort". To me this sounds good enough to give it a go.
A specific counter-argument is that the list of phrases could grow too big. I understand that. My thinking is that this isn't a user-facing feature, but one for the admins. If and when things get out of hand, admins can rein it in themselves. With any luck, we'll have other solutions ready by then. This is only a temporary stopgap to help reduce actual, measurable, manual work that is currently taking place.
Honest questions: does my logic make sense? Are there other opinions, feedback, or data points to contradict it or make a different argument? I am happy to ditch the whole thing if it really doesn't make sense.
Should this feature stay, there's the issue of the implementation detail. I would be happy to revert the migration and return to just holding the phrases in memory. Before that I have to ask: what would admins, as its users, prefer for an interface?
- A web interface, accessible by admins.
- Load from a file.
- Both? Other?
I don't have a technical opinion on how the backend is best implemented (DB / Data File / YAML / etc). I only care about the outcome: having a managed list of words influence the spam score of user profiles. Ultimately, higher spam scores lead to more SEO spam users being automatically suspended after crossing the spam score threshold.
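As a rough illustration of that outcome, here is a sketch of how a phrase list could feed into a score that is then compared with a threshold. This is not the actual scorer in the codebase: the function names, the 40 points per match, and the threshold of 100 are all placeholder assumptions.

```ruby
# Hypothetical scoring: every phrase found in the text adds a fixed number
# of points. The default of 40 points is a placeholder, not a real setting.
def phrase_score(text, phrases, points_per_match: 40)
  normalized = text.downcase
  phrases.count { |phrase| normalized.include?(phrase.downcase) } * points_per_match
end

# Hypothetical check: treat a score at or above the threshold as crossing it.
# The default threshold of 100 is also a placeholder.
def over_threshold?(score, threshold: 100)
  score >= threshold
end

phrases = ["best seo agency", "buy cheap followers", "casino"]
score = phrase_score("Visit our BEST SEO agency to buy cheap followers", phrases)
```

In this example two of the three phrases match, so the score stays below the placeholder threshold.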
My preference would be a file (managed via chef) or a rake task (triggered via chef) that updates the list, rather than a UI.
I have WIP standalone code which builds a list of spammy words based on which SEO spam users have been suspended (or deleted) in the past. In future, the code could also be modified to produce a regex or a Bayesian-filter-style word/score list.
Wouldn't a chef-managed file mean that the list is public, or are you thinking about getting the list itself from somewhere else and just having chef apply it?
There is a separate private part of chef, though that's mostly (possibly only) data bags, but we could use one of those.
Unless anyone disagrees (and please do voice it), I intend to leave this as is and move on to a different piece of work.
On one hand, my implementation is not great as discussed above. There's no point in using the DB when we don't need a web interface. A static file would have sufficed and would probably be more performant.
On the other hand, the feature is already in use. Admins have added lists of phrases to the DB, and these are being used for spam scoring as I write these lines. Let's wait and see how this works out. Perhaps it's completely useless and we decide we may as well scrap the whole feature. Or perhaps we think it's successful, and then we can look into potential improvements, which may include moving to in-memory data.
Additionally, I experimented a bit with our Chef config, to see what it would take to get this working with a "data bag", etc. I cannot get the thing to run (issue filed at https://github.com/openstreetmap/chef/issues/816), so that would delay me even longer. I don't think it's worth it at this time.