automate spam detection
This PR attempts to automate the spam detection process for Job, Event, Codebase and MemberProfile objects using an external LLM service.
LLM Spam Detection Process
- A SpamModeration record with status SCHEDULED_FOR_CHECK is stored on every Job, Event, Codebase submission and User creation (the SpamModeration object is attached to the associated MemberProfile).
- A decoupled external service, asuworks/comses.spamcheck, queries for these SpamModeration objects (api/spam/get-latest-batch/), analyzes them for spam and submits a spam report to api/spam/update for each one of them.
- The handler for api/spam/update on the CoMSES side updates the corresponding SpamModeration object according to the LLM report from the external service.
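To make the last step concrete, here is a minimal sketch of how a handler could apply one LLM report to a moderation record. It is not the code in this PR: the `SpamModerationRecord` dataclass stands in for the Django model, the report keys (`is_spam`, `reasoning`), the `NOT_SPAM_LIKELY` status and the lowercase status values are assumptions made for illustration. Only SCHEDULED_FOR_CHECK and "spam_likely" appear in this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(str, Enum):
    # SCHEDULED_FOR_CHECK and "spam_likely" are named in this PR thread;
    # the other name and value are hypothetical placeholders.
    SCHEDULED_FOR_CHECK = "scheduled_for_check"
    SPAM_LIKELY = "spam_likely"
    NOT_SPAM_LIKELY = "not_spam_likely"


@dataclass
class SpamModerationRecord:
    """Plain-Python stand-in for the SpamModeration model (assumption)."""
    id: int
    status: Status = Status.SCHEDULED_FOR_CHECK
    details: Optional[dict] = None


def apply_report(record, report):
    """Update a record from one LLM report; the report shape is assumed."""
    # Assumption: only records still waiting for a check are touched, so a
    # later curator decision is not overwritten by a late report.
    if record.status is not Status.SCHEDULED_FOR_CHECK:
        return record
    record.status = (
        Status.SPAM_LIKELY if report["is_spam"] else Status.NOT_SPAM_LIKELY
    )
    record.details = {"reasoning": report.get("reasoning", "")}
    return record
```

In the real handler the same decision would presumably be made on the model instance and followed by a `save()`.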
Starting the LLM Spam Detection Process
The external service asuworks/comses.spamcheck is deployed on an existing JetStream2 instance. The following management command unshelves the instance, triggers the spam check workflow, and shelves the instance automatically once the workflow is done:
./manage.py curator_llm_spam_check
# Following flags are available for troubleshooting purposes:
--skip-changing-instance-state # will skip changing the status of the JetStream2 instance
--skip-shelving-when-done # will start the JetStream2 instance, execute the `CheckSpamWorkflow`, but not shelve it
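As a rough illustration of how the two flags change what the command does, the sketch below parses them with plain `argparse` (Django management commands receive an argparse-based parser in `add_arguments`) and returns the resulting sequence of steps. The `plan_steps` helper and the step labels are hypothetical and are not part of this PR.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="curator_llm_spam_check")
    parser.add_argument("--skip-changing-instance-state", action="store_true")
    parser.add_argument("--skip-shelving-when-done", action="store_true")
    return parser


def plan_steps(argv):
    """Return the ordered steps the command would run for the given flags."""
    opts = build_parser().parse_args(argv)
    steps = []
    if not opts.skip_changing_instance_state:
        steps.append("unshelve")
    steps.append("run CheckSpamWorkflow")
    # Shelving is skipped by either flag: one leaves the instance state
    # alone entirely, the other only suppresses the final shelve.
    if not (opts.skip_changing_instance_state or opts.skip_shelving_when_done):
        steps.append("shelve")
    return steps
```

For example, `plan_steps(["--skip-shelving-when-done"])` leaves out only the final shelve step, which matches the troubleshooting use case described above.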
Environment & Secrets
The following environment variables must be set:
LLM_SPAM_CHECK_API_URL=http://<JetStream2 instance IP>:8001
LLM_SPAM_CHECK_JETSTREAM_SERVER_ID=<JetStream2 instance ID>
JetStream2 Credentials
Application credentials can be found at https://js2.jetstream-cloud.org/identity/application_credentials/ and are stored in the following secrets:
secrets/llm_spam_check_jetstream_os_application_credential_secret
secrets/llm_spam_check_jetstream_os_application_credential_id
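A minimal sketch of collecting these settings in one place, assuming the secrets are plain text files in a `secrets/` directory. The `load_spam_check_config` and `read_secret` helpers and the dictionary keys are hypothetical; the environment variable and file names are the ones listed above.

```python
from pathlib import Path


def read_secret(secrets_dir, name):
    """Read one secret file and strip the trailing newline."""
    return (Path(secrets_dir) / name).read_text().strip()


def load_spam_check_config(environ, secrets_dir="secrets"):
    """Gather everything the spam check needs; raises KeyError or
    FileNotFoundError early if a variable or secret file is missing."""
    prefix = "llm_spam_check_jetstream_os_application_credential"
    return {
        "api_url": environ["LLM_SPAM_CHECK_API_URL"],
        "server_id": environ["LLM_SPAM_CHECK_JETSTREAM_SERVER_ID"],
        "credential_id": read_secret(secrets_dir, f"{prefix}_id"),
        "credential_secret": read_secret(secrets_dir, f"{prefix}_secret"),
    }
```

Calling it as `load_spam_check_config(os.environ)` at startup would surface a missing variable before the JetStream2 instance is touched.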
X-API-Key header for the API
Access to api/spam/update and api/spam/get-latest-batch routes is protected by the X-API-Key header verification.
The key should be set in secrets/llm_spam_check_api_key
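The header check could look roughly like the sketch below. This is an assumption about the approach, not the code in this PR: `has_valid_api_key` is a hypothetical helper, and `hmac.compare_digest` is used so the comparison takes constant time.

```python
import hmac


def has_valid_api_key(headers, expected_key):
    """Return True only if the X-API-Key header matches the configured key."""
    # A plain dict is case-sensitive; Django's request.headers is not.
    provided = headers.get("X-API-Key", "")
    # Reject everything when no key is configured, so an empty secret
    # never matches an empty or missing header.
    if not expected_key:
        return False
    return hmac.compare_digest(provided.encode(), expected_key.encode())
```

In a view, a failed check would typically result in a 401 or 403 response before any SpamModeration data is read or written.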
ALLOWED_HOSTS
The IP of the JetStream2 instance must be added to Django's ALLOWED_HOSTS
we would need to add the jetstream instance ip to ALLOWED_HOSTS
just missing the migration for SpamModeration.status, which I think should turn anything with "unreviewed" status into "spam_likely" status.
I'll read through it again but it seemed all good on a first pass, besides having a way to kick off the process
Regarding the way to start the process from the CoMSES side:
- start the instance with openstack-cli
- poll the service until HEALTHY
- keep polling the service until the workflow is COMPLETED
- shut down the instance with openstack-cli
something like this?
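The four steps above can be sketched as a small polling loop. This is only an illustration of the proposed flow: the four callables (`start_instance`, `get_health`, `get_workflow_status`, `stop_instance`) are hypothetical placeholders for the OpenStack and service calls, and the timeout and interval values are arbitrary.

```python
import time


def wait_until(check, expected, timeout_s, interval_s, sleep=time.sleep):
    """Call check() repeatedly until it returns expected or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check() == expected:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"never reached {expected!r}")
        sleep(interval_s)


def run_spam_check(start_instance, get_health, get_workflow_status,
                   stop_instance, timeout_s=1800, interval_s=10,
                   sleep=time.sleep):
    """Start the instance, wait for HEALTHY, wait for COMPLETED, shut down."""
    start_instance()
    try:
        wait_until(get_health, "HEALTHY", timeout_s, interval_s, sleep)
        wait_until(get_workflow_status, "COMPLETED", timeout_s, interval_s, sleep)
    finally:
        # Always shut the instance down, even on timeout or error, so it
        # does not keep running after a failed check.
        stop_instance()
```

Putting the shutdown in `finally` is a design choice of this sketch; the troubleshooting flags mentioned earlier would need a way to bypass it.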
@asuworks I just remembered there was some additional cleanup I wanted to do eventually with the spam stuff. This might be a good place to get that done if you are up for it. comses/planning#249. Namely the second point (refactoring the serializer mixin to actually be just a mixin)
added "one-click" install of the comses.spamcheck with ansible: https://github.com/asuworks/comses.spamcheck/tree/main/deploy
The script does the following:
- basic server configuration: harden ssh, fail2ban
- install ollama and load llama3.1, llama3.2
- clone and run comses.temporal
- clone and run comses.spamcheck
After the ansible playbook is done, the management commands from CoMSES should be able to trigger CheckSpamWorkflow.