DOC Uncensored model

Open romanlutz opened this issue 10 months ago • 1 comment

We should document how to get an uncensored model. @KutalVolkan has thoughts 🙂

Discussed in https://github.com/Azure/PyRIT/discussions/370

Originally posted by mantmishra September 12, 2024

I'm using OpenAI GPT-4o as the attacker LLM because it is the highest-ranked model in most benchmarks. However, it refuses to do prompt injections in almost all strategies, citing "It's not able to assist with the task", likely due to safeguards put in place by OpenAI. Fine-tuning the model with adversarial examples doesn't work either, because the OpenAI endpoint returns the error "The job failed due to an invalid training file. This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies." Has anyone found a workaround for this issue? Which alternative model with fewer safeguards can be used as the attacker LLM?

Mar 14 '25 23:03 romanlutz

Hello Everyone,

I've shared my thoughts on this topic in the PyRIT Discord. Feel free to join the discussion here.

Unfortunately, I haven’t had much time to pursue this further as I’m currently focused on building a PoC for work, and university is keeping me busy too. My last exam is on April 4, after which I’ll have more time to tackle this properly and write a detailed blog post on the topic.

If anyone else wants to work on this before then, feel free to jump in! 😊

Mar 16 '25 11:03 KutalVolkan