
DOC Uncensored model

Open: romanlutz opened this issue 10 months ago • 1 comment

We should document how to get an uncensored model. @KutalVolkan has thoughts 🙂

Discussed in https://github.com/Azure/PyRIT/discussions/370

Originally posted by mantmishra September 12, 2024

I'm using OpenAI GPT-4o as the attacker LLM because it is the highest-ranked model in most benchmarks. However, it refuses to perform prompt injections in almost all strategies, responding that it is "not able to assist with the task", likely due to safeguards put in place by OpenAI. Fine-tuning the model with adversarial examples doesn't work either, because the OpenAI endpoint returns the error "The job failed due to an invalid training file. This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies." Has anyone found a workaround for this issue? Which alternative model with fewer safeguards could be used as the attacker LLM?

romanlutz, Mar 14 '25 23:03

Hello everyone,

I've shared my thoughts on this topic in the PyRIT Discord; feel free to join the discussion there.

Unfortunately, I haven’t had much time to pursue this further, as I’m currently focused on building a PoC for work and university is keeping me busy too. My last exam is on April 4, after which I’ll have more time to tackle this properly and write a detailed blog post on the topic.

If anyone else wants to work on this before then, feel free to jump in! 😊

KutalVolkan, Mar 16 '25 11:03