Improving the Greedy Coordinate Gradient implementation
The existing implementation under pyrit.auxiliary_attacks has lots of potential for improvement. Below are some possible directions to explore:
Over the past several months, I've been exploring and applying GCG algorithms. Based on what I've learned, I'd like to suggest taking a look at this project as a potential reference for improving the GCG implementation in PyRIT: https://github.com/GraySwanAI/nanoGCG.
From what I can tell, PyRIT's GCG implementation appears to follow the original authors' approach. However, the original implementation has some limitations: it's quite slow and not very user-friendly. Additionally, since it relies on the outdated FastChat library, it's difficult to extend to newer or alternative models.
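For reference, this is roughly what driving nanoGCG looks like based on my reading of its README (a sketch, not a verified snippet; the exact GCGConfig fields and result attributes may differ between versions, and the model id and prompts are placeholders):

```python
import torch
import nanogcg
from nanogcg import GCGConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; any causal Hugging Face chat model should work.
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder objective and target; in PyRIT these would come from the attack configuration.
message = "<red-team objective prompt>"
target = "Sure, here is"

config = GCGConfig(num_steps=250, search_width=512, topk=256, seed=42)
result = nanogcg.run(model, tokenizer, message, target, config)

print(result.best_string, result.best_loss)
```

The appeal here is the interface: a single config object and a single run call, with no FastChat conversation templates involved.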
I'm happy to help with further improvements if needed!
From what I understand, nanoGCG only works with causal models, or am I misremembering?
Never mind, I just came to that conclusion from a cursory glance a few days ago because of this line in the README:
"nanoGCG is a lightweight but full-featured implementation of the GCG (Greedy Coordinate Gradient) algorithm. This implementation can be used to optimize adversarial strings on causal Hugging Face models."
My main worry with nanoGCG from a quick look is that it only optimizes against a single model. It looks like I wasn't the first to run into this problem, and someone has already written code for it: https://github.com/GraySwanAI/nanoGCG/issues/32 (but it only handles 2 models, not an arbitrary number of models).
nanoGCG also doesn't seem to support optimizing a suffix over multiple prompts at the same time 🙁 https://github.com/GraySwanAI/nanoGCG/issues/21
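For reference, the property worth preserving is that one shared suffix is optimized against every (model, prompt, target) pair at once, with gradients and candidate losses summed across pairs. Below is a minimal sketch of a single GCG step under that setup; `pairs` and `per_pair_loss` are hypothetical placeholders for whatever PyRIT's implementation uses internally, so this illustrates the structure rather than being drop-in code:

```python
import torch

def gcg_step(suffix_ids, vocab_size, pairs, per_pair_loss,
             topk=256, num_candidates=512):
    """One GCG step aggregated over all (model, prompt_ids, target_ids) pairs,
    so the suffix stays universal across models and prompts.

    per_pair_loss(model, prompt_ids, suffix_one_hot, target_ids) is a
    hypothetical helper returning the target loss for one pair.
    """
    # Differentiable one-hot view of the current suffix tokens.
    one_hot = torch.nn.functional.one_hot(suffix_ids, vocab_size).float()
    one_hot.requires_grad_(True)

    # Sum the loss (and therefore the gradient) over every pair.
    total_loss = sum(per_pair_loss(m, p, one_hot, t) for m, p, t in pairs)
    total_loss.backward()

    # Top-k replacement tokens per suffix position, ranked by negative gradient.
    top_tokens = (-one_hot.grad).topk(topk, dim=-1).indices

    # Sample candidate suffixes that each swap a single position for a top-k token.
    candidates = []
    for _ in range(num_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        tok = top_tokens[pos, torch.randint(topk, (1,)).item()]
        cand = suffix_ids.clone()
        cand[pos] = tok
        candidates.append(cand)

    # Keep the candidate with the lowest summed loss across all pairs.
    def summed_loss(cand):
        cand_one_hot = torch.nn.functional.one_hot(cand, vocab_size).float()
        with torch.no_grad():
            return sum(per_pair_loss(m, p, cand_one_hot, t) for m, p, t in pairs).item()

    return min(candidates, key=summed_loss)
```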
Hmm, with all that in mind it's probably better to stick with the existing GCG implementation and clean it up/improve it. That said, there's no harm in taking inspiration from the other versions in terms of input configuration, getting rid of FastChat, and other dimensions they have apparently improved on (as long as it doesn't break multi-model and multi-prompt support).
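On the FastChat point specifically, one option (a sketch, assuming the target models ship chat templates with their tokenizers) is to build prompts with Hugging Face's `apply_chat_template` instead of FastChat conversation templates:

```python
from transformers import AutoTokenizer

# Placeholder model id; any model whose tokenizer ships a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "user", "content": "<objective prompt> <adversarial suffix>"},
]

# Builds the model-specific prompt string without any FastChat dependency.
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)
```

This would also make it easier to extend to newer models, since the template lives with the model rather than in a separately maintained library.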