`logit_bias` no longer affects `logprobs`
Hi @chawins,
Thank you for this interesting work! I was wondering how this attack would work now that `logit_bias` no longer affects `logprobs` [1][2], as we can no longer use the trick to obtain the logprobs of the target tokens (if they don't appear in the top-5). Would love to know your thoughts on this change; thanks!
Apoorva
[1] "Logit_bias does not work now", OpenAI Community
[2] Brian Huang, x.com
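
For readers unfamiliar with the trick mentioned above, here is a minimal sketch of the arithmetic behind it. It assumes the earlier behaviour described in this thread: the bias is added to the target token's logit before the softmax, and the returned logprobs reflect the biased distribution. The function names `log_softmax` and `unbiased_logprob` are hypothetical helpers for illustration, not part of any API or of the authors' code.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a small list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def unbiased_logprob(biased_logprob: float, bias: float) -> float:
    """Recover a token's original logprob from its logprob under a logit bias.

    Assumption: `bias` was added to this one token's logit only.
    With p the original probability and p_b the biased one,
        p_b = e^b * p / (1 - p + e^b * p)
    which inverts to
        p = p_b / (e^b * (1 - p_b) + p_b).
    """
    p_b = math.exp(biased_logprob)
    p = p_b / (math.exp(bias) * (1.0 - p_b) + p_b)
    return math.log(p)


# Toy example: the target token (index 2) is far outside the top tokens.
logits = [2.0, 1.0, -3.0]
bias = 10.0
biased = logits[:2] + [logits[2] + bias]  # bias pushes the target to the top
recovered = unbiased_logprob(log_softmax(biased)[2], bias)
```

Once the API stops applying the bias to the reported logprobs, the biased value is never observed, so this inversion has nothing to work with.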
Yeah, we're aware of the issue. I have an ad-hoc idea for how to fix this, and I'm currently testing it out. In short, instead of maximizing the probability of the "Sure" token, you can try to minimize the probability of tokens other than "Sure", such as refusal tokens. The preliminary results are pretty promising, but I want to test the idea on more models/APIs.
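
One possible reading of this idea, as a rough sketch rather than the actual implementation being tested: when the target token is missing from the returned top-k, the visible tokens still bound its probability from above, since p(target) <= 1 - sum of the visible probabilities. Minimizing the visible non-target mass therefore raises that bound. The function name `surrogate_loss`, the default target string, and the clamping constant are all assumptions made for illustration.

```python
import math
from typing import Mapping


def surrogate_loss(top_logprobs: Mapping[str, float],
                   target: str = "Sure",
                   eps: float = 1e-12) -> float:
    """Loss to minimize over adversarial suffixes (lower is better).

    top_logprobs: token -> logprob for the top-k tokens at the first
    generated position, as returned by the API (hypothetical input format).
    """
    if target in top_logprobs:
        # Target is visible: use its negative log-probability directly.
        return -top_logprobs[target]
    # Target is hidden: every visible token is a non-target token.
    other_mass = sum(math.exp(lp) for lp in top_logprobs.values())
    # 1 - other_mass is an upper bound on p(target); clamp to avoid log(0).
    upper_bound = max(1.0 - other_mass, eps)
    return -math.log(upper_bound)


# Toy usage with made-up top-2 logprobs.
visible = surrogate_loss({"Sure": math.log(0.5), "I": math.log(0.3)})
hidden = surrogate_loss({"I": math.log(0.6), "Sorry": math.log(0.3)})
hidden_better = surrogate_loss({"I": math.log(0.4), "Sorry": math.log(0.2)})
```

In the hidden case the value is only a lower bound on the true negative log-probability of the target, so a decrease does not guarantee that the target itself became more likely; the remaining mass could go to other unseen tokens.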