prompttools
Robustness evaluation
🚀 The feature
Request from a potential user: "There are two main aspects: 1) adjusting prompts so that changing semantically equivalent words does not trigger hallucination, and 2) making the prompt itself such that the LLM doesn't slip away from the instruction."
Idea: for (1), use prompt templates to substitute words, then run evals to check the semantic similarity of all results. For (2), use auto-evaluation: given the instruction, prompt, and response, determine whether the LLM followed the instruction.
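To make the two ideas concrete, here is a minimal, self-contained sketch. It does not use the prompttools API; the model call (`fake_llm`), the judge (`fake_judge`), the bag-of-words similarity metric, and the 0.8 threshold are all stand-in assumptions — a real setup would call an actual LLM and an embedding-based similarity function.

```python
import math
from collections import Counter
from itertools import combinations
from string import Template


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def fake_llm(prompt: str) -> str:
    """Stub model: swap in a real completion call."""
    return "Paris is the capital of France."


# (1) Substitution robustness: fill a template with near-synonyms and
# check that the responses stay semantically similar to each other.
template = Template("Name the $adjective city of France.")
prompts = [template.substitute(adjective=w) for w in ("capital", "principal", "main")]
responses = [fake_llm(p) for p in prompts]
pair_scores = [bow_cosine(a, b) for a, b in combinations(responses, 2)]
is_robust = min(pair_scores) >= 0.8  # threshold is an arbitrary example


# (2) Auto-evaluation: ask a judge model whether the response followed
# the instruction, given (instruction, prompt, response).
def build_judge_prompt(instruction: str, prompt: str, response: str) -> str:
    return (
        "Did the response follow the instruction? Answer YES or NO.\n"
        f"Instruction: {instruction}\nPrompt: {prompt}\nResponse: {response}"
    )


def fake_judge(judge_prompt: str) -> str:
    """Stub judge: a real setup would call a second LLM here."""
    return "YES"


verdict = fake_judge(build_judge_prompt("Answer in one sentence.", prompts[0], responses[0]))
followed = verdict.strip().upper().startswith("YES")
```

With a real model, part (1) flags prompts whose paraphrases produce divergent answers, and part (2) turns the judge's YES/NO into a boolean metric that can be aggregated across a test set.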
Motivation, pitch
We received this request from a potential user; robustness is also a common concern in LLM evaluation.
Alternatives
No response
Additional context
No response
Hey, I am working on this issue and have written two sample scripts: one for prompt substitution and one for auto-evaluation. I am using Promptbench in both. Could you guide me on how to integrate these scripts into your experiments? I am joining your Discord server so we can discuss this issue in detail.