langtest
Explore MS promptbench
Explore the new tool released by Microsoft for evaluating LLMs.
Brief description:
It consists of a wide range of LLMs and evaluation datasets, covering diverse tasks, evaluation protocols, adversarial prompt attacks, and prompt engineering techniques. As a holistic library, it also provides several analysis tools for interpreting the results. It is designed in a modular fashion, allowing users to build evaluation pipelines for custom projects.
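To make the "modular evaluation pipeline" idea concrete, here is a minimal sketch of how such a pipeline is typically composed (dataset, prompt template, model, metric as swappable parts). All names here are illustrative stand-ins, not promptbench's actual API:

```python
# Illustrative sketch of a modular LLM evaluation pipeline.
# The function and variable names are hypothetical, not promptbench's API.

def exact_match(prediction: str, label: str) -> float:
    """Score 1.0 if the prediction matches the label, else 0.0."""
    return float(prediction.strip().lower() == label.strip().lower())

def evaluate(model, dataset, prompt_template, metric):
    """Format each example into a prompt, query the model, and average the metric."""
    scores = []
    for example in dataset:
        prompt = prompt_template.format(**example)
        prediction = model(prompt)
        scores.append(metric(prediction, example["label"]))
    return sum(scores) / len(scores)

# Toy stand-ins so the pipeline runs end to end without any external service.
dataset = [
    {"text": "great movie", "label": "positive"},
    {"text": "terrible plot", "label": "negative"},
]
template = "Classify the sentiment of: {text}"

def toy_model(prompt: str) -> str:
    # A fake "LLM" for demonstration only.
    return "positive" if "great" in prompt else "negative"

accuracy = evaluate(toy_model, dataset, template, exact_match)
print(accuracy)  # 1.0
```

Because each component (model, dataset, template, metric) is passed in as an argument, any one of them can be swapped, e.g. replacing the template with an adversarially perturbed one to test prompt robustness.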
So, I think we should look into the techniques they use to evaluate models, as well as the datasets and tasks they support and the analysis tools for interpreting the results.
GitHub link: promptbench