llm-evaluation-framework topic
promptfoo
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
deepeval
The LLM Evaluation Framework
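As a flavor of what such a framework looks like in practice, here is a minimal deepeval-style check: a test case with an input, the model's actual output, and retrieval context, scored by a relevancy metric. The field and metric names follow deepeval's documented patterns, but treat the exact imports and arguments as a sketch from memory rather than a verified snippet; the built-in metrics call an LLM judge, so an API key is expected in the environment.

```python
# Minimal sketch of a deepeval-style evaluation (names assumed from deepeval's
# documented usage; default metrics call an LLM judge, so OPENAI_API_KEY is
# expected to be set).
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days for a full refund.",
    retrieval_context=["All purchases can be returned within 30 days."],
)

# Score how relevant the answer is to the input; the case fails below the threshold.
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```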
parea-sdk-py
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications - Parea AI (YC S23)
MixEval
The official evaluation suite and dynamic data release for MixEval.
KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
fm-leaderboarder
FM-Leaderboard-er lets you build a leaderboard to find the best LLM/prompt combination for your own business use case, based on your own data, tasks, and prompts.