llm-evaluation-toolkit topic
langtest
Deliver safe & effective language models
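A minimal sketch of driving langtest, assuming the `Harness` pattern from the project's quick-start (the task name, the `model`/`hub` spec dict, and the chained generate/run/report calls follow that documented pattern; the example model is an arbitrary choice):

```python
# Minimal langtest sketch, assuming the documented Harness pattern.
# pip install langtest
from langtest import Harness

# Wrap a Hugging Face NER model in a test harness; the "model" and
# "hub" keys follow langtest's model-spec convention (assumption).
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
)

harness.generate()        # generate robustness/bias test cases
harness.run()             # run the model against the generated cases
print(harness.report())   # pass/fail summary per test category
```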
athina-evals
Python SDK for running evaluations on LLM generated responses
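A hedged sketch of athina-evals' preset-evaluator pattern; the class name, key-setting helper, and `run()` keywords below are recalled from the project's quick-start and should be treated as assumptions:

```python
# Hedged athina-evals sketch; class and method names are assumptions
# based on the project's quick-start, not a verified API reference.
import os
from athina.keys import OpenAiApiKey
from athina.evals import DoesResponseAnswerQuery

OpenAiApiKey.set_key(os.environ["OPENAI_API_KEY"])

# Run a single preset evaluation on one query/response pair.
result = DoesResponseAnswerQuery().run(
    query="Where is the Eiffel Tower?",
    response="The Eiffel Tower is in Paris.",
)
print(result)
```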
just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
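just-eval's own interface isn't reproduced here; the following is a generic sketch of the underlying technique (multi-aspect GPT judging) using the OpenAI client, with a hypothetical aspect list and an illustrative scoring prompt:

```python
# Generic multi-aspect GPT-judge sketch (NOT just-eval's API); the
# aspect names and prompt wording are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
ASPECTS = ["helpfulness", "clarity", "factuality", "depth"]  # hypothetical

def judge(query: str, response: str) -> dict:
    prompt = (
        "Rate the response to the query on each aspect from 1 to 5 and "
        "give a one-sentence reason per aspect. Respond as JSON mapping "
        f"each of {ASPECTS} to {{'score': int, 'reason': str}}.\n\n"
        f"Query: {query}\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # Per-aspect scores plus reasons make the verdict interpretable.
    return json.loads(out.choices[0].message.content)
```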
parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
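A hedged sketch of parea-sdk instrumentation; the `Parea` client and the `trace` decorator follow the SDK's quick-start as recalled, so treat the names and arguments as assumptions:

```python
# Hedged parea-sdk sketch; Parea() and @trace are assumptions based on
# the SDK's quick-start, not a verified API reference.
import os
from parea import Parea, trace

p = Parea(api_key=os.environ["PAREA_API_KEY"])

@trace  # records inputs, outputs, and latency for this call
def generate_answer(question: str) -> str:
    # Call your LLM here; a stub keeps the sketch self-contained.
    return "stub answer to: " + question

print(generate_answer("What does Parea monitor?"))
```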
KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
qa_metrics
An easy Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model promp...
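qa_metrics' own import paths aren't shown here; below is a self-contained sketch of the two standardized lexical metrics such a package covers, SQuAD-style exact match and token-level F1:

```python
# Standalone EM / token-F1 reference implementations (SQuAD-style);
# qa_metrics' own function names may differ -- this shows the metrics,
# not the package's API.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # True
print(round(token_f1("in Paris, France", "Paris"), 2))   # 0.5
```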