evals topic

List evals repositories

phoenix

Stars: 8.2k, Forks: 676, Watchers: 8.2k

AI Observability & Evaluation

Topics: ml-observability

agentops

Stars: 5.2k, Forks: 505, Watchers: 5.2k

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and Ca...

langfuse

Stars: 19.2k, Forks: 1.9k, Watchers: 19.2k

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

vivaria

Stars: 53, Forks: 15

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

lmnr

Stars: 2.5k, Forks: 156, Watchers: 2.5k

Laminar: an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

rag-evaluator

Stars: 21, Forks: 13

A library for evaluating Retrieval-Augmented Generation (RAG) systems (the traditional ways).

mastra

Stars: 11.2k, Forks: 529

The TypeScript AI agent framework. ⚡ Assistants, RAG, observability. Supports any LLM: GPT-4, Claude, Gemini, Llama.

promptpex

Stars: 143, Forks: 20, Watchers: 143

Test Generation for Prompts

HourVideo

Stars: 161, Forks: 4, Watchers: 161

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

Topics: 1-hour-video-language-understanding, benchmark-dataset, egocentric-videos

stress-tests

Stars: 19, Forks: 5, Watchers: 19

A collection of particularly difficult test scenarios for evaluating browser-use.