llms-benchmarking topic
cc_flows
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
ChemLLMBench
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
resta
Restore safety in fine-tuned language models through task arithmetic
parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
CompBench
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, st...
cobbler
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
XMainframe
Language Model for Mainframe Modernization
BackdoorLLM
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
text-embedding-evaluation