llms-benchmarking topic

List llms-benchmarking repositories

cc_flows

31 stars, 2 forks

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

ChemLLMBench

119 stars, 5 forks

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

chem-bench

48 stars, 5 forks

How good are LLMs at chemistry?

resta

25 stars, 1 fork

Restore safety in fine-tuned language models through task arithmetic

parea-sdk-py

74 stars, 6 forks

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

CompBench

30 stars, 1 fork

CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, st...

cobbler

15 stars, 1 fork

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

XMainframe

42 stars, 3 forks

Language Model for Mainframe Modernization

BackdoorLLM

35 stars, 4 forks

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

text-embedding-evaluation

15 stars, 1 fork

Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️