llm-benchmarking topic
llm4regression
Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter updates.
LLM-Research
A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks.
pint-benchmark
A benchmark for prompt injection detection systems.
LLMEvaluation
A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for various use cases and to promote the adoption of best practices in LLM assessment.
fm-leaderboarder
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
MJ-Bench
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Awesome-Code-Benchmark
A comprehensive review of code-domain benchmarks for LLM research.
enterprise-deep-research
Salesforce Enterprise Deep Research
confabulations
A document-based benchmark for hallucinations (confabulations) in RAG, including human-verified questions and answers.
BizFinBench
A business-driven, real-world financial benchmark for evaluating LLMs.