crewAI
Submit tests to GAIA benchmark
GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities from added tooling, efficient prompting, access to search, etc.).
Paper: https://arxiv.org/abs/2311.12983
We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks of targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard.
Leaderboard: https://huggingface.co/spaces/gaia-benchmark/leaderboard
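For reference, a rough sketch (untested) of how a submission run could be wired up: loop a single crewAI agent over the GAIA questions and write answers in the leaderboard's JSONL format. It assumes access to the gated `gaia-benchmark/GAIA` dataset on Hugging Face, its published column names (`task_id`, `Question`), and the standard `Agent`/`Task`/`Crew` primitives; a real run would also need the tooling (web browsing, file handling, etc.) that GAIA questions require.

```python
# Sketch: answer GAIA validation questions with a minimal crewAI crew and
# save them in the leaderboard's submission format (one JSON object per line).
# Assumes the gated "gaia-benchmark/GAIA" dataset and an LLM API key in the env.
import json

from datasets import load_dataset
from crewai import Agent, Task, Crew

dataset = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

solver = Agent(
    role="General AI Assistant",
    goal="Answer each question as exactly and concisely as possible",
    backstory="You reason step by step, then reply with only the final answer.",
    verbose=False,
)

with open("submission.jsonl", "w") as f:
    for row in dataset:
        task = Task(
            description=row["Question"],
            expected_output="The final answer only, with no explanation.",
            agent=solver,
        )
        result = Crew(agents=[solver], tasks=[task]).kickoff()
        f.write(json.dumps({"task_id": row["task_id"],
                            "model_answer": str(result)}) + "\n")
```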
Ha! Very interesting. Probably won't be able to do this right now, but I would love to see how it would hold up against those :)
First place is currently AutoGen 😉
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.