
The LLM Evaluation Framework

49 deepeval issues

**Describe the bug**
RagasMetric breaks when used with a dataset.

**To Reproduce**
`dataset.evaluate([RagasMetric])`

Errors:

```
TypeError: RagasMetric.a_measure() got an unexpected keyword argument '_show_indicator'
```

and another:

```
    self.llm.set_run_config(run_config)
    ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'GPTModel' object has no...
```
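A minimal reproduction sketch, assuming deepeval's `EvaluationDataset` API; the test case content and the `RagasMetric` import path are assumptions and may vary by version:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.metrics.ragas import RagasMetric  # import path may differ by version
from deepeval.test_case import LLMTestCase

# Hypothetical test case; any content should hit the same code path.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris.",
    expected_output="Paris",
    retrieval_context=["Paris is the capital of France."],
)

dataset = EvaluationDataset(test_cases=[test_case])
# This call raises the TypeError/AttributeError quoted above.
dataset.evaluate([RagasMetric(threshold=0.5)])
```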

**Is your feature request related to a problem? Please describe.** I'm trying to evaluate a local LLM loaded with Exllamav2 using deepeval's support for the [MMLU dataset](https://docs.confident-ai.com/docs/benchmarks-mmlu). Unfortunately the current...
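A sketch of how a local model could be run against the MMLU benchmark, assuming deepeval's `DeepEvalBaseLLM` wrapper and `MMLU` benchmark APIs; the Exllamav2 loading and generation calls are hypothetical placeholders:

```python
from deepeval.benchmarks import MMLU
from deepeval.models import DeepEvalBaseLLM  # import path may differ by version

class ExllamaV2Model(DeepEvalBaseLLM):
    """Hypothetical wrapper around a model loaded with Exllamav2."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # Replace with the real Exllamav2 generation call.
        return self.model.generate(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "ExllamaV2 local model"

benchmark = MMLU()
benchmark.evaluate(model=ExllamaV2Model(my_exllamav2_model))  # my_exllamav2_model is hypothetical
print(benchmark.overall_score)
```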

There is a lot of friction in going from generating goldens to loading them as test cases for evaluation.
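For illustration, a sketch of the manual glue this workflow currently seems to require, assuming goldens expose `input`, `expected_output`, and `context`; `my_llm_app` and the document path are hypothetical:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.generate_goldens_from_docs(document_paths=["docs/handbook.pdf"])  # hypothetical path

# Each golden has to be converted into a runnable test case by hand.
for golden in dataset.goldens:
    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=my_llm_app(golden.input),  # hypothetical application under test
            expected_output=golden.expected_output,
            context=golden.context,
        )
    )
```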

- Implemented YAML-based configuration loading for evaluation settings.
- Added files:
  - `deepeval/metrics/registry.py` to map metric names to class objects
  - `deepeval/metrics/loader.py` to load metrics from YAML and initialize -...
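A minimal sketch of how such a registry and loader might fit together; the module names come from the PR description, but the contents and the YAML schema shown here are assumptions:

```python
# deepeval/metrics/registry.py (sketch)
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

METRIC_REGISTRY = {
    "answer_relevancy": AnswerRelevancyMetric,
    "hallucination": HallucinationMetric,
}

# deepeval/metrics/loader.py (sketch)
import yaml

def load_metrics(path: str) -> list:
    """Instantiate metrics from a YAML file mapping metric names to kwargs.

    Example config:
        metrics:
          answer_relevancy:
            threshold: 0.7
          hallucination: null
    """
    with open(path) as f:
        config = yaml.safe_load(f)
    return [
        METRIC_REGISTRY[name](**(kwargs or {}))
        for name, kwargs in config["metrics"].items()
    ]
```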

See this code:

![Screenshot 2024-05-06 192539](https://github.com/confident-ai/deepeval/assets/108796323/51180459-ea64-44f3-bd18-b44b0b5c2c08)
![Screenshot 2024-05-06 192553](https://github.com/confident-ai/deepeval/assets/108796323/b6158846-5da7-485f-bc5c-4d8a7270616a)

```python
import json
import asyncio
from deepeval.metrics import AnswerRelevancyMetric, SummarizationMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase
from deepeval import assert_test
import...
```

It's my belief that the cache can be dramatically simplified, and made more reliable, by using the Python "diskcache" library. DiskCache handles so much:

- locking
- reliability in the...
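For context, a small sketch of the diskcache API this suggestion has in mind; the cache directory, key scheme, and evaluation call are placeholders:

```python
from diskcache import Cache

# A process-safe, thread-safe cache backed by SQLite on disk.
cache = Cache(".deepeval_cache")

key = "metric:answer_relevancy:test-case-hash"  # hypothetical key scheme
result = cache.get(key)
if result is None:
    result = run_expensive_evaluation()  # hypothetical expensive call
    cache.set(key, result, expire=24 * 60 * 60)  # expire after one day
```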

When I follow the example on this page: https://docs.confident-ai.com/docs/metrics-introduction and try to use Mistral-7B as the evaluation model, I always get this error when running the exact code from the tutorial. **It...
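For reference, the custom-model pattern that docs page describes looks roughly like this when sketched with a Hugging Face Transformers Mistral-7B; the model ID and generation parameters here are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models import DeepEvalBaseLLM  # import path may differ by version

class Mistral7B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer([prompt], return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=100)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
evaluation_model = Mistral7B(model, tokenizer)
```

A metric would then be pointed at the wrapper, e.g. `AnswerRelevancyMetric(model=evaluation_model)`.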

The `deepeval` CLI command currently always exits with exit code 0, even if the tests fail, which makes it hard to handle failed tests in automated workflows/pipelines. For example...
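Until the CLI propagates failures, one possible workaround is to drive the metrics from a plain Python script and set the exit code explicitly; `measure()` and `is_successful()` are documented metric methods, while the test cases below are placeholders:

```python
import sys

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(input="...", actual_output="..."),  # placeholder content
]

metric = AnswerRelevancyMetric(threshold=0.7)
any_failed = False
for test_case in test_cases:
    metric.measure(test_case)
    if not metric.is_successful():
        any_failed = True

# Propagate failure to the CI pipeline explicitly.
sys.exit(1 if any_failed else 0)
```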