Litmus Integration (https://github.com/google/litmus)
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI application development. It provides a robust platform with a user-friendly UI that streamlines building and assessing LLM-powered applications on Google Cloud.
Features
- Automated Test Execution: Submit test runs using pre-defined templates to evaluate responses against golden answers using AI.
- Flexible Test Templates: Define and manage test templates that specify the structure and parameters of your tests. Two template types are available: "Test Run" for single-turn interactions and "Test Mission" for multi-turn interactions in which the LLM generates its own requests.
- User-Friendly Web Interface: Interact with the Litmus platform through a visually appealing and intuitive web interface.
- Detailed Results: View the status, progress, and detailed results of your test runs.
- Advanced Filtering: Filter responses from test runs based on specific JSON paths for in-depth analysis.
- Performance Monitoring: Track the performance of your responses and use AI to identify areas for improvement.
- Multiple LLM Evaluation Methods: Leverage a variety of LLM evaluation methods:
  - Custom LLM Evaluation with Customizable Prompts: Use an LLM to compare actual responses with expected (golden) responses, utilizing flexible prompts tailored to your evaluation needs.
  - Ragas Evaluation: Apply Ragas metrics, including answer relevancy, context recall, context precision, harmfulness, and answer similarity.
  - DeepEval Evaluation: Leverage DeepEval's LLM-based metrics, such as answer relevancy, faithfulness, contextual precision, contextual recall, hallucination, bias, and toxicity (see the standalone sketch after this list).
- Proxy Service for Enhanced LLM Monitoring: Analyze your LLM interactions in greater detail with the optional proxy service, capturing comprehensive logs of requests and responses.
- Cloud Integration: Leverage the power of Google Cloud Platform (Firestore, Cloud Run, BigQuery, Vertex AI) for efficient data storage, execution, and analysis.
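To make the evaluation-method descriptions above concrete, here is a minimal standalone sketch of the kind of check DeepEval performs, run outside of Litmus. The inputs and threshold are made up for illustration, and this is not how Litmus wires up DeepEval internally:

```python
# Standalone DeepEval answer-relevancy check (illustrative only; the inputs
# and threshold are invented, and this is not Litmus's internal integration).
# Running it requires an LLM backend, e.g. an OPENAI_API_KEY in the environment.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A test case pairs the prompt, the model's actual answer, and the
# retrieval context the answer was supposed to be grounded in.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital and largest city of France."],
)

# Score relevancy; the threshold decides pass/fail for this metric.
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.is_successful(), metric.reason)
```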
Enabling Ragas in Litmus
By default, Ragas evaluation is disabled in Litmus. To enable it, you need to modify your test templates:
- Edit your test template: In the Litmus UI, navigate to the "Templates" page and click the "Edit" button next to the template you want to modify.
- Enable Ragas in the "LLM Evaluation Prompt" tab: Check the checkbox for Ragas.
- Save your template: Click the "Update Template" button to save your changes.
Using Ragas
Once Ragas is enabled, Litmus will automatically use it to evaluate LLM responses for test runs that utilize the modified template. The results are embedded within the assessment field of the test case:
```json
{
  "status": "Passed",
  "response": {
    "output": "This is the answer"
  },
  "assessment": {
    "ragas_evaluation": {
      "answer_relevancy": 1.0,
      "context_recall": 1.0,
      "context_precision": 1.0,
      "harmfulness": 0.0,
      "answer_similarity": 1.0
    }
  }
}
```
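Because the scores are plain JSON, reading them back out of a test case is straightforward. The snippet below is a hedged sketch that assumes you have already fetched a result in the shape shown above as a Python dict; the thresholds are arbitrary illustrative choices, not Litmus defaults:

```python
# Hedged sketch: pull Ragas scores out of a test-case result shaped like the
# JSON above. Assumes the result is already loaded as a dict; the thresholds
# are arbitrary illustrative choices, not Litmus defaults.
result = {
    "status": "Passed",
    "response": {"output": "This is the answer"},
    "assessment": {
        "ragas_evaluation": {
            "answer_relevancy": 1.0,
            "context_recall": 1.0,
            "context_precision": 1.0,
            "harmfulness": 0.0,
            "answer_similarity": 1.0,
        }
    },
}

scores = result["assessment"]["ragas_evaluation"]
for name, score in scores.items():
    # harmfulness is "lower is better", unlike the other metrics.
    ok = score <= 0.2 if name == "harmfulness" else score >= 0.8
    print(f"{name}: {score} ({'ok' if ok else 'out of range'})")
```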
Configuring and Extending Ragas
Currently, Litmus utilizes a predefined set of Ragas metrics, including answer relevancy, context recall, context precision, harmfulness, and answer similarity. Extending this set or adjusting metric thresholds would require code modifications within the worker service.
For instance:
- Adding new metrics: To include additional Ragas metrics such as Aspect Critique or Answer Correctness, you would need to add them to the ragas_metrics list in the ragas_eval.py file within the worker service code.
- Adjusting thresholds: To modify the default thresholds for determining pass/fail, you would need to adjust the metric objects within the ragas_eval.py file.
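As a rough illustration of the first change, the ragas_metrics list might be extended along the following lines. This is a hypothetical sketch, not the actual contents of ragas_eval.py, and it assumes a Ragas release (e.g. the 0.1.x API) that exposes these metric objects under these names:

```python
# Hypothetical sketch of extending ragas_metrics in ragas_eval.py.
# Not the actual file contents; assumes a Ragas 0.1.x-style API in which
# these metric objects are importable under these names.
from ragas.metrics import (
    answer_relevancy,
    context_recall,
    context_precision,
    answer_similarity,
    answer_correctness,  # newly added metric
)
from ragas.metrics.critique import harmfulness

ragas_metrics = [
    answer_relevancy,
    context_recall,
    context_precision,
    harmfulness,
    answer_similarity,
    answer_correctness,  # include the new metric in every evaluation run
]
```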
Note: These modifications involve code changes and require rebuilding and redeploying the worker Docker image.
Learn more about Litmus:
- Evaluating with Ragas: https://google.github.io/litmus/evaluate-ragas
- Docs: https://google.github.io/litmus/
- GitHub repository: https://github.com/google/litmus