[Evals] Evaluation docs improvements
Autogenerated from Gemini:
This text reveals several areas where the documentation for Genkit, particularly around evaluation, could be improved:
* **Clarify how evaluators are standardized.** The text acknowledges that while evaluation metrics like Faithfulness and Answer Relevance are becoming standardized, their implementation can vary. The documentation should provide more concrete information on this, perhaps by:
    * Giving specific examples of how implementations can differ.
    * Offering guidance on choosing the best implementation for different use cases.
    * Explaining how Genkit handles these variations to ensure consistency.
* **Provide more guidance on quantifying output variables.** The text mentions that users can define custom evaluation metrics, but it should offer more support on how to do this effectively. Consider adding:
    * Examples of quantifying different types of outputs.
    * Best practices for designing custom metrics.
    * A step-by-step guide to implementing custom evaluators.
* **Expand on the scope of pre-defined evaluators.** Users need a clearer understanding of what metrics like "Maliciousness" actually measure. The documentation should:
    * Provide detailed explanations of each pre-defined metric.
    * Clarify which RAGAS metrics are included in Genkit.
    * Offer examples of how these metrics are used in practice.
* **Improve the description of "Maliciousness".** The current explanation is vague. The documentation should clearly define what constitutes "maliciousness" in the context of LLMs and how the evaluator identifies it.
* **Clarify the analogy to testing.** While the text likens evaluators to E2E testing, it could be more explicit about how they fit into the development process. This could involve:
    * Explaining when and how to use evaluators during development.
    * Providing examples of how evaluators can help identify regressions.
    * Discussing how evaluators can be integrated into a CI/CD pipeline.
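
To make the "quantifying output variables" point concrete, here is a minimal sketch of the kind of example the docs could include. It turns a free-text output into a score between 0 and 1 by measuring how many expected terms it mentions. The `EvalSample` and `Score` types and the `termCoverage` function are hypothetical names made up for illustration; this is not Genkit's evaluator API, and a real guide would need to show how such a function plugs into Genkit.

```typescript
// Hypothetical types for illustration only; NOT Genkit's evaluator API.
interface EvalSample {
  input: string;
  output: string;
  requiredTerms: string[]; // terms a good answer is expected to mention
}

interface Score {
  score: number; // normalized to the range [0, 1]
  rationale: string; // human-readable explanation of the score
}

// Quantifies "coverage": the fraction of required terms found in the output.
function termCoverage(sample: EvalSample): Score {
  if (sample.requiredTerms.length === 0) {
    return { score: 1, rationale: "no required terms" };
  }
  const text = sample.output.toLowerCase();
  const missing = sample.requiredTerms.filter(
    (term) => !text.includes(term.toLowerCase())
  );
  const score = 1 - missing.length / sample.requiredTerms.length;
  const rationale =
    missing.length === 0
      ? "all terms present"
      : `missing: ${missing.join(", ")}`;
  return { score, rationale };
}

// Example: two of the four required terms appear in the output.
const sample: EvalSample = {
  input: "What is on the menu?",
  output: "Today we have a burger and fries.",
  requiredTerms: ["burger", "fries", "salad", "soup"],
};
const result = termCoverage(sample);
console.log(result.score, result.rationale);
```

Returning a rationale alongside the number is a design choice worth recommending in the docs, since a bare score is hard to debug when a metric disagrees with human judgment.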
By addressing these points, the documentation can better support users in understanding and effectively using Genkit's evaluation features.
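
For the regression and CI/CD points, the docs could show something along these lines: compare the mean score per metric from the current run against a stored baseline and flag any metric that dropped by more than a tolerance. This is only a sketch. The `MetricScores` shape, the `findRegressions` helper, the tolerance value, and all scores are assumptions for illustration, not Genkit's actual evaluation output format.

```typescript
// Hypothetical shapes for illustration; not Genkit's evaluation output format.
type MetricScores = Record<string, number[]>;

interface Regression {
  metric: string;
  baseline: number;
  current: number;
}

// Arithmetic mean; an empty list is treated as 0.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Flags metrics whose mean fell more than `tolerance` below the baseline.
function findRegressions(
  baseline: Record<string, number>,
  current: MetricScores,
  tolerance = 0.05
): Regression[] {
  const regressions: Regression[] = [];
  for (const [metric, baselineMean] of Object.entries(baseline)) {
    const scores = current[metric];
    // A metric missing from the current run is scored as 0.
    const currentMean = scores ? mean(scores) : 0;
    if (currentMean < baselineMean - tolerance) {
      regressions.push({ metric, baseline: baselineMean, current: currentMean });
    }
  }
  return regressions;
}

// Made-up numbers: faithfulness dropped, answer relevance did not.
const baseline = { faithfulness: 0.9, answerRelevance: 0.8 };
const currentRun: MetricScores = {
  faithfulness: [0.5, 1, 0.75],
  answerRelevance: [0.75, 1, 0.75],
};
const regressions = findRegressions(baseline, currentRun);
```

A CI step could fail the build whenever `regressions` is non-empty, which is where the analogy to E2E testing becomes practical.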
Context: https://discord.com/channels/1255578482214305893/1281391213550895124/1282325935038926868
Also, making the example code actually compile would be nice.
Made the code compile: https://github.com/firebase/genkit/pull/1497/files