genkit eval changes
Description
This PR uses Genkit evaluation to add a testing framework that checks whether changes to the LLM, or to tool names and descriptions, still yield the same tool calls for the same input. The `current_evaluation.json` file contains the current test run and its pass rates. The Makefile (`make genkit-help`) now explains how the workflow is expected to run, and a README.md documents it as well. Any change to the LLM or to tools/descriptions should keep the same or a better pass rate.
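A minimal sketch of the acceptance rule, assuming that a case "passes" when the model calls exactly the expected tools in the same order. `EvalCase`, `same_tool_calls`, `pass_rate`, `no_regression`, and the tool names (`deploy`, `services`, `logs`, `destroy`) are hypothetical and used only for illustration. They are not Genkit APIs, and they do not reflect the actual schema of `current_evaluation.json`. In the real workflow, the baseline rate would come from that file.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical record: one input plus the tool calls expected and observed for it."""
    input: str
    expected_tools: list[str]
    actual_tools: list[str]


def same_tool_calls(case: EvalCase) -> bool:
    # Assumption: a case passes only if the same tools are called in the same order.
    return case.expected_tools == case.actual_tools


def pass_rate(cases: list[EvalCase]) -> float:
    # Fraction of cases whose observed tool calls match the expected ones.
    if not cases:
        return 0.0
    passed = sum(1 for c in cases if same_tool_calls(c))
    return passed / len(cases)


def no_regression(baseline_rate: float, new_rate: float) -> bool:
    # Acceptance rule: a change must keep the same or a better pass rate.
    return new_rate >= baseline_rate


# Hypothetical run: the third case calls the wrong tool.
cases = [
    EvalCase("deploy my app", ["deploy"], ["deploy"]),
    EvalCase("show my services", ["services"], ["services"]),
    EvalCase("tail the logs", ["logs"], ["deploy"]),
    EvalCase("destroy the project", ["destroy"], ["destroy"]),
]
rate = pass_rate(cases)
print(f"pass rate: {rate:.2f}")  # prints "pass rate: 0.75"
```

With a hypothetical baseline of 0.75, `no_regression(0.75, rate)` accepts this run. A baseline of 0.80 would reject it.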
Linked Issues
fixes #1486
Checklist
- [x] I have performed a self-review of my code
- [ ] I have added appropriate tests
- [ ] I have updated the Defang CLI docs and/or README to reflect my changes, if necessary