Feat: Framework/Module for proper A/B testing of prompts within opencode
The title says it all. I have nothing concrete on this just yet, but it would sure be a great addition. Feel free to share your view on how this should be implemented; I'm stumped at the moment.
not sure what this means!
tbh, I don't either... The idea is that to benchmark prompt changes, whether to parts of OC or to agents made by users, you'd want a proper way to A/B test them against a set of pre-made tasks and context that a user could 'save' and re-use for testing purposes. That would be incredibly valuable.
A testing framework of sorts, so everyone's prompts, including OC's internal prompting, can be fine-tuned using an automated, scientific methodology. Less winging it, better handling of run-to-run variance and such...
That being said, idk what shape this 'thing' should be
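To make the shape a bit more concrete, here is a minimal sketch of what a saved benchmark task and an A/B run config could look like. None of these types or names exist in opencode today; they're purely hypothetical and only illustrate the "saved tasks + repeated runs to average out variance" idea above.

```ts
// Hypothetical shapes for a saved benchmark task and an A/B run config.
// Nothing here is part of opencode; it's a sketch of the proposed feature.

interface BenchmarkTask {
  id: string;                 // e.g. "fix-failing-test"
  instructions: string;       // the user request replayed for every variant
  contextFiles: string[];     // files snapshotted as the starting workspace
  check: (workspace: string) => Promise<boolean>; // pass/fail oracle, e.g. run the test suite
}

interface PromptVariant {
  name: string;               // e.g. "baseline" or "experiment-a"
  systemPrompt: string;       // the prompt text under test
}

interface ABRunConfig {
  tasks: BenchmarkTask[];
  variants: PromptVariant[];
  runsPerVariant: number;     // repeat each task to smooth out run-to-run variance
}
```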
I think this is a great idea. A benchmark to actually visualize how the prompts affect speed and code quality. It would also let you benchmark different agent orchestrations.
I was thinking of predefined prompts that ALWAYS force the same expected output, and then you could produce a score based on time, correctness, etc. Internally, you could expose this as /benchmark.
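As a rough sketch of what a /benchmark scoring loop could do, reusing the hypothetical BenchmarkTask/PromptVariant/ABRunConfig shapes from the sketch above: run every variant over every task N times, then report pass rate and mean wall-clock time per variant. The `runAgent` function is assumed, not a real opencode API.

```ts
// Hypothetical scoring loop for a /benchmark command. `runAgent` is assumed
// to run one task with a given system prompt and report pass/fail plus timing.

interface RunResult {
  passed: boolean;
  durationMs: number;
}

declare function runAgent(task: BenchmarkTask, variant: PromptVariant): Promise<RunResult>;

async function benchmark(config: ABRunConfig): Promise<void> {
  for (const variant of config.variants) {
    let passes = 0;
    let totalMs = 0;
    let runs = 0;
    for (const task of config.tasks) {
      for (let i = 0; i < config.runsPerVariant; i++) {
        const result = await runAgent(task, variant);
        if (result.passed) passes++;
        totalMs += result.durationMs;
        runs++;
      }
    }
    // Report pass rate and mean duration per variant so prompts can be compared.
    console.log(
      `${variant.name}: pass rate ${((100 * passes) / runs).toFixed(1)}%, ` +
        `mean time ${(totalMs / runs / 1000).toFixed(1)}s over ${runs} runs`
    );
  }
}
```

Aggregating over multiple runs per variant is what would address the run-to-run variance concern raised earlier; a single run per prompt wouldn't tell you much.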