[Insights & Questions] About MIPROv2
First of all, thanks for the amazing package! I ran a sharing session with the senior tech managers at my company, and they were quite convinced. I have some interesting insights to share and some questions to discuss.
Background
My task is quite simple: I need to optimize a signature for intent classification (only a couple of classes).
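For concreteness, the signature looks roughly like the sketch below; the class name, field names, and labels are illustrative placeholders, not the ones from my actual task:

```python
import dspy

# Illustrative intent-classification signature; names and labels are placeholders.
class ClassifyIntent(dspy.Signature):
    """Classify the intent of a user message."""

    message: str = dspy.InputField()
    intent: str = dspy.OutputField(desc="one of: 'billing', 'support', 'other'")

classifier = dspy.Predict(ClassifyIntent)
```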
Maybe I can contribute an example use case, since text classification is one of the most common everyday applications? Unfortunately, I found that the link to https://github.com/stanfordnlp/dspy/blob/main/CONTRIBUTING.md is broken...
I've run some simple ablation studies:
- Optimizer: MIPROv2 zero-shot; LM: gpt-4o-mini or gpt-4o
- Optimizer: MIPROv2 few-shot; LM: gpt-4o-mini or gpt-4o
After optimization, I ran inference with the optimized prompts using gpt-4o-mini or gpt-4o and evaluated the results, with accuracy as the metric.
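Roughly, the two settings map onto MIPROv2's options as in the sketch below. This is not my exact script: `auto`, `max_bootstrapped_demos`, and `max_labeled_demos` are the relevant knobs in recent DSPy releases but may differ in older ones, and the dataset and metric here are toy placeholders.

```python
import dspy
from dspy.teleprompt import MIPROv2

# classifier is the dspy.Predict module from the signature sketch above.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # or "openai/gpt-4o"

def accuracy(example, pred, trace=None):
    # Exact-match accuracy on the predicted intent label.
    return example.intent == pred.intent

# Toy placeholder data; the real runs used a proper labeled intent dataset.
trainset = [
    dspy.Example(message="Why was I charged twice?", intent="billing").with_inputs("message"),
    dspy.Example(message="The app crashes on startup", intent="support").with_inputs("message"),
]
devset = trainset  # placeholder; use a held-out split in practice

# Zero-shot setting: forbid demos so that only the instruction is optimized.
zero_shot = MIPROv2(metric=accuracy, auto="light").compile(
    classifier, trainset=trainset,
    max_bootstrapped_demos=0, max_labeled_demos=0,
)

# Few-shot setting: let the optimizer bootstrap and select demos as well.
few_shot = MIPROv2(metric=accuracy, auto="light").compile(
    classifier, trainset=trainset,
    max_bootstrapped_demos=4, max_labeled_demos=4,
)

evaluate = dspy.Evaluate(devset=devset, metric=accuracy, display_progress=True)
print("zero-shot:", evaluate(zero_shot), "few-shot:", evaluate(few_shot))
```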
Insights and Questions
- Generally speaking, MIPROv2 few-shot yields better results than MIPROv2 zero-shot.
I was a bit confused here, because the final optimized prompt from MIPROv2 few-shot actually didn't contain any few-shot examples, regardless of the LM used. The two settings yielded almost the same results when running inference with gpt-4o-mini, but MIPROv2 few-shot was significantly better than MIPROv2 zero-shot when running inference with gpt-4o.
This is quite strange: if there are no few-shot examples in the final prompt, why would optimizing under the few-shot setting produce better instructions?
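If I understand the optimizer correctly, one possible explanation is that MIPROv2 grounds its instruction proposals in the bootstrapped demonstrations, so the demos can shape the instruction candidates even when the final selected program keeps zero demos; I'd appreciate a confirmation. Either way, what was actually selected is easy to inspect (a sketch, reusing `few_shot` from the setup above):

```python
# Inspect what MIPROv2 selected for each predictor in the compiled program:
# the instruction text and the number of attached demos.
for name, predictor in few_shot.named_predictors():
    print(name)
    print("  instruction:", predictor.signature.instructions)
    print("  num demos:", len(predictor.demos))  # 0 in my runs, despite few-shot optimization
```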
- When running inference with gpt-4o-mini: using gpt-4o-mini for prompt optimization yields significantly better results than using gpt-4o, regardless of the optimizer setting.
- When running inference with gpt-4o: under the zero-shot setting, using gpt-4o-mini for prompt optimization is significantly better; under the few-shot setting, using gpt-4o-mini or gpt-4o for prompt optimization yields similar results.
This could be a significant finding, as it means we can potentially use a more cost-effective model for prompt optimization and still get prompts that perform well when applied to a much larger model.
My guess is that instructions clear enough for a small model are naturally clear for a larger model too, but not the reverse. I'm not sure whether that reasoning holds; I plan to continue the study with Llama 3.2 to gain more insights.
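If the cheap-optimizer finding holds up, MIPROv2's `prompt_model`/`task_model` split expresses this setup directly: propose instructions with a small model while optimizing for (and deploying on) a larger one. A sketch, reusing the placeholder `accuracy`, `classifier`, and `trainset` from the earlier sketches:

```python
import dspy
from dspy.teleprompt import MIPROv2

cheap = dspy.LM("openai/gpt-4o-mini")  # model that proposes instruction candidates
large = dspy.LM("openai/gpt-4o")       # model the program will actually run on

dspy.configure(lm=large)

optimizer = MIPROv2(
    metric=accuracy,      # placeholder metric from the sketch above
    prompt_model=cheap,   # instruction-proposal model
    task_model=large,     # model the prompt is optimized against
    auto="light",
)
optimized = optimizer.compile(
    classifier, trainset=trainset,  # placeholders from the sketches above
    max_bootstrapped_demos=0, max_labeled_demos=0,
)
```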