[Insights & Questions] About MIPROv2
First of all, thanks for the amazing package! I ran a sharing session with the senior tech managers at my company, and they were quite convinced. I have some interesting insights to share and some questions to discuss.
Background
My task is quite simple: I need to optimize a signature for intent classification (only a couple of classes).
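For concreteness, the signature looks roughly like the sketch below; the class name, field names, and labels are illustrative placeholders, not the ones from my actual task:

```python
import dspy

# Illustrative intent-classification signature; names and labels are placeholders.
class ClassifyIntent(dspy.Signature):
    """Classify the intent of a user message."""

    message: str = dspy.InputField()
    intent: str = dspy.OutputField(desc="one of: 'billing', 'support', 'other'")

classifier = dspy.Predict(ClassifyIntent)
```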
Maybe I can contribute an example use case, since text classification is one of the most common everyday applications? Unfortunately, I found that the link to https://github.com/stanfordnlp/dspy/blob/main/CONTRIBUTING.md is broken...
I've run some simple ablation studies:
- Optimizer: MIPROv2 zero-shot; LM: gpt-4o-mini or gpt-4o
- Optimizer: MIPROv2 few-shot; LM: gpt-4o-mini or gpt-4o
After optimization, I ran inference with the optimized prompts using gpt-4o-mini or gpt-4o and evaluated the results, with accuracy as the metric.
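Roughly, the two settings map onto MIPROv2's options as in the sketch below. This is not my exact script: `auto`, `max_bootstrapped_demos`, and `max_labeled_demos` are the relevant knobs in recent DSPy releases but may differ in older ones, and the dataset and metric here are toy placeholders.

```python
import dspy
from dspy.teleprompt import MIPROv2

# classifier is the dspy.Predict module from the signature sketch above.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # or "openai/gpt-4o"

def accuracy(example, pred, trace=None):
    # Exact-match accuracy on the predicted intent label.
    return example.intent == pred.intent

# Toy placeholder data; the real runs used a proper labeled intent dataset.
trainset = [
    dspy.Example(message="Why was I charged twice?", intent="billing").with_inputs("message"),
    dspy.Example(message="The app crashes on startup", intent="support").with_inputs("message"),
]
devset = trainset  # placeholder; use a held-out split in practice

# Zero-shot setting: forbid demos so that only the instruction is optimized.
zero_shot = MIPROv2(metric=accuracy, auto="light").compile(
    classifier, trainset=trainset,
    max_bootstrapped_demos=0, max_labeled_demos=0,
)

# Few-shot setting: let the optimizer bootstrap and select demos as well.
few_shot = MIPROv2(metric=accuracy, auto="light").compile(
    classifier, trainset=trainset,
    max_bootstrapped_demos=4, max_labeled_demos=4,
)

evaluate = dspy.Evaluate(devset=devset, metric=accuracy, display_progress=True)
print("zero-shot:", evaluate(zero_shot), "few-shot:", evaluate(few_shot))
```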
Insights and Questions
- Generally speaking, MIPROv2 few-shot yields better results than MIPROv2 zero-shot.
I was a bit confused here, because the final optimized prompt from MIPROv2 few-shot actually didn't contain any few-shot examples, regardless of the LM used. The two settings yielded almost the same results when running inference with gpt-4o-mini, but MIPROv2 few-shot was significantly better than MIPROv2 zero-shot when running inference with gpt-4o.
This is quite strange: if there are no few-shot examples in the final prompt, why would optimizing under the few-shot setting produce better instructions?
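If I understand the optimizer correctly, one possible explanation is that MIPROv2 grounds its instruction proposals in the bootstrapped demonstrations, so the demos can shape the instruction candidates even when the final selected program keeps zero demos; I'd appreciate a confirmation. Either way, what was actually selected is easy to inspect (a sketch, reusing `few_shot` from the setup above):

```python
# Inspect what MIPROv2 selected for each predictor in the compiled program:
# the instruction text and the number of attached demos.
for name, predictor in few_shot.named_predictors():
    print(name)
    print("  instruction:", predictor.signature.instructions)
    print("  num demos:", len(predictor.demos))  # 0 in my runs, despite few-shot optimization
```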
- When running inference with gpt-4o-mini: using gpt-4o-mini for prompt optimization yields significantly better results than using gpt-4o, regardless of the optimizer setting.
- When running inference with gpt-4o: under the zero-shot setting, using gpt-4o-mini for prompt optimization is significantly better; under the few-shot setting, using gpt-4o-mini or gpt-4o for prompt optimization yields similar results.
This could be a significant finding, as it means we can potentially use a more cost-effective model for prompt optimization and still get prompts that perform well when applied to a much larger model.
My guess is that instructions clear enough for a small model are naturally clear for a larger model too, but not the reverse. I'm not sure whether that reasoning holds; I plan to continue the study with Llama 3.2 to gain more insights.
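If the cheap-optimizer finding holds up, MIPROv2's `prompt_model`/`task_model` split expresses this setup directly: propose instructions with a small model while optimizing for (and deploying on) a larger one. A sketch, reusing the placeholder `accuracy`, `classifier`, and `trainset` from the earlier sketches:

```python
import dspy
from dspy.teleprompt import MIPROv2

cheap = dspy.LM("openai/gpt-4o-mini")  # model that proposes instruction candidates
large = dspy.LM("openai/gpt-4o")       # model the program will actually run on

dspy.configure(lm=large)

optimizer = MIPROv2(
    metric=accuracy,      # placeholder metric from the sketch above
    prompt_model=cheap,   # instruction-proposal model
    task_model=large,     # model the prompt is optimized against
    auto="light",
)
optimized = optimizer.compile(
    classifier, trainset=trainset,  # placeholders from the sketches above
    max_bootstrapped_demos=0, max_labeled_demos=0,
)
```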