
Added more details to the discussion of optimizers/teleprompters.

Open rpgoldman opened this pull request 1 year ago • 5 comments

Add class diagrams for the teleprompters, to be included in the documentation.

Expanded the discussion to try to clarify what the various optimizers do.

rpgoldman avatar May 01 '24 22:05 rpgoldman

This pull request needs a bit of screening. In particular,

  1. there are comments inline in 6-optimizers.md which indicate places where I was not sure that I was explaining correctly. Those should be corrected if necessary, and then the comments removed.
  2. There's a minor FIXME in optuna that could be taken care of (or, if it's wrong, could just be removed).
  3. There's another minor FIXME in bootstrap.py
  4. There's a "QUESTION:" in bootstrap.py that shows a place where I got puzzled. It might be that the variables could be given better names, or it might be that I'm just missing something.

Finally, I don't know how docusaurus works so the method I used to put an image into 6-bootstrap.md might have been wrong.

rpgoldman avatar May 01 '24 22:05 rpgoldman

Hi @rpgoldman , thanks for the contributions to the documentation. These are much needed!

The diagram is great. Could you remove the other image formats and add just the png to the relevant documentation as is done for images like here:

I made a pass over the documentation and made some corrections. Also removed the in-line comments/questions and moved them here. Feel free to follow up on any if needed.

TBQH, I don't understand how Optuna does this. As far as I can tell it simply chooses the best candidate based on multiple evaluations, rather than a single one, and the mention of "hyperparameters" seems to be a red herring.

Optuna is similar to the BootstrapFewShotWithRandomSearch optimizer, simply replacing the random search with Optuna's objective optimization: the candidate score is treated as the variable to optimize for each candidate program, run over a set of trials. The resulting compiled program mirrors the automatic selection of few-shot examples in the prompt.
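
For concreteness, here is a rough sketch of how this optimizer might be invoked. The class and argument names follow `dspy.teleprompt`, but treat the exact signatures as assumptions that may differ across versions; `program` and `trainset` are placeholders defined elsewhere.

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithOptuna

def validate_answer(example, pred, trace=None):
    # Toy metric: exact match on the answer field.
    return example.answer.lower() == pred.answer.lower()

optimizer = BootstrapFewShotWithOptuna(
    metric=validate_answer,
    max_bootstrapped_demos=3,
    num_candidate_programs=8,  # number of Optuna trials over candidate demo sets
)

# Each trial scores one candidate program; Optuna keeps the best-scoring one.
compiled_program = optimizer.compile(student=program, trainset=trainset, max_demos=3)
```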

I'm not at all sure that this is right. I couldn't follow the KNN code in the repo, so I just assumed that dspy was trying to cover the space of possible examples by picking centers of different clusters.

The KNNFewShot optimizer essentially clusters the provided set of training examples and applies the fewshot optimization given this example clustering. Feel free to check out this reference notebook for how to use it! https://github.com/stanfordnlp/dspy/blob/733a12784008f56ccd9f0f2d1393cef1161b3c6a/examples/knn.ipynb#L141
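
For reference, usage looks roughly like the following (a sketch based on the linked notebook; the signatures are assumptions that may vary by version, and `program` and `trainset` are placeholders):

```python
from dspy.teleprompt import KNNFewShot

# For each input, the k nearest training examples (by embedding similarity)
# are retrieved and used as the candidate pool for few-shot bootstrapping.
knn_optimizer = KNNFewShot(k=3, trainset=trainset)
compiled_program = knn_optimizer.compile(student=program)
```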

Wouldn't it make sense to simply use LabeledFewShot with k set to use all of the demos?

This may lead to some overfitting. BootstrapFewShot in fact covers this with max_labeled_demos, but it also provides bootstrapped examples from the model to offer more model-representative behavior in the compiled prompt. With fewer examples, it may make more sense to use a larger model at compile time to get more accurate bootstrapped examples, and then use a smaller model at inference time with this learned behavior. A sketch of this pattern follows.
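
A minimal sketch of that large-teacher/small-student pattern, assuming the BootstrapFewShot and LabeledFewShot classes from `dspy.teleprompt`; the model names, metric, and `program`/`trainset` are placeholders:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot, LabeledFewShot

def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

small_lm = dspy.OpenAI(model="gpt-3.5-turbo")  # inference-time student
big_lm = dspy.OpenAI(model="gpt-4")            # compile-time teacher
dspy.settings.configure(lm=small_lm)

# The simple alternative from the question above: labeled demos only.
labeled_program = LabeledFewShot(k=16).compile(student=program, trainset=trainset)

# BootstrapFewShot mixes labeled demos with demos self-generated by the teacher.
optimizer = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=3,        # self-generated demos
    max_labeled_demos=3,             # demos taken directly from the trainset
    teacher_settings=dict(lm=big_lm),
)
compiled_program = optimizer.compile(student=program, trainset=trainset)
```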

The following example says that "we want to "bootstrap" (i.e., self-generate) 8-shot examples of your program's steps." But won't it actually give 6 demonstrations, 3 taken from the examples (max_labeled_demos=3) and 3 self-generated (max_bootstrapped_demos=3)? Also, aren't the defaults of 16 labeled + 4 bootstrapped, for a total of 20-shot prompting, awfully high?

Fixed the typo. The defaults are just configurations used during experiments in the paper and may not be appropriate for all use cases, hence left configurable and as maximums.

QUESTION: What is the meaning of self.validation and self.valset? Why is it that valset overrides validation if it is supplied? What is the relationship between the valset parameter to compile and the trainset parameter? I note that none of the examples in the docs seem to use this parameter.

valset is simply for when you have a validation split of the trainset that you would like to optimize the program on; it can be particularly useful in BootstrapFewShotWithRandomSearch when determining scores over a set of candidates. This is not to be confused with an "evalset"! It is left as an optional parameter in case the user wants to supply this validation split explicitly; otherwise the optimization takes care of it with a randomized selection of train examples to bootstrap and validate on.
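
Concretely, usage might look like this (a sketch assuming BootstrapFewShotWithRandomSearch; the split, metric, and `program`/`examples` names are placeholders):

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

trainset, valset = examples[:80], examples[80:]  # hypothetical manual split

optimizer = BootstrapFewShotWithRandomSearch(
    metric=validate_answer,
    num_candidate_programs=8,
)
# Candidates are bootstrapped from `trainset` and scored on `valset`;
# if `valset` is omitted, a random subset of train examples is used instead.
compiled_program = optimizer.compile(student=program, trainset=trainset, valset=valset)
```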

arnavsinghvi11 avatar May 05 '24 00:05 arnavsinghvi11

Hi @rpgoldman , thanks for the contributions to the documentation. These are much needed!

The diagram is great. Could you remove the other image formats and add just the png to the relevant documentation as is done for images like here:

Would it be OK to retain the .dot file, since that is the source of all the other formats? Would it help to put a comment into the Markdown file to explain how I generated the dot file? I've removed the pdf file for now.

I made a pass over the documentation and made some corrections. Also removed the in-line comments/questions and moved them here. Feel free to follow up on any if needed.

I marked 2 comments that weren't questions, just explanations, that I thought might be worth maintaining.

TBQH, I don't understand how Optuna does this. As far as I can tell it simply chooses the best candidate based on multiple evaluations, rather than a single one, and the mention of "hyperparameters" seems to be a red herring.

Optuna is similar to the BootstrapFewShotWithRandomSearch optimizer, simply replacing the random search with Optuna's objective optimization: the candidate score is treated as the variable to optimize for each candidate program, run over a set of trials. The resulting compiled program mirrors the automatic selection of few-shot examples in the prompt.

See my comment on the file -- even though Optuna is a hyperparameter optimizer, it doesn't look like dspy uses it for that purpose here. It looks like it's just optimizing the choice of examples, which isn't a hyperparameter.

I'm not at all sure that this is right. I couldn't follow the KNN code in the repo, so I just assumed that dspy was trying to cover the space of possible examples by picking centers of different clusters.

The KNNFewShot optimizer essentially clusters the provided set of training examples and applies the fewshot optimization given this example clustering. Feel free to check out this reference notebook for how to use it!

It wasn't clear to me what the purpose of the clustering was. That's what I was trying to explain -- does dspy use the clusters as I suggested, to make sure that the space is covered by choosing elements from different clusters, instead of choosing a bunch of examples from a single cluster?

QUESTION: What is the meaning of self.validation and self.valset? Why is it that valset overrides validation if it is supplied? What is the relationship between the valset parameter to compile and the trainset parameter? I note that none of the examples in the docs seem to use this parameter.

valset is simply for when you have a validation split of the trainset that you would like to optimize the program on; it can be particularly useful in BootstrapFewShotWithRandomSearch when determining scores over a set of candidates. This is not to be confused with an "evalset"! It is left as an optional parameter in case the user wants to supply this validation split explicitly; otherwise the optimization takes care of it with a randomized selection of train examples to bootstrap and validate on.

One thing I still don't understand is why the term valset is used for the argument instead of devset. I will see about tweaking the docstring to clarify according to your explanation, but it might be helpful to say why this new term is introduced.

rpgoldman avatar May 05 '24 16:05 rpgoldman

P.S. I don't know what the Ruff fix is, I'm afraid. If there's a pointer somewhere that explains it, please let me know.

rpgoldman avatar May 05 '24 16:05 rpgoldman

The KNNFewShot optimizer essentially clusters the provided set of training examples and applies the fewshot optimization given this example clustering. Feel free to check out this reference notebook for how to use it!

https://github.com/stanfordnlp/dspy/blob/733a12784008f56ccd9f0f2d1393cef1161b3c6a/examples/knn.ipynb#L141

This notebook refers to "kNN Few-Shot":

This notebook shows how KNN few-shot can be implemented...

I figure it would help to add a reference. Do you know if this article is what that refers to? If so, I could add that link to the notebook in this PR.

rpgoldman avatar May 05 '24 18:05 rpgoldman

Would it be OK to retain the .dot file, since that is the source of all the other formats? Would it help to put a comment into the Markdown file to explain how I generated the dot file? I've removed the pdf file for now.

Is it important for the documentation to keep the .dot file? I think it would be best to only include final-product images in the repo, as is done with other documentation (https://github.com/stanfordnlp/dspy/blob/733a12784008f56ccd9f0f2d1393cef1161b3c6a/docs/docs/deep-dive/data-handling/built-in-datasets.mdx#L62). (We'd like to avoid adding too many non-code-related files besides the hosted dspy-docs subtree.)

P.S. I don't know what the Ruff fix is

Running `ruff check . --fix-only` and pushing will fix it!

It wasn't clear to me what the purpose of the clustering was. That's what I was trying to explain -- does dspy use the clusters as I suggested, to make sure that the space is covered by choosing elements from different clusters, instead of choosing a bunch of examples from a single cluster.

Yes, DSPy uses the KNN technique to pick a diverse set of examples from different clusters and then optimizes using FewShot with the examples pre-selected by KNN (making the bootstrapping process stronger). This is most useful when there's a lot of data over varied spaces, where using KNN helps optimize the trainset used for BootstrapFewShot (related to #77). The notebook details this with an example of DSPy KNN few-shot.

One thing I still don't understand is why the term valset is used for the argument instead of devset. I will see about tweaking the docstring to clarify according to your explanation, but it might be helpful to say why this new term is introduced.

I think this is also a bit semantics-related and can remain unchanged for now, unless there is a strong reason to change otherwise (and will likely need refactoring across the rest of the repo if so).

arnavsinghvi11 avatar May 05 '24 23:05 arnavsinghvi11

Would it be OK to retain the .dot file, since that is the source of all the other formats? Would it help to put a comment into the Markdown file to explain how I generated the dot file? I've removed the pdf file for now.

Is it important for the documentation to keep the .dot file? I think it would be best to only include final-product images in the repo, as is done with other documentation (https://github.com/stanfordnlp/dspy/blob/733a12784008f56ccd9f0f2d1393cef1161b3c6a/docs/docs/deep-dive/data-handling/built-in-datasets.mdx#L62). (We'd like to avoid adding too many non-code-related files besides the hosted dspy-docs subtree.)

Done!

P.S. I don't know what the Ruff fix is

Running `ruff check . --fix-only` and pushing will fix it!

Done! I see now that it's a linter.

rpgoldman avatar May 06 '24 14:05 rpgoldman

I added a comment to the markdown to explain the process of generating the class hierarchy figure, so that it can be updated later.

rpgoldman avatar May 06 '24 14:05 rpgoldman

One thing I still don't understand is why the term valset is used for the argument instead of devset. I will see about tweaking the docstring to clarify according to your explanation, but it might be helpful to say why this new term is introduced.

I think this is also a bit semantics-related and can remain unchanged for now, unless there is a strong reason to change otherwise (and will likely need refactoring across the rest of the repo if so).

I think it would be best to simply note this deviation from the otherwise standard use of "devset" somewhere in the documentation. If one wanted to do more, I'd say just introduce devset as an alternative parameter name, and bind valset to the value of the devset parameter if supplied, as sketched below. In the best of all possible worlds, I'd suggest trying to make the usage consistent across the library, but this is only a minor point.
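
Purely hypothetical illustration of that aliasing (not actual dspy code):

```python
def compile(self, student, *, trainset, valset=None, devset=None, **kwargs):
    # Accept the familiar name as an alias; `valset` still wins if both are given.
    if valset is None and devset is not None:
        valset = devset
    ...  # proceed with the existing compile logic
```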

rpgoldman avatar May 06 '24 14:05 rpgoldman

If you are happy with what's there now, I think it's ok to merge.

rpgoldman avatar May 06 '24 14:05 rpgoldman

@arnavsinghvi11 Your explanation of KNN was very helpful; I pulled a couple of sentences into the Markdown.

rpgoldman avatar May 06 '24 14:05 rpgoldman

P.S. The pointer to the KNN notebook probably should go somewhere else, but I suggest keeping it here until there's a page for the KNN optimizer added to the Teleprompters/Optimizers section of the "Deep Dive."

rpgoldman avatar May 06 '24 14:05 rpgoldman

Sorry, forgot to clear the "Draft" flag.

rpgoldman avatar May 07 '24 03:05 rpgoldman

Thanks @rpgoldman for this amazing PR on documentation. I left a small comment for you to give yourself credit for the PNG, and then this should be ready to merge. (I kept the comments you had for generating the PNG since they make sense for that process, but let me know if you want to remove them before merging.)

arnavsinghvi11 avatar May 11 '24 17:05 arnavsinghvi11

Thanks @rpgoldman !

arnavsinghvi11 avatar May 11 '24 21:05 arnavsinghvi11