tree-of-thought-prompting
Test dataset of questions to score reasoning
This does seem to improve prompting a lot, although a single question may not be very representative of the whole approach. To evaluate the suggested solutions properly, shall we create a test dataset of questions and score the answers we get from each prompt?
A test dataset would be a great idea.
There are quite a few frameworks for evaluating LLMs available now, for example https://github.com/openai/human-eval.
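Even before picking a framework, a rough sketch of the evaluation loop could look something like the snippet below. Everything here is a placeholder: `ask_model` stands in for whatever model call you use, the prompt templates and the single example question are just illustrative, and substring matching on the expected answer is a deliberately naive scoring rule.

```python
# Sketch: score several prompt templates against a shared question set.
# Replace ask_model with a real LLM call and DATASET with the actual test set.

PROMPTS = {
    "standard": "Answer the question.\n\nQ: {question}\nA:",
    "tree_of_thought": (
        "Imagine three different experts are answering this question.\n"
        "All experts write down one step of their thinking, then share it\n"
        "with the group, then everyone goes on to the next step.\n"
        "If any expert realises they're wrong at any point, they leave.\n\n"
        "Q: {question}\nA:"
    ),
}

# Tiny illustrative dataset; in practice this would be loaded from a file.
DATASET = [
    {
        "question": (
            "Bob is in the living room. He walks to the kitchen, carrying a cup. "
            "He puts a ball in the cup and carries the cup to the bedroom. "
            "He turns the cup upside down, then walks to the garden. "
            "Where is the ball?"
        ),
        "answer": "bedroom",
    },
]


def ask_model(prompt: str) -> str:
    """Placeholder: swap in an actual call to your model of choice."""
    return ""  # dummy reply so the sketch runs end to end


def score(template: str) -> float:
    """Fraction of questions whose expected answer appears in the reply."""
    correct = 0
    for item in DATASET:
        reply = ask_model(template.format(question=item["question"]))
        if item["answer"].lower() in reply.lower():
            correct += 1
    return correct / len(DATASET)


if __name__ == "__main__":
    for name, template in PROMPTS.items():
        print(f"{name}: {score(template):.0%}")
```

Once there are enough questions, the same loop would give a per-prompt accuracy figure, which is all we really need to compare the plain prompt against the tree-of-thought one.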