
TypedChainOfThought: Too many retries trying to get the correct output format.

Open · wtf-is-flying opened this issue · 3 comments

Hello everyone!

I want to build a metric that compares generated guidelines to reference guidelines and outputs a score.

The inputs are extracted_guidelines and generated_guidelines. Importantly, these are lists of strings.

The output is a score between 0 and 1 evaluating how closely the extracted guidelines match the generated ones.

Here is my implementation:

class SimpleMetricSignature(dspy.Signature):
    """You are given guidelines that a student extracted from a text.
    A teacher wrote the correct guidelines based on the text.
    You goal is to compute a score between 0 and 1 assessing the relevance of guidelines extracted by a student from a text by comparing them to the correct guidelines.
    If both extracted and correct guidelines are empty, return 1.
    If extracted_guidelines is not empty and correct_guidelines is empty, return 0."""

    extracted_guidelines: list = dspy.InputField(desc="the extracted guidelines, reformulated as questions")
    correct_guidelines: list = dspy.InputField(desc="the correct guidelines to compare the extracted guidelines to")
    answer: float = dspy.OutputField(desc="Must be a float in decimal form between 0 and 1", ge=0., le=1.)

class BaseSimpleMetric(dspy.Module):
    def __init__(self):
        super().__init__()
        self.compute_metric = dspy.TypedChainOfThought(SimpleMetricSignature)
        
    def forward(self, gold, pred) -> dspy.Prediction:
        return self.compute_metric(
            extracted_guidelines = pred.questions,
            correct_guidelines = gold.questions
        )
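
For context, this is roughly how the metric is meant to be called; gold and pred both expose a questions field, and the example values below are purely illustrative:

metric = BaseSimpleMetric()

# Illustrative inputs: gold carries the teacher's guidelines, pred the student's.
gold = dspy.Example(questions=["Is the report signed?", "Is a date present?"])
pred = dspy.Prediction(questions=["Is the report signed?"])

score = metric(gold, pred).answer  # a float between 0 and 1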

As my inputs aren't text but lists, I have to use a TypedChainOfThought to handle them correctly, otherwise I get the following error: AssertionError: Need format_handler for extracted_guidelines of type <class 'list'>.

I then want to optimize my metric. So, first, I evaluate it on my test set. Unfortunately, I get the following error: Too many retries trying to get the correct output format. Try simplifying the requirements.

When inspecting the output, I get the following:

You are given guidelines that a student extracted from a text.
    A teacher wrote the correct guidelines based on the text.
    You goal is to compute a score between 0 and 1 assessing the relevance of guidelines extracted by a student from a text by comparing them to the correct guidelines.
    If both extracted and correct guidelines are empty, return 1.
    If extracted_guidelines is not empty and correct_guidelines is empty, return 0.
---
Follow the following format.
Extracted Guidelines: the extracted guidelines, reformulated as questions
Correct Guidelines: the correct guidelines to compare the extracted guidelines to
Past Error in Answer: An error to avoid in the future
Past Error (2) in Answer: An error to avoid in the future
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: Must be a float in decimal form between 0 and 1 (Respond with a single float value)
---
Extracted Guidelines: [a list of str]

Correct Guidelines: [a list of str]

Past Error in Answer: ValueError("could not convert string to float: '13/14 = 0.9286\\n\\nAnswer: 0.93'")

Past Error (2) in Answer: ValueError("could not convert string to float: '13/14 = 0.9286\\n\\nAnswer: 0.93'")

Reasoning: Let's think step by step in order to Reasoning: Let's think step by step in order to produce the answer. We need to compare the extracted guidelines to the correct guidelines and determine how many of the extracted guidelines match or closely align with the correct guidelines.
1. Compare each extracted guideline to the correct guidelines to see if they match or are closely related.
2. Count the number of matches or closely related guidelines.
3. Divide the number of matches by the total number of correct guidelines to get the relevance score.
Let's compare them:
[here the LLM is comparing the guidelines one by one]

Out of 14 extracted guidelines, 13 match or closely align with the correct guidelines.
Answer: 13/14 = 0.9286
Answer: 0.93

The problem here is that the LLM (I'm using GPT-4o) is outputting the answer twice:

  • once using a fractional form "13/14 = 0.9286"
  • once giving the final answer in decimal form

It appears that TypedChainOfThought does not handle this case very well, which leads to an error.

I tried some manual tuning (like adding a "computation_details" field) but without success.

How could I address this problem? Should I try to use the standard ChainOfThought instead?
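
(If I switched, I imagine it would look roughly like the sketch below; the string signature and the manual float parsing are my guess at what I would have to add myself, not something DSPy does for me.)

import re

class UntypedSimpleMetric(dspy.Module):
    def __init__(self):
        super().__init__()
        # Untyped ChainOfThought with a string signature: all fields are plain text.
        self.compute_metric = dspy.ChainOfThought(
            "extracted_guidelines, correct_guidelines -> answer"
        )

    def forward(self, gold, pred) -> dspy.Prediction:
        result = self.compute_metric(
            extracted_guidelines=str(pred.questions),
            correct_guidelines=str(gold.questions),
        )
        # Pull the last number out of whatever the LM wrote and clamp it to [0, 1].
        numbers = re.findall(r"\d*\.\d+|\d+", result.answer)
        score = float(numbers[-1]) if numbers else 0.0
        return dspy.Prediction(answer=min(max(score, 0.0), 1.0))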

wtf-is-flying · May 23 '24 10:05

How about using LabeledFewShot or similar teleprompter/optimizer? https://dspy-docs.vercel.app/docs/building-blocks/optimizers#what-dspy-optimizers-are-currently-available

That way the LLM has examples to show it how it is supposed to output the float.
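
Roughly like this, assuming you have a handful of labeled dspy.Example objects for the metric (metric_examples below is a placeholder name):

from dspy.teleprompt import LabeledFewShot

# Attach up to k labeled demos to the metric's predictor so the LM sees
# exactly how the float answer should be formatted.
teleprompter = LabeledFewShot(k=4)
compiled_metric = teleprompter.compile(BaseSimpleMetric(), trainset=metric_examples)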

tom-doerr · May 24 '24 22:05

Hi Tom,

Thank you for your answer. In the meantime I managed to optimize my metric without error by slightly modifying my signature to the following:

class SimpleMetricSignature(dspy.Signature):
    """You are given guidelines that a student extracted from a text.
    A teacher wrote the correct guidelines based on the text.
    You goal is to compute a score between 0 and 1 assessing the relevance of guidelines extracted by a student from a text by comparing them to the correct guidelines.
    If both extracted and correct guidelines are empty, return 1.
    If extracted_guidelines is not empty and correct_guidelines is empty, return 0.
    Only the final result, a float in decimal form, must be present after "Answer:". """ # <--- This line has changed

    extracted_guidelines: list = dspy.InputField(desc="the extracted guidelines, reformulated as questions")
    correct_guidelines: list = dspy.InputField(desc="the correct guidelines to compare the extracted guidelines to")
    
    answer: float = dspy.OutputField(desc="Must be a float, not a fraction" , ge=0., le=1.) # <--- This line has changed

I then managed to optimize my metric without error on a train set of about 50 examples, using BootstrapFewShotWithRandomSearch.
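
For reference, the optimization call was along these lines; the inner metric and the hyperparameters shown here are placeholders rather than my exact settings:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Placeholder inner metric: accept a bootstrapped demo when the predicted
# score is close enough to the gold score.
def score_agreement(gold, pred, trace=None):
    return abs(float(pred.answer) - float(gold.answer)) <= 0.1

optimizer = BootstrapFewShotWithRandomSearch(
    metric=score_agreement,
    max_bootstrapped_demos=4,
    num_candidate_programs=8,
)
optimized_metric = optimizer.compile(BaseSimpleMetric(), trainset=metric_train_set)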

I was pretty happy with the results, as I managed to go from a score of 95 to 99.5 on my test set (about 15 examples). 😄

Unfortunately, when I got to the part of optimizing my QuestionsGenerator, the problem surfaced again...

Next step: using my optimized metric to train my QuestionsGenerator

The goal is to have a QuestionsGenerator that extracts guidelines from a text and reformulates them as questions.

Here is the code definition:

from typing import List

class SimpleQuestionsGeneratorSignature(dspy.Signature):
    """If the input text provided contains guidelines and compliance checks, generate a list of questions that can be asked about a document to check if it verifies the guidelines.
    If it doesn't, generate an empty list."""
    text: str = dspy.InputField(desc="may contain guidelines or compliance checks")
    questions: List[str] = dspy.OutputField(desc="a list of questions")

class SimpleQuestionsGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_questions = dspy.TypedChainOfThought(SimpleQuestionsGeneratorSignature)

    def forward(self, text):
        prediction = self.generate_questions(text=text)
        return prediction

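As a quick sanity check, calling the generator looks like this (the input text and the expected questions are made up for illustration):

generator = SimpleQuestionsGenerator()
out = generator(text="All invoices must be approved by a manager before payment.")
print(out.questions)  # e.g. ["Is each invoice approved by a manager before payment?"]
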
My goal is to, again, try to optimize my program, this time using the metric I obtained before.

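COPRO expects the metric to be a plain callable taking (gold, pred, trace=None), so I wrap my compiled metric module accordingly. A sketch (compiled_metric stands for the optimized BaseSimpleMetric from the previous step):

def metric(gold, pred, trace=None):
    # Delegate the scoring to the optimized metric module and return a plain float.
    score = compiled_metric(gold=gold, pred=pred).answer
    return float(score)
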
I first want to run a COPRO on a dataset of about 70 examples (text/questions pairs):

from dspy.teleprompt import COPRO

eval_kwargs = dict(num_threads=16, display_progress=True, display_table=0)

copro_teleprompter = COPRO(
    metric=metric, # your_defined_metric, 
    breadth=2, # num_new_prompts_generated, 
    depth=1, # times_to_generate_prompts, 
    init_temperature=0.0001, # prompt_generation_temperature, 
    log_dir="logs"
)

compiled_program_optimized_signature = copro_teleprompter.compile(
    SimpleQuestionsGenerator(), 
    trainset=train_set, 
    eval_kwargs=eval_kwargs)

This time the optimization stops because of the same error as before: ('Too many retries trying to get the correct output format. Try simplifying the requirements.', {'answer': 'ValueError("could not convert string to float: \'Answer: 0.25\'")'}).

I find this a bit curious because, when I inspect the history, the following result is what triggers the error:

You are given guidelines that a student extracted from a text.
    A teacher wrote the correct guidelines based on the text.
    You goal is to compute a score between 0 and 1 assessing the relevance of guidelines extracted by a student from a text by comparing them to the correct guidelines.
    If both extracted and correct guidelines are empty, return 1.
    If extracted_guidelines is not empty and correct_guidelines is empty, return 0.
    Only the final result, a float in decimal form, must be present after "Answer:".

---

Follow the following format.

Extracted Guidelines: the extracted guidelines, reformulated as questions

Correct Guidelines: the correct guidelines to compare the extracted guidelines to

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: Must be a float, not a fraction (Respond with a single float value)

---

[Here there are three bootstrapped examples]

---

Extracted Guidelines: []

Correct Guidelines:
Reasoning: Let's think step by step in order to compute the answer. Since both the extracted guidelines and the correct guidelines are empty, there is no content to compare. According to the given instructions, if both extracted and correct guidelines are empty, the score should be 1.

Answer: 1.0

Reasoning: Let's think step by step in order to Answer: 0.25

GPT-4o's completion starts after "in order to " on the last line and consists of just "Answer: 0.25".

Questions

I'm not sure what is happening here. The metric does seem to correctly generate an Answer, but maybe the lack of a Reasoning field triggers the error?

wtf-is-flying · May 25 '24 13:05

Did you set a token limit when initializing your LM? COPRO might have just run out of tokens when generating the new instruction, which would explain the cutoff. Is the reason you are using COPRO that you don't have labeled examples for the Question task? You can also use bootstrap optimizers/teleprompters such as BootstrapFewShotWithRandomSearch; they don't need labeled examples either. They generate demos/examples, and demos are much more helpful for the LM than the signatures/instructions that COPRO is optimizing. From my understanding, COPRO is one of the hardest teleprompters to use, if not the hardest.
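
If a low limit is set, raising it might help. A minimal sketch, assuming you initialize the model through the dspy.OpenAI wrapper (the exact budget is just a guess):

# Raise max_tokens so that long reasoning chains and COPRO's generated
# instructions are not cut off mid-completion.
lm = dspy.OpenAI(model="gpt-4o", max_tokens=2000)
dspy.settings.configure(lm=lm)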

tom-doerr · May 25 '24 13:05