
Optimized prompt for multi-class classification contains examples for only a subset of classes

aaronbriel opened this issue 5 months ago · 5 comments

I followed the tutorials for optimizing a DSPy program for the task of multi-class classification, but the "optimized" prompt covers only a small subset of the available classes (intents), making it unsuitable for consideration in a production environment.

I'll provide the relevant chunks of notebook code, but I can't show the prompt itself because it contains production data. Hopefully this is sufficient to identify the issue.

ISSUE 1: The main issue is that the final "optimized" prompt contains a single few-shot sample for only 8 of the 41 intents (one of those intents has 2 samples). I expected it to contain multiple few-shot samples for each of the 41 intents.

ISSUE 2: The secondary issue is that the evaluation metric showed a rather low score of 64.34. I expected this to be much higher, since I trained with a decent-sized, manually curated ground-truth dataset of 50 samples per intent.

I'm guessing this is related to my optimizer configuration but I'm not sure what to adjust. Please advise. Thank you!

# source .env file
import os
import sys
from dotenv import load_dotenv
load_dotenv()

# Add the project directory and PYTHONPATH to sys.path
sys.path.append('/Users/abriel/repos/projectname/')
sys.path.append(os.getenv('PYTHONPATH'))

import dspy
from dspy.datasets import DataLoader
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Load the intent keys from an external source
from src.variables import intent_keys

# Set up the model using OpenAI's GPT
gpt4o = dspy.OpenAI(model=os.environ['DEFAULT_MODEL'])
dspy.configure(lm=gpt4o)

# Define the Intent Classifier Signature
class IntentClassifier(dspy.Signature):
    """
    Classifies a person's response into one of the given intents, based on the conversation
    between two people, person1 and person2.
    """
    conversation = dspy.InputField(
        desc="A conversation between person1 and person2.",
        prefix="Conversation: "
    )
    script_question = dspy.InputField(
        desc="Person1 question.",
        prefix="Question: "
    )
    response = dspy.InputField(
        desc="Person2's response to the question from person1.",
        prefix="Response: "
    )
    intent = dspy.OutputField(desc="One of the following intents: " + ", ".join(intent_keys))

# Create the IntentClassifierModule that incorporates ChainOfThought
class IntentClassifierModule(dspy.Module):
    """
    A module that defines the intent classification process.
    """
    def __init__(self):
        super().__init__()
        self.signature = IntentClassifier
        self.predictor = dspy.ChainOfThought(signature=self.signature)

    def forward(self, conversation, question, response):
        """
        Runs the forward pass for classifying intents.
        """
        result = self.predictor(
            conversation=conversation,
            script_question=question,  # map to the signature's field name
            response=response
        )
        return dspy.Prediction(
            intent=result.intent
        )
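
# Quick smoke test of the module (placeholder strings, not production data)
classifier = IntentClassifierModule()
prediction = classifier(
    conversation="person1: Hello! person2: Hi there.",
    question="Are you available tomorrow?",
    response="No, I'm busy tomorrow."
)
print(prediction.intent)  # should be one of intent_keys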

# Load and split datasets
dl = DataLoader()

full_dataset = dl.from_csv(
    "dataset_name.csv",
    fields=("conversation", "question", "response", "intent"),
    input_keys=("conversation", "question", "response")
)
splits = dl.train_test_split(full_dataset, train_size=0.8)
train_dataset = splits['train']
test_dataset = splits['test']
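
# Sanity check (assumes each example exposes .intent): confirm per-intent counts
# in the train split; I expect roughly 40 per intent (50 samples * 0.8)
from collections import Counter
train_counts = Counter(example.intent for example in train_dataset)
print(len(train_counts), "intents in train split")
print(train_counts.most_common(5))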

# Validation function to compare predicted and actual intents
def validate_answer(example, pred, trace=None):
    """
    Validates the prediction by comparing it to the actual intent.
    """
    return example.intent.lower() == pred.intent.lower()
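
# Quick check that the metric matches case-insensitively (toy intent names)
toy_example = dspy.Example(intent="intent_a")
toy_pred = dspy.Prediction(intent="Intent_A")
assert validate_answer(toy_example, toy_pred)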

# Configure the optimizer
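# (Note: as I understand it, these caps bound the total number of few-shot
# demos in the compiled prompt, independent of the number of intents.)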
config_ = {
    "max_bootstrapped_demos": 8,
    "max_labeled_demos": 8,
    "num_candidate_programs": 10,
    "num_threads": 4
}

# Use BootstrapFewShotWithRandomSearch to optimize the prompt
teleprompter = BootstrapFewShotWithRandomSearch(
    metric=validate_answer,
    **config_
)

# Compile and save the optimized program
optimized_program = teleprompter.compile(IntentClassifierModule(), trainset=train_dataset)
optimized_program.save('/Users/abriel/repos/projectname/optimized_intent_classifier.json')
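
# For completeness: reloading the compiled program into a fresh module
# instance for later use (standard DSPy save/load, as I understand it)
loaded_program = IntentClassifierModule()
loaded_program.load('/Users/abriel/repos/projectname/optimized_intent_classifier.json')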

This resulted in a successful "training" run, executing 8 sets. I then ran an evaluation:

from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=test_dataset, num_threads=1, display_progress=True, display_table=5)
evaluator(optimized_program, metric=validate_answer)

I then checked the optimized prompt by doing:

gpt4o.inspect_history(n=1)
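
To tally the demos directly instead of eyeballing the prompt, I also counted them from the compiled predictor (assuming the demos carry the intent field):

from collections import Counter
demos = optimized_program.predictors()[0].demos
print(len(demos), "few-shot demos total")
print(Counter(demo.intent for demo in demos))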

ISSUE 1: The resulting optimized_intent_classifier.json contained a single few-shot sample for only 8 intents (one of those intents had 2 samples). There are 41 intents, so I expected multiple few-shot samples for each of the 41.

ISSUE 2: The evaluation showed a final score of 64.34, which was admittedly far lower than expected, given that I provided a ground-truth dataset of 50 samples per intent.

aaronbriel, Sep 18 '24 15:09