Dataset destroys Example.input_keys values

Open jsleight opened this issue 4 months ago • 2 comments

Minimal example (on dspy v2.4.0):

import dspy
examples = [dspy.Example(foo=f, bar=b).with_inputs("foo") for f, b in zip("abcd", "1234")]
print(examples)  # [Example({'foo': 'a', 'bar': '1'}) (input_keys={'foo'}), Example({'foo': 'b', 'bar': '2'}) (input_keys={'foo'}), Example({'foo': 'c', 'bar': '3'}) (input_keys={'foo'}), Example({'foo': 'd', 'bar': '4'}) (input_keys={'foo'})]

from dspy.datasets.dataset import Dataset

class MyDataset(Dataset):
    def __init__(self, examples):
        super().__init__(train_size=1, dev_size=1, test_size=1)
        self._train = [examples[0]]
        self._dev = [examples[1]]
        self._test = [examples[2]]

dataset = MyDataset(examples)
print(dataset.train)  # [Example({'foo': 'a', 'bar': '1'}) (input_keys=None)]
print(dataset.dev)    # [Example({'foo': 'b', 'bar': '2'}) (input_keys=None)]
print(dataset.test)   # [Example({'foo': 'c', 'bar': '3'}) (input_keys=None)]

I expected the input_keys to persist through the Dataset object, but they are reset to None. This line seems to be the problem.
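To illustrate the suspected mechanism without depending on DSPy internals, here is a minimal sketch. `MiniExample` is a hypothetical stand-in, not DSPy's actual `Example` class; it only assumes that the input keys are stored separately from the data fields, so rebuilding an example from its unpacked fields drops them.

```python
# Hypothetical stand-in for dspy.Example; NOT DSPy's actual implementation.
class MiniExample:
    def __init__(self, **fields):
        self._store = dict(fields)
        self._input_keys = None  # metadata kept outside the field store

    def with_inputs(self, *keys):
        # Return a copy with the given field names marked as inputs.
        copied = MiniExample(**self._store)
        copied._input_keys = set(keys)
        return copied

    # keys() and __getitem__ are what make `**example` unpacking work.
    def keys(self):
        return self._store.keys()

    def __getitem__(self, key):
        return self._store[key]


original = MiniExample(foo="a", bar="1").with_inputs("foo")

# Rebuilding from unpacked fields copies the data but not the metadata.
rebuilt = MiniExample(**original)

print(original._input_keys)  # {'foo'}
print(rebuilt._input_keys)   # None
```

If Dataset copies examples in a similar way, that would explain the `input_keys=None` output above.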

jsleight, Apr 24 '24 15:04

Hi @jsleight, thanks for raising this. Currently, the intended workflow is to create your Dataset first and then set the inputs on the examples it returns, as in this example from intro.ipynb:

from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

That said, it does make sense to me to have input_keys persist if they exist. Feel free to push a PR for this change!

arnavsinghvi11, Apr 27 '24 22:04

I might have some time to make a PR. I can envision a couple of approaches, so I'm interested to see which you'd prefer.

  1. Just change the line in Dataset that creates copies of the examples so that it also calls with_inputs.
  2. A more fundamental change to Example so that Example(**example) persists the input_keys. This would make the Dataset class persist the input_keys while adding a bit more functionality to the Example class, but I don't know whether you'd like Example to work this way.
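The two approaches can be sketched roughly as follows. This is only an illustration under stated assumptions: `MiniExample`, its `base` parameter, and the helpers `copy_option_1` and `copy_option_2` are hypothetical names, not DSPy's API, and the real change would live in Dataset or Example.

```python
# Hypothetical stand-in for dspy.Example; NOT DSPy's actual implementation.
class MiniExample:
    def __init__(self, base=None, **fields):
        self._store = {}
        self._input_keys = None
        if base is not None:
            # Option 2: copying from an existing example keeps its input keys.
            self._store = dict(base._store)
            self._input_keys = base._input_keys
        self._store.update(fields)

    def with_inputs(self, *keys):
        copied = MiniExample(base=self)
        copied._input_keys = set(keys)
        return copied

    def keys(self):
        return self._store.keys()

    def __getitem__(self, key):
        return self._store[key]


def copy_option_1(example, **extra):
    """Option 1: rebuild from fields, then re-apply the original input keys."""
    copied = MiniExample(**example, **extra)
    if example._input_keys is not None:
        copied = copied.with_inputs(*example._input_keys)
    return copied


def copy_option_2(example, **extra):
    """Option 2: let the constructor carry the input keys over."""
    return MiniExample(base=example, **extra)


src = MiniExample(foo="a", bar="1").with_inputs("foo")
a = copy_option_1(src, split="train")
b = copy_option_2(src, split="train")
print(a._input_keys, b._input_keys)  # {'foo'} {'foo'}
```

Option 1 keeps the change local to the copying code, while option 2 changes how every copy of an example behaves, which is the trade-off raised in the list above.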

jsleight, Apr 29 '24 15:04