prompttools
experiment.evaluate() shows stale evaluation results
🐛 Describe the bug
Hi folks,
Thanks again for your work on this library.
I noticed an issue where similarity scores are not updated when I change the expected values; they are only refreshed when I re-run the experiment.
Steps to reproduce the bug:
from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import semantic_similarity

models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0613"]
messages = [
    [
        {"role": "system", "content": "Who is the first president of the US? Give me only the name"},
    ]
]
temperatures = [0.0]

experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
experiment.run()
experiment.visualize()

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["George Washington"] * 2)
experiment.visualize()

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()  # the evaluation results here indicate that "Lady Gaga" is semantically identical to "George Washington"
In my opinion, evaluate() should re-compute metrics every time it is called, rather than being coupled to run(). I haven't tested this with other eval_fns, but it may be worth checking whether they behave the same way.
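To illustrate what I mean, here is a minimal, hypothetical sketch (not prompttools' actual implementation) of recompute-on-every-call behavior, assuming the results live in a pandas DataFrame with a "response" column; recompute_metric is a made-up helper name:

import pandas as pd
from typing import Callable, List

def recompute_metric(results_df: pd.DataFrame, metric_name: str,
                     eval_fn: Callable[[str, str], float],
                     expected: List[str]) -> pd.DataFrame:
    # Always drop any existing column for this metric before recomputing,
    # so repeated calls never leave stale scores behind.
    if metric_name in results_df.columns:
        results_df = results_df.drop(columns=[metric_name])
    results_df[metric_name] = [
        eval_fn(response, exp)
        for response, exp in zip(results_df["response"], expected)
    ]
    return results_df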
Your observation is correct. Currently, if a metric already exists ("similar_to_expected" in your case), evaluate() raises a warning (as seen in your notebook: "WARNING: similar_to_expected is already present, skipping") rather than overwriting it.
If you change the metric name in the second .evaluate call (e.g. experiment.evaluate("similar_to_expected_2", ...)), it will compute another column.
We are open to overwriting the existing metric instead. Let us know what you think.
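Concretely, the rename workaround looks roughly like this (the metric name "similar_to_expected_2" and the expected values are only examples, reusing the experiment object from the snippet above):

from prompttools.utils import semantic_similarity

# A new metric name avoids the "already present, skipping" warning
# and forces a fresh column to be computed.
experiment.evaluate("similar_to_expected_2", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()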
Thank you for this issue. I changed the metric name as suggested, but the response column is still stale. Any leads on this? I am using Python 3.11.5.
Hi @Sruthi5797,
Can you post a minimal code snippet of what you are running? Also, are you seeing any warning message?