
Add support for threaded tests

Open SiXoS opened this issue 1 year ago • 8 comments

What

This is a draft for adding support for threaded tests that can be used with e.g. OpenAI chats or assistants.

I created it as a very early draft to ask if this is a feature you want to have, and as a base for discussions on implementation details. I hope that you are interested in the suggestion.

Why

In my case, my assistant will be used almost exclusively through a sequence of questions and answers. This cannot be tested sufficiently with a single request, and I cannot initialize the provider with a list of messages since the OpenAI Assistants API only allows the client to add messages with role: user.

How

I have a suggestion for the syntax, but feedback is expected and welcome! I have an example in the PR. Basically, a test case can have a property called thread, which is a list of test cases that will be run sequentially on the same thread. This is not recursive. A test case that has thread cannot have assert, and vice versa.
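A minimal sketch of the proposed shape (this is not the PR's actual example; the provider ID, prompts, and assertion values are placeholders for illustration):

```yaml
prompts:
  - "{{ prompt }}"

providers:
  # hypothetical assistant ID, for illustration only
  - openai:assistant:asst_abc123

tests:
  - thread:
      # Each entry is a regular test case; they run sequentially
      # on the same thread, so the assistant keeps its context.
      - vars:
          prompt: I'd like to book a table for two.
        assert:
          - type: icontains
            value: "when"
      - vars:
          prompt: Tomorrow at 7pm.
        assert:
          - type: icontains
            value: "confirmed"
```

Each sub-test keeps its own vars and assert, while the enclosing thread ties them to one conversation.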

The ApiProvider uses the CallApiContextParams.thread to keep track of which threadId to use. See the OpenAiAssistantProvider for an example.

One concern is that this use-case doesn't fit the structure of having one main prompt and then variables per case. You probably want a prompt per case.

What makes this a draft?

  1. I have to add a lot more tests.
  2. Concurrency has to be handled correctly; threaded tests can't run concurrently.
  3. I have not even looked at what the output looks like; that has to be inspected.
  4. Implementations have to be added for all relevant providers.

Let me know what you think about the feature suggestion and if you have any concerns with my suggested implementation.

SiXoS avatar Mar 29 '24 13:03 SiXoS

Thank you for putting this together! It's an awesome start.

I wonder if there is a way to generalize the solution so it's not just for OpenAI assistants, but for anyone who wants to test a chat-style conversation with one message after another. Definitely see a lot of people jumping through hoops to build chat conversations.

typpo avatar Mar 31 '24 09:03 typpo

I wonder if there is a way to generalize the solution so it's not just for OpenAI assistants, but for anyone who wants to test a chat-style conversation with one message after another.

I think the current solution could already support other providers. The two solutions for conversations that I've seen are either a conversation id, or sending the whole conversation every time. The threadId property in ThreadContext is of type any, so for the providers where you send the entire conversation, the conversation could simply be stored in threadId. Of course, the property should probably be renamed to thread.

SiXoS avatar Apr 02 '24 09:04 SiXoS

Another big issue that I realized is variable combinations. Let's say you have the following test case:

prompts: "{{ prompt }}"

tests:
  - thread:
      - vars:
          prompt: [ I want to write a poem, I want to write a novel ]
        assert:
          - type: icontains
            value: "What about?"
      - vars:
          prompt: Edgar Allan Poe
        assert:
          - type: icontains
            value: |
              Edgar Allen Poe
              Had a pretty big toe

In this test, both questions would be posted to the same thread which doesn't make much sense. The conversation would look something like:

  1. I want to write a poem
  2. What about?
  3. I want to write a novel
  4. Oh, I thought you wanted a poem
  5. Edgar Allan Poe
  6. Edgar Allen Poe Had a pretty big toe

So, these are the two suggestions I have. Do you have anything else in mind?

Do not allow variable expansions in sub-tests

This is easy implementation-wise, but could be unintuitive to users: it wouldn't be clear that variable expansion is ignored. It might be OK, though, as it should be pretty clear from the output.

Just leave it as is

Allow variables to be expanded as above. It's a bit weird but easy to avoid.

SiXoS avatar Apr 02 '24 09:04 SiXoS

@typpo I have implemented a first working solution. I have addressed the earlier concerns of mine:

  1. I have added a few more tests
  2. Threaded tests are run synchronously
  3. The output tables look good as far as I'm concerned. It's not very clear which tests are threaded, but maybe that's fine?
  4. I still have to add the implementation for all relevant providers, but I figured I could ask for feedback before adding it everywhere.

This is how I solved variable combinations (for an example, see examples/openai-assistant-thread/promptfooconfig.yaml):

Variables that are specified at the top level or in default tests are expanded before the thread test cases are constructed. Variables that are specified on the sub-level are expanded for that specific sub-level only.
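As a sketch of those semantics (values made up for illustration, not taken from the actual example file): a top-level variable list fans out into separate threads, while a sub-level variable only applies to its own sub-test:

```yaml
tests:
  - vars:
      # Top level: expanded before the thread is constructed,
      # so this produces two independent threads, one per value.
      topic: [poems, novels]
    thread:
      - vars:
          # Sub level: expanded for this sub-test only.
          author: Edgar Allan Poe
        assert:
          - type: icontains
            value: "{{ topic }}"
```

This avoids the earlier problem where both values of a variable list would be posted into the same thread.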

Let me know if you have any further feedback.

SiXoS avatar May 14 '24 13:05 SiXoS

I added implementations for all the relevant providers.

SiXoS avatar May 15 '24 13:05 SiXoS

Thank you, @SiXoS! I missed this PR because it's further down the list. I'll give it a proper review and test. Very excited to see this

typpo avatar May 16 '24 04:05 typpo

Great! I made an additional small change that I think makes the code easier to understand; I hope it doesn't mess with your review process too much. I wasn't very happy with how the test case loop was constructed: the order in which cells were handled was a bit weird, so I changed it to fill column-by-column. Using the previous example, this is how tables are now filled:

+-----------+--------+
| Outputs   |        |
+-----------+--------+
| gpt-3     | gpt-4  |
+-----------+--------+
| 1         | 4      |
+-----------+--------+
| 2         | 5      |
+-----------+--------+
| 3         | 6      |
+-----------+--------+

SiXoS avatar May 17 '24 07:05 SiXoS

Hi @typpo! Let me know if there is anything that I can do to make it easier to do a review. I'm also open to very open-ended feedback, like "this part is bad". As long as I know the problem area, I think that I can figure out an improvement :)

SiXoS avatar May 24 '24 09:05 SiXoS