
`rasa test` to include action server

Open ArjaanBuijk opened this issue 3 years ago • 25 comments

Description of Problem: How do I test that my bot is working?

Currently, rasa test:

  • Fully tests NLU, including external entity extractors like duckling
  • Does not fully test e2e stories, because it does not actually call the action server.

This means that even when rasa test shows that all the e2e stories pass, it is not a guarantee that the bot is functional. One has to test all the e2e stories manually, which is a lot of work and it slows down the CDD iterations.

Running tests with the action server included is especially important during the CDD Fix step, or when upgrading a bot from Rasa OSS 1 to Rasa OSS 2.

A side benefit of including the action server during rasa test is that it would provide a nice method for debugging custom actions, and a declarative mechanism for creating full-coverage tests of the custom actions. (https://github.com/RasaHQ/rasa/issues/4212#issuecomment-698888514)

Overview of the Solution: rasa test calls the action server.

Blockers (if relevant): It is not a blocker, because it is standard testing practice. But the user must be careful that the action server is running in some kind of local mode or testing mode, e.g. that during testing it does not actually update a production database or call external services.

Slack thread

ArjaanBuijk avatar Sep 25 '20 14:09 ArjaanBuijk

Exalate commented:

TyDunn commented:

Anything you would add @akelad? We have discussed this in the past

TyDunn avatar Sep 25 '20 14:09 TyDunn

Exalate commented:

TyDunn commented:

Context behind past decision to not execute actions in test stories: https://forum.rasa.com/t/snapshot-based-testing-with-rasa/13318

TyDunn avatar Sep 29 '20 09:09 TyDunn

Exalate commented:

akelad commented:

not much to add to this, but we have had customers bring this up a few times in the past year.

Having a mock action server for testing is, I believe, common in the enterprise anyway; otherwise you would be modifying "real world things" with your manual tests as well. So I feel like allowing rasa test to include the action server again could make sense

akelad avatar Sep 29 '20 15:09 akelad

Exalate commented:

TyDunn commented:

I really think we should make running the actions a possibility, especially given another data point that Ella added to the Slack thread yesterday. I'm adding this to the Rasa Open Source 2.1 milestone to push this discussion more

TyDunn avatar Oct 14 '20 09:10 TyDunn

Exalate commented:

twerkmeister commented:

So I've looked a bit into this issue. I think this is actually quite a big architectural and dev experience decision and I would like to discuss this a bit more to make sure we are on a good path forward.

If I understand correctly the proposal is that when using rasa test the action server should run alongside so that custom action code is run to validate that slot_was_set is actually happening and use this as test cases for the custom actions. As a result, it would be easier to test that the actions are actually doing what they are supposed to do.

In the following I am comparing a unit testing-based approach that we currently advise in the docs to this approach (yellow box at the end of this section). While we have that hint in the rasa docs, there is no section on testing actions in the actions server docs ⚠️

Unit Testing

Pros
  1. Extensive test capabilities beyond simple equality for values. slot_was_set would currently just amount to direct equality check
  2. Simpler to test actions for multiple different values
  3. Fosters configurability and testability of actions from the get go as opposed to making it an afterthought
  4. Separating testing the machine learning model and manually written code leads to faster turn around time for developing actions
  5. Testing different environments for actions is straight forward (e.g. working service, failing service, slow service, ...)
Cons
  1. Seems Difficult/Cumbersome to manually write realistic tracker states as inputs for unit tests
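
The "different environments" point above can be sketched without any Rasa dependency. This is only an illustration under assumptions: `check_balance`, `WorkingService` and `FailingService` are hypothetical names, and the events are plain dicts in the shape Rasa uses for serialized slot events. The idea is that the action's core logic receives its backend as a parameter, so a test can hand it a working or a failing fake.

```python
from typing import Any, Dict, List


class WorkingService:
    """Hypothetical fake backend that always answers."""

    def get_balance(self, account_id: str) -> float:
        return 42.0


class FailingService:
    """Hypothetical fake backend that simulates an outage."""

    def get_balance(self, account_id: str) -> float:
        raise ConnectionError("backend unavailable")


def check_balance(service: Any, account_id: str) -> List[Dict[str, Any]]:
    """Hypothetical core of a custom action: returns the events it would emit."""
    try:
        balance = service.get_balance(account_id)
    except ConnectionError:
        # The outage is reported through a slot so the dialogue can react to it.
        return [{"event": "slot", "name": "service_available", "value": False}]
    return [
        {"event": "slot", "name": "service_available", "value": True},
        {"event": "slot", "name": "balance", "value": balance},
    ]
```

A real custom action would call such a function from its `run` method and return the resulting events; the unit tests then only need the function and the fakes.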

Action server in rasa test

Pros
  1. some test cases are already there and we can use them
Cons
  1. Test capabilities are limited; developing additional capabilities might be quite complex (building our own testing framework). As stated above, slot_was_set would currently just amount to a direct equality check
  2. Tight coupling of model and action testing - I am thinking, if you want to debug an action, you might want to do so for specific test cases - this means we would ideally also have a mechanism to single out specific test stories from the set instead of running all of them. This is already easily doable in unit testing - just pick your test case - but needs additional deliberation and work in the conversation setting.
  3. I think even with this capability in place, you might still wanna unit test your actions for situations when background services fail etc. You could do this here, but then you would need to specify environment configs alongside stories

In my opinion, we should go with unit testing approach and make it easier to understand and use:

  • Improve the docs (prominent section on how to test actions, best practices)
  • Provide some abstraction for the test cases, e.g. the one outlined here by @koaning
  • Potentially provide some method to generate complex tracker states (does this possibly already exist?)
    • option a) export tracker state from an interactive session -> paste data into the unit test.
    • option b) write conversation in the unit test -> simulate tracker state based on config, domain and the given conversation -> use resulting tracker state to test
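
As a rough illustration of what a tracker-state helper could look like, here is a minimal sketch. It is much simpler than option b): it does not run any policy or read the domain, it only assembles a dict from a list of user turns. The helper name `tracker_from_turns` and the `Turn` alias are hypothetical; the keys (`sender_id`, `slots`, `latest_message`, `events`) are meant to mirror the tracker JSON, which is an assumption to check against an exported tracker.

```python
from typing import Any, Dict, List, Optional, Tuple

# One user turn: (intent name, {entity name: entity value})
Turn = Tuple[str, Dict[str, Any]]


def tracker_from_turns(
    turns: List[Turn], slots: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build a minimal tracker-like dict from a short list of user turns."""
    events: List[Dict[str, Any]] = []
    latest: Dict[str, Any] = {}
    for intent, entities in turns:
        latest = {
            "intent": {"name": intent},
            "entities": [
                {"entity": name, "value": value}
                for name, value in entities.items()
            ],
        }
        events.append({"event": "user", "parse_data": latest})
    return {
        "sender_id": "test_user",
        "slots": dict(slots or {}),
        "latest_message": latest,
        "events": events,
    }
```

For example, `tracker_from_turns([("greet", {}), ("inform", {"city": "Berlin"})])` yields a state whose latest message carries the `inform` intent and one `city` entity, which is enough input for many slot-handling unit tests.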

What do you think @ArjaanBuijk , @koaning, @TyDunn. Is this an accurate overview of the choices and their merits? Did I miss something? Please discuss

twerkmeister avatar Mar 01 '21 18:03 twerkmeister

Exalate commented:

TyDunn commented:

From a high level, this seems like a fair overview at first glance. Curious to also hear what @wochinge thinks, since he spent some time in the code last Friday, as well as @akelad and the others you tagged

TyDunn avatar Mar 02 '21 16:03 TyDunn

Exalate commented:

wochinge commented:

Thanks for writing this detailed overview! 💯

In my opinion we need both. I see conversation tests as integration tests. From that perspective it makes sense to include the action server in the conversation tests, as you want to ensure the behavior of all components combined. It's less about checking certain slot values in my opinion, but rather whether the conversation flows as expected (e.g. that the custom action isn't causing the bot to say goodbye after the user said hi). This is of course also possible with unit testing, but I think it's fragile as:

  • you need a very good understanding of which custom action events cause which bot behavior
  • you need to have all stories / rules in mind when designing unit tests for the action server
  • conversation tests and unit tests are separated. So when you change a story you need to think of two places to adapt instead of just one.

Considering a team workflow where the conversation designer comes up with the conversations and the developers implement them, I'd expect the conversation designer also to be the one who is acceptance testing the assistant's behavior. Ideally they can define a definition of done (stories of done 😆 ) right at the beginning which the developers can use to develop the bot against. Although this makes me cringe as developer, I'd argue that unit tests are less of a priority / have less value in this setting.

Tight coupling of model and action testing

Interesting point. I think actions which set featurized conversation state are somewhat part of the model 🤔

Slightly off topic: Your slot_was_set example touches a very interesting point. As far as I know our conversation tests do not test slot values at the moment. In my opinion this is less interesting for featurized slots as they will influence the conversation flow anyway, but it might be interesting to check for unfeaturized slot values (e.g. whether the form extracts slots as expected).

wochinge avatar Mar 03 '21 09:03 wochinge

Exalate commented:

akelad commented:

+1 to Tobias' opinion, for me the full conversation tests are more important, i.e. the action server in rasa test option. Every customer we've ever worked with always needs to test the full flow, including the execution of the custom actions. Something we should consider though, is that you might not want to run your "prod custom action" because that might be modifying real world data. So you'd have some sort of mock custom action instead - not actually sure how our customers handle this, though i'd imagine they just have slight variations of custom actions in their dev/qa/prod envs?

akelad avatar Mar 03 '21 09:03 akelad

Exalate commented:

twerkmeister commented:

Thanks for the food for thought! Lots of new info here for me, will look into these things

twerkmeister avatar Mar 03 '21 12:03 twerkmeister

Exalate commented:

wochinge commented:

So you'd have some sort of mock custom action instead - not actually sure how our customers handle this, though i'd imagine they just have slight variations of custom actions in their dev/qa/prod envs?

Up to them to create a proper integration test environment, no? If their custom actions need a database then their integration tests need a test database 🤷🏻
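
One way an action server could pick a test database is sketched below. Everything in it is an assumption for illustration: the variable names `ACTION_SERVER_ENV` and `PROD_DATABASE_URL` and the helper `database_url` are hypothetical, not Rasa settings. The sketch defaults to the test database so that a forgotten setting does not touch production data.

```python
import os
from typing import Mapping, Optional


def database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the database URL for the current (hypothetical) environment."""
    env = os.environ if env is None else env
    # Default to "test" so an unset variable never selects production.
    mode = env.get("ACTION_SERVER_ENV", "test")
    if mode == "test":
        return "sqlite:///:memory:"
    if mode == "prod":
        # Fails with KeyError if the production URL was not configured.
        return env["PROD_DATABASE_URL"]
    raise ValueError(f"unknown ACTION_SERVER_ENV: {mode!r}")
```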

wochinge avatar Mar 03 '21 12:03 wochinge

Exalate commented:

akelad commented:

yeah i was kind of thinking aloud and came to this conclusion in the end as well

akelad avatar Mar 03 '21 12:03 akelad

Exalate commented:

TyDunn commented:

Something we should consider though, is that you might not want to run your "prod custom action" because that might be modifying real world data. So you'd have some sort of mock custom action instead - not actually sure how our customers handle this, though i'd imagine they just have slight variations of custom actions in their dev/qa/prod envs?

@erohmensing @ArjaanBuijk You mentioned some customers that you work with already run actions as part of whole conversation tests. Do you know how they handle this?

TyDunn avatar Mar 03 '21 12:03 TyDunn

Exalate commented:

ArjaanBuijk commented:

@TyDunn , There were always non-prod environments to run these tests, and those non-prod environments are complete copies of the prod environment, including the action server.

ArjaanBuijk avatar Mar 03 '21 14:03 ArjaanBuijk

Exalate commented:

twerkmeister commented:

Wrapping up my current thoughts and findings ...

@wochinge As some more context have a look at the conversation between @ArjaanBuijk and @TyDunn in slack from a couple months back:

The [...]-demo has a lot of action code for slot validation, and it sets additional slots on the fly. There is no way right now to test if this works, except by doing manual testing.

Arjaan continues:

Seems @ArjaanBuijk first and foremost cares about slot set events. At least in that conversation there was no mention of bot utterances. Maybe you can confirm?

From this context, I focused more on the slot events in my overview.

@wochinge You raise an interesting point though about the bot utterances by the action. If I am not mistaken these dispatched utterances appear neither in training data nor in current test stories, or do they?

For example, the rasa-demo bot has an action for greeting users, which utters a bunch of messages. The training stories do not however refer to any of these utterances. Likewise the test stories do not refer to any of these utterances. Is this demo code done in a wrong way? I am not sure whether there is a way to capture these utterances in the training or test stories file formats as of now.

twerkmeister avatar Mar 09 '21 11:03 twerkmeister

Exalate commented:

wochinge commented:

If I am not mistaken these dispatched utterances appear neither in training data nor in current test stories, or do they?

No, they don't. Besides SlotSet events there might be other "featurized" things to test though, e.g. ActiveLoop or FollowupAction etc.

wochinge avatar Mar 09 '21 12:03 wochinge

Exalate commented:

ArjaanBuijk commented:

@twerkmeister ,

I indeed mentioned the slots being set by actions, but there are also other events that need to be checked, as @wochinge points out.

A problem with not including a live action server in the e2e testing is that the e2e tests tend to go out of sync with the custom actions. It is very easy to forget to update the e2e test when you update a custom action. Even when you write unit tests for the custom actions, you are not actually testing the full bot.
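
To make the kind of check discussed here concrete, the following sketch compares the events a test story expects with the events a custom action actually returned. It is not how rasa test works; `featurized` and `missing_events` are hypothetical helpers, and the event type names (`slot`, `active_loop`, `followup`) are assumed to match the serialized form of SlotSet, ActiveLoop and FollowupAction.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]

# Assumed serialized names of the events that influence predictions.
FEATURIZED_EVENT_TYPES = {"slot", "active_loop", "followup"}


def featurized(events: List[Event]) -> List[Event]:
    """Keep only featurized events and drop timestamps before comparing."""
    return [
        {key: value for key, value in event.items() if key != "timestamp"}
        for event in events
        if event.get("event") in FEATURIZED_EVENT_TYPES
    ]


def missing_events(expected: List[Event], actual: List[Event]) -> List[Event]:
    """Return the expected featurized events that the action did not produce."""
    produced = featurized(actual)
    return [event for event in featurized(expected) if event not in produced]
```

An empty result means the action produced every event the story expects; a non-empty result points at the story steps that have drifted from the action code.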

ArjaanBuijk avatar Mar 09 '21 19:03 ArjaanBuijk

Exalate commented:

m-vdb commented:

For the record, decreasing priority to normal. We'll need time to scope the feature properly and we need to prioritise it together with our other initiatives (see Slack thread in issue description for more info)

m-vdb avatar Mar 10 '21 09:03 m-vdb

Exalate commented:

akelad commented:

This has come up a bunch of times over the years from multiple customers

akelad avatar Apr 26 '21 16:04 akelad

Exalate commented:

melindaloubser1 commented:

One more data point from a user, in favour of testing custom actions:

In summary: The combination of not being able to direct a conversation based on intent x entity value, and not being able to run the custom actions that work around that limitation, is particularly frustrating.

Currently, entity values cannot directly influence action prediction, only entity types. If you autofill a categorical slot of the same name with the entity value, you can direct the next step of the conversation, but now it also gets set at every point in the conversation where that entity is extracted, which is not always desirable.

To work around this limitation, you can create a slot with a different name, and fill it from the entity using a custom action only in those stories where it is needed. Depending on how many instances of this you have, you can end up with many more custom actions than before, and none of them are run during testing.
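
The workaround described above could look roughly like the sketch below. It is written against a plain tracker dict rather than the SDK classes so that it stays self-contained; `copy_entity_to_slot` and the slot name used in the example are hypothetical, and the dict layout (`latest_message`, `entities`, `entity`, `value`) is assumed to follow the tracker JSON.

```python
from typing import Any, Dict, List


def copy_entity_to_slot(
    tracker: Dict[str, Any], entity: str, slot: str
) -> List[Dict[str, Any]]:
    """Return a slot event copying the first matching entity value, if any."""
    entities = tracker.get("latest_message", {}).get("entities", [])
    for found in entities:
        if found.get("entity") == entity:
            return [{"event": "slot", "name": slot, "value": found.get("value")}]
    # No matching entity in the latest message: leave the slot untouched.
    return []
```

For example, calling it with `entity="account_type"` and `slot="requested_account_type"` only fills the second slot in the stories that invoke this action, instead of at every point where the entity is extracted.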

indam23 avatar Apr 29 '21 15:04 indam23

Exalate commented:

rgstephens commented:

Are there any updates on this?

rgstephens avatar Sep 15 '21 18:09 rgstephens

Exalate commented:

wochinge commented:

Enable has done a spike on conversation testing and then benched it for 3.0 after talking to a few customers, as we've realized that it doesn't make sense to do something hacky which doesn't work long term. What I know from @TyDunn is that it's gonna be one of the two key things for 3.1

wochinge avatar Sep 17 '21 08:09 wochinge

Exalate commented:

rgstephens commented:

Is this still slated for 3.1? I'm doing a lot of manual testing with rasa shell --debug because of this and #9013

rgstephens avatar Nov 22 '21 17:11 rgstephens

Exalate commented:

m-vdb commented:

It's currently planned for next year and unclear in which release it will go into. This is an issue we want to address holistically with other pain points our customers are seeing. Stay tuned!

m-vdb avatar Nov 23 '21 08:11 m-vdb

Hello :) Is it maybe possible to get an estimate of when we could expect a solution to this issue? It just occurred to me that the values of slot_was_set are not being considered in test stories (neither is slot extraction or validation inside forms), and as I understand it this boils down to being able to test Rasa with a running action server. Knowing the potential time frame could help me decide if writing a custom solution is worth the effort.

Nummulit avatar May 31 '22 13:05 Nummulit

As a workaround, you'll find a pytest example in the financial-demo here.

rgstephens avatar May 31 '22 14:05 rgstephens

Closing this issue as we're planning more discovery on this topic in the next months

m-vdb avatar Dec 07 '22 14:12 m-vdb

Believe it or not: this feature is absolutely necessary! I cannot believe that it was silently closed more than 2 years after it was opened, when people are clearly stating that this is an important feature. Testing Rasa is such a pain because of this. At least you could have transferred this issue over to Jira. What are these issues good for otherwise?

lumpidu avatar Jan 24 '23 16:01 lumpidu

@lumpidu thanks for your comment, and sorry for the miscommunication on my end. I had written this above:

Closing this issue as we're planning more discovery on this topic in the next months

We're continuing internal work on this, and you should hear a few updates in the next months. I'll see about re-creating a Jira ticket in the OSS backlog. Thanks a lot for your interest!

m-vdb avatar Jan 31 '23 11:01 m-vdb