
Support user workflows

Open danielmitterdorfer opened this issue 4 years ago • 0 comments

Scope

So far, Rally has been able to issue individual API requests against Elasticsearch and measure request metrics. This issue describes how Rally should be extended to support benchmarking entire user workflows, which allows for more realistic benchmarks based on end-user experience.

Example Scenario

Consider a user who interacts with Kibana's "Discover" view. They might execute the following steps:

  1. Open discover view
  2. Expand the date range
  3. Add a custom filter

Each of these steps might lead to multiple requests against Elasticsearch, and between steps the user takes a little while to inspect the returned data ("user think time"). When running a benchmark we want to specify:

  1. How many users are executing a workflow concurrently and at what rate (target throughput).
  2. The steps that are executed. (Since a single step in the UI typically leads to multiple requests against Elasticsearch, the composite operation type is usually suitable for modelling one step.)
  3. The user think time. In the first iteration this could be a constant, but we might want to allow varying it based on a distribution.
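The think-time idea above could be sketched as follows. This is a minimal illustration, not Rally code; the function name and the choice of an exponential distribution are assumptions, picked as one plausible way to model variable user pauses.

```python
import random

def think_time(mean_seconds, distribution="constant", rng=random):
    """Return a user think time in seconds.

    "constant" always returns the mean; "exponential" samples around it.
    Both the name and the supported distributions are illustrative only.
    """
    if distribution == "constant":
        return mean_seconds
    if distribution == "exponential":
        # random.expovariate takes the rate (1 / mean)
        return rng.expovariate(1.0 / mean_seconds)
    raise ValueError(f"unknown distribution: {distribution}")

print(think_time(2.0))  # constant think time: always 2.0 seconds
```

A constant keeps runs reproducible; a distribution avoids all simulated users pausing in lockstep.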

We should report the following metrics:

  • Achieved throughput for the entire workflow. Reporting throughput for individual steps is not helpful because throttling happens at the workflow level, so the throughput of individual steps is determined by their service time.
  • Service time, latency and error rate for the entire workflow.
  • Service time and error rate per step.
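Aggregating per-step samples into workflow-level metrics could look roughly like this. The sample shape and function name are assumptions; whether think time counts toward workflow service time (and how latency is derived) is left open here, matching the open design questions above.

```python
def workflow_metrics(step_samples):
    """Aggregate per-step samples of one workflow iteration.

    step_samples: list of dicts like
      {"step": "open-discover", "service_time_ms": 5.57, "error": False}
    The workflow service time is the sum of the step service times;
    the workflow counts as errored if any step errored.
    """
    service_time = sum(s["service_time_ms"] for s in step_samples)
    errored = any(s["error"] for s in step_samples)
    return {"service_time_ms": service_time, "error": errored}

iteration = [
    {"step": "open-discover", "service_time_ms": 5.57, "error": False},
    {"step": "filter-discover", "service_time_ms": 3.93, "error": False},
]
print(workflow_metrics(iteration))
```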

Subtasks

In order to support this scenario we need to extend Rally in the following areas:

  • Introduce a new concept of a "workflow" for tracks. In the domain model, a workflow will be a special kind of task. It can contain other tasks (the individual steps that a user executes), but a workflow can also be nested in a parallel element to allow running multiple workflows concurrently.
  • Introduce a concept of user think time. At this point it is not certain whether we need to treat this as an individual concept or whether we just inject sleep tasks in between steps with a track processor based on the specified user think time.
  • Enhance the load generator to handle workflows. Implementation note: If we enhance ScheduleHandle to handle workflows, it can probably just emit the individual workflow steps and AsyncExecutor can stay mostly untouched which would reduce a lot of complexity.
  • Add workflow information to metrics documents.
  • Enhance reporting to include workflows. See below for some possibilities for how that might look.
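The sleep-injection variant from the subtasks above could be sketched like this. This is not a real Rally track processor; the function and the task shapes are illustrative only, mirroring the idea of interleaving sleep tasks between workflow steps based on a configured think time.

```python
def inject_think_time(steps, think_time_seconds):
    """Interleave a sleep task between consecutive workflow steps.

    steps: list of step task dicts. Returns a new list where each pair
    of adjacent steps is separated by an explicit sleep task.
    The dict shapes follow the example schedule in this issue.
    """
    expanded = []
    for i, step in enumerate(steps):
        if i > 0:
            expanded.append({
                "name": "think",
                "operation": {"operation-type": "sleep",
                              "amount": think_time_seconds},
            })
        expanded.append(step)
    return expanded

steps = [
    {"name": "open-discover", "operation-type": "composite"},
    {"name": "filter-discover", "operation-type": "composite"},
]
print([s["name"] for s in inject_think_time(steps, 2)])
# → ['open-discover', 'think', 'filter-discover']
```

Doing this in a track processor keeps AsyncExecutor unaware of think time entirely, which matches the implementation note about keeping the load generator mostly untouched.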

Example schedule

The following schedule executes a workflow named discover with the two steps open-discover and filter-discover. In between the two steps we include a user think time of two seconds. In this example it is modelled explicitly as a task, but this could be an implementation detail.

{
  "schedule": [
    {
      "workflow": {
        "name": "discover",
        "target-interval": 10,
        "steps": [
          {
            "name": "open-discover",
            "operation-type": "composite"
          },
          {
            "name": "think",
            "operation": {
              "operation-type": "sleep",
              "amount": 2
            }
          },
          {
            "name": "filter-discover",
            "operation-type": "composite"
          }
        ]
      }
    }
  ]
}
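The schedule above is plain JSON, so a quick sanity check can parse it and list the workflow's steps; the access paths simply follow the structure shown above.

```python
import json

schedule_json = """
{
  "schedule": [
    {
      "workflow": {
        "name": "discover",
        "target-interval": 10,
        "steps": [
          {"name": "open-discover", "operation-type": "composite"},
          {"name": "think",
           "operation": {"operation-type": "sleep", "amount": 2}},
          {"name": "filter-discover", "operation-type": "composite"}
        ]
      }
    }
  ]
}
"""

schedule = json.loads(schedule_json)
workflow = schedule["schedule"][0]["workflow"]
print(workflow["name"], [s["name"] for s in workflow["steps"]])
# → discover ['open-discover', 'think', 'filter-discover']
```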

Reporting - Possibility 1: Workflow not mentioned

In this example there is no mention of the workflow whatsoever. This should still be the default if no workflows are defined on a track.

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | open-discover | 32.03 | ops/s |
| Mean Throughput | open-discover | 32.03 | ops/s |
| Median Throughput | open-discover | 32.03 | ops/s |
| Max Throughput | open-discover | 32.03 | ops/s |
| 100th percentile latency | open-discover | 62.5124 | ms |
| 100th percentile service time | open-discover | 5.56526 | ms |
| 100th percentile processing time | open-discover | 6.02558 | ms |
| error rate | open-discover | 0 | % |
| Min Throughput | filter-discover | 129.62 | ops/s |
| Mean Throughput | filter-discover | 129.62 | ops/s |
| Median Throughput | filter-discover | 129.62 | ops/s |
| Max Throughput | filter-discover | 129.62 | ops/s |
| 100th percentile latency | filter-discover | 15.382 | ms |
| 100th percentile service time | filter-discover | 3.93185 | ms |
| 100th percentile processing time | filter-discover | 4.85909 | ms |
| error rate | filter-discover | 0 | % |

Reporting - Possibility 2: Workflow is mentioned implicitly

Here we prepend the workflow name to the task name.

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | discover - open-discover | 32.03 | ops/s |
| Mean Throughput | discover - open-discover | 32.03 | ops/s |
| Median Throughput | discover - open-discover | 32.03 | ops/s |
| Max Throughput | discover - open-discover | 32.03 | ops/s |
| 100th percentile latency | discover - open-discover | 62.5124 | ms |
| 100th percentile service time | discover - open-discover | 5.56526 | ms |
| 100th percentile processing time | discover - open-discover | 6.02558 | ms |
| error rate | discover - open-discover | 0 | % |
| Min Throughput | discover - filter-discover | 129.62 | ops/s |
| Mean Throughput | discover - filter-discover | 129.62 | ops/s |
| Median Throughput | discover - filter-discover | 129.62 | ops/s |
| Max Throughput | discover - filter-discover | 129.62 | ops/s |
| 100th percentile latency | discover - filter-discover | 15.382 | ms |
| 100th percentile service time | discover - filter-discover | 3.93185 | ms |
| 100th percentile processing time | discover - filter-discover | 4.85909 | ms |
| error rate | discover - filter-discover | 0 | % |

Reporting - Possibility 3: Workflow is mentioned explicitly

We add a specific column for the workflow but don't report any top-level metrics.

| Metric | Workflow | Task | Value | Unit |
| --- | --- | --- | --- | --- |
| Min Throughput | discover | open-discover | 32.03 | ops/s |
| Mean Throughput | discover | open-discover | 32.03 | ops/s |
| Median Throughput | discover | open-discover | 32.03 | ops/s |
| Max Throughput | discover | open-discover | 32.03 | ops/s |
| 100th percentile latency | discover | open-discover | 62.5124 | ms |
| 100th percentile service time | discover | open-discover | 5.56526 | ms |
| 100th percentile processing time | discover | open-discover | 6.02558 | ms |
| error rate | discover | open-discover | 0 | % |
| Min Throughput | discover | filter-discover | 129.62 | ops/s |
| Mean Throughput | discover | filter-discover | 129.62 | ops/s |
| Median Throughput | discover | filter-discover | 129.62 | ops/s |
| Max Throughput | discover | filter-discover | 129.62 | ops/s |
| 100th percentile latency | discover | filter-discover | 15.382 | ms |
| 100th percentile service time | discover | filter-discover | 3.93185 | ms |
| 100th percentile processing time | discover | filter-discover | 4.85909 | ms |
| error rate | discover | filter-discover | 0 | % |

Reporting - Possibility 4: Workflow is mentioned explicitly and includes metrics

Here we report top-level metrics (throughput and latency) only for the workflow, and only service time and related metrics for the individual steps.

| Metric | Workflow | Task | Value | Unit |
| --- | --- | --- | --- | --- |
| Min Throughput | discover | | 32.03 | ops/s |
| Mean Throughput | discover | | 32.03 | ops/s |
| Median Throughput | discover | | 32.03 | ops/s |
| Max Throughput | discover | | 32.03 | ops/s |
| 100th percentile latency | discover | | 62.5124 | ms |
| 100th percentile service time | discover | | 59.56526 | ms |
| 100th percentile processing time | discover | | 60.02558 | ms |
| error rate | discover | | 0 | % |
| 100th percentile service time | discover | open-discover | 5.56526 | ms |
| 100th percentile processing time | discover | open-discover | 6.02558 | ms |
| error rate | discover | open-discover | 0 | % |
| 100th percentile service time | discover | filter-discover | 3.93185 | ms |
| 100th percentile processing time | discover | filter-discover | 4.85909 | ms |
| error rate | discover | filter-discover | 0 | % |

danielmitterdorfer · Feb 10 '21, 12:02