Support user workflows
Scope
So far Rally has only been able to issue individual API requests against Elasticsearch and measure per-request metrics. In this issue we describe how Rally should be extended to support benchmarking entire user workflows, which allows for more realistic benchmarks based on end-user experience.
Example Scenario
Consider a user that interacts with Kibana's "Discover" view. They might execute the following steps:
- Open discover view
- Expand the date range
- Add a custom filter
Each of these steps might lead to multiple requests against Elasticsearch, and between steps the user takes a little while to inspect the returned data ("user think time"). When running a benchmark we want to specify:
- How many users are executing a workflow concurrently and at what rate (target throughput).
- The steps that are executed. As one step in the UI typically leads to multiple requests against Elasticsearch, the `composite` operation type is usually suitable to model a single step.
- The user think time. In a first step this could be a constant, but we might want to allow varying it based on a distribution.
We should report the following metrics:
- Achieved throughput for the entire workflow. Reporting throughput for individual steps is not helpful because throttling happens at the workflow level, so the throughput of individual steps is determined by their service time.
- Service time, latency and error rate for the entire workflow.
- Service time and error rate per step.
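One plausible way to derive the workflow-level numbers from per-step samples is sketched below. It assumes that a workflow iteration counts as failed if any of its steps failed, and that workflow service time is the sum of the step service times (excluding think time); both are design decisions, not settled here.

```python
def aggregate_workflow_sample(step_samples: list[dict]) -> dict:
    """Collapse the per-step samples of one workflow iteration into a
    single workflow-level sample (hypothetical aggregation rules)."""
    return {
        # Workflow service time: time spent in Elasticsearch across all
        # steps of this iteration; think time is deliberately excluded.
        "service_time": sum(s["service_time"] for s in step_samples),
        # A workflow iteration is an error if any of its steps errored.
        "error": any(s["error"] for s in step_samples),
    }
```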
Subtasks
In order to support this scenario we need to extend Rally in the following areas:
- Introduce a new concept of a "workflow" for tracks. In the domain model, a workflow is a special kind of task. It can contain other tasks (the individual steps that a user executes), but a workflow can also be nested in a `parallel` element to allow running multiple workflows concurrently.
- Introduce a concept of user think time. At this point it is not certain whether we need to treat this as a separate concept or whether we just inject `sleep` tasks between steps with a track processor, based on the specified user think time.
- Enhance the load generator to handle workflows. Implementation note: if we enhance `ScheduleHandle` to handle workflows, it can probably just emit the individual workflow steps and `AsyncExecutor` can stay mostly untouched, which would avoid a lot of complexity.
- Add workflow information to metrics documents.
- Enhance reporting to include workflows. See below for some possibilities of how that might look.
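If we go the track-processor route, the injection itself is straightforward. A minimal sketch, assuming steps are plain task dictionaries shaped like those in the example schedule (the function name and signature are hypothetical, not an existing Rally API):

```python
def inject_think_time(steps: list[dict], think_time_seconds: float) -> list[dict]:
    """Return a copy of the step list with a sleep task inserted between
    every pair of consecutive steps (sketch of a possible track processor)."""
    think_task = {
        "name": "think",
        "operation": {"operation-type": "sleep", "amount": think_time_seconds},
    }
    result: list[dict] = []
    for i, step in enumerate(steps):
        if i > 0:
            # Each injected task gets its own copy so later mutation of one
            # sleep task cannot affect the others.
            result.append(dict(think_task))
        result.append(step)
    return result
```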
Example schedule
The following schedule executes a workflow named `discover` with the two steps `open-discover` and `filter-discover`. Between the two steps we include a user think time of two seconds. In this example it is modelled explicitly as a task, but this could be an implementation detail.
```json
{
  "schedule": [
    {
      "workflow": {
        "name": "discover",
        "target-interval": 10,
        "steps": [
          {
            "name": "open-discover",
            "operation-type": "composite"
          },
          {
            "name": "think",
            "operation": {
              "operation-type": "sleep",
              "amount": 2
            }
          },
          {
            "name": "filter-discover",
            "operation-type": "composite"
          }
        ]
      }
    }
  ]
}
```
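Reading `target-interval: 10` as "start a new workflow iteration every 10 seconds" (an assumption; the exact semantics are still open), workflow-level throttling could look roughly like this:

```python
import time


def run_paced(run_iteration, target_interval: float, iterations: int) -> None:
    """Start a workflow iteration every target_interval seconds. Because
    throttling happens at the workflow level, per-step throughput falls
    out of each step's service time, not out of this loop."""
    next_start = time.monotonic()
    for _ in range(iterations):
        delay = next_start - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        run_iteration()
        # Schedule relative to the planned start, not the actual one, so a
        # slow iteration does not permanently shift the whole schedule.
        next_start += target_interval
```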
Reporting - Possibility 1: Workflow not mentioned
In this example there is no mention of the workflow whatsoever. This should still be the default if no workflows are defined on a track.
Metric | Task | Value | Unit |
---|---|---|---|
Min Throughput | open-discover | 32.03 | ops/s |
Mean Throughput | open-discover | 32.03 | ops/s |
Median Throughput | open-discover | 32.03 | ops/s |
Max Throughput | open-discover | 32.03 | ops/s |
100th percentile latency | open-discover | 62.5124 | ms |
100th percentile service time | open-discover | 5.56526 | ms |
100th percentile processing time | open-discover | 6.02558 | ms |
error rate | open-discover | 0 | % |
Min Throughput | filter-discover | 129.62 | ops/s |
Mean Throughput | filter-discover | 129.62 | ops/s |
Median Throughput | filter-discover | 129.62 | ops/s |
Max Throughput | filter-discover | 129.62 | ops/s |
100th percentile latency | filter-discover | 15.382 | ms |
100th percentile service time | filter-discover | 3.93185 | ms |
100th percentile processing time | filter-discover | 4.85909 | ms |
error rate | filter-discover | 0 | % |
Reporting - Possibility 2: Workflow is mentioned implicitly
Here we add the workflow name before the task name.
Metric | Task | Value | Unit |
---|---|---|---|
Min Throughput | discover - open-discover | 32.03 | ops/s |
Mean Throughput | discover - open-discover | 32.03 | ops/s |
Median Throughput | discover - open-discover | 32.03 | ops/s |
Max Throughput | discover - open-discover | 32.03 | ops/s |
100th percentile latency | discover - open-discover | 62.5124 | ms |
100th percentile service time | discover - open-discover | 5.56526 | ms |
100th percentile processing time | discover - open-discover | 6.02558 | ms |
error rate | discover - open-discover | 0 | % |
Min Throughput | discover - filter-discover | 129.62 | ops/s |
Mean Throughput | discover - filter-discover | 129.62 | ops/s |
Median Throughput | discover - filter-discover | 129.62 | ops/s |
Max Throughput | discover - filter-discover | 129.62 | ops/s |
100th percentile latency | discover - filter-discover | 15.382 | ms |
100th percentile service time | discover - filter-discover | 3.93185 | ms |
100th percentile processing time | discover - filter-discover | 4.85909 | ms |
error rate | discover - filter-discover | 0 | % |
Reporting - Possibility 3: Workflow is mentioned explicitly
We add a specific column for the workflow but don't report any top-level metrics.
Metric | Workflow | Task | Value | Unit |
---|---|---|---|---|
Min Throughput | discover | open-discover | 32.03 | ops/s |
Mean Throughput | discover | open-discover | 32.03 | ops/s |
Median Throughput | discover | open-discover | 32.03 | ops/s |
Max Throughput | discover | open-discover | 32.03 | ops/s |
100th percentile latency | discover | open-discover | 62.5124 | ms |
100th percentile service time | discover | open-discover | 5.56526 | ms |
100th percentile processing time | discover | open-discover | 6.02558 | ms |
error rate | discover | open-discover | 0 | % |
Min Throughput | discover | filter-discover | 129.62 | ops/s |
Mean Throughput | discover | filter-discover | 129.62 | ops/s |
Median Throughput | discover | filter-discover | 129.62 | ops/s |
Max Throughput | discover | filter-discover | 129.62 | ops/s |
100th percentile latency | discover | filter-discover | 15.382 | ms |
100th percentile service time | discover | filter-discover | 3.93185 | ms |
100th percentile processing time | discover | filter-discover | 4.85909 | ms |
error rate | discover | filter-discover | 0 | % |
Reporting - Possibility 4: Workflow is mentioned explicitly and includes metrics
Here we report top-level metrics (throughput, latency) only for the workflow as a whole, and only service time and related metrics for the individual steps.
Metric | Workflow | Task | Value | Unit |
---|---|---|---|---|
Min Throughput | discover | | 32.03 | ops/s |
Mean Throughput | discover | | 32.03 | ops/s |
Median Throughput | discover | | 32.03 | ops/s |
Max Throughput | discover | | 32.03 | ops/s |
100th percentile latency | discover | | 62.5124 | ms |
100th percentile service time | discover | | 59.56526 | ms |
100th percentile processing time | discover | | 60.02558 | ms |
error rate | discover | | 0 | % |
100th percentile service time | discover | open-discover | 5.56526 | ms |
100th percentile processing time | discover | open-discover | 6.02558 | ms |
error rate | discover | open-discover | 0 | % |
100th percentile service time | discover | filter-discover | 3.93185 | ms |
100th percentile processing time | discover | filter-discover | 4.85909 | ms |
error rate | discover | filter-discover | 0 | % |