Support user workflows
Scope
So far Rally has only been able to issue individual API requests against Elasticsearch and measure per-request metrics. In this issue we describe how Rally should be extended to support benchmarking entire user workflows, which allows for more realistic benchmarks based on end-user experience.
Example Scenario
Consider a user that interacts with Kibana's "Discover" view. They might execute the following steps:
- Open discover view
- Expand the date range
- Add a custom filter
Each of these steps might lead to multiple requests against Elasticsearch, and between steps the user takes a little while to inspect the returned data ("user think time"). When running a benchmark we want to specify:
- How many users are executing a workflow concurrently and at what rate (target throughput).
- The steps that are executed. As one step in the UI typically leads to multiple requests against Elasticsearch, the `composite` operation type is usually suitable to model a single step.
- The user think time. In a first step this could be a constant, but we might want to allow varying it based on a distribution.
We should report the following metrics:
- Achieved throughput for the entire workflow. Reporting throughput for individual steps is not helpful because throttling happens at the workflow level, so the throughput of individual steps is determined by their service time.
- Service time, latency and error rate for the entire workflow.
- Service time and error rate per step.
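One plausible way to derive the workflow-level numbers from per-step samples is sketched below. It assumes that a workflow iteration counts as failed if any of its steps failed, and that workflow service time is the sum of the step service times (excluding think time); both are design decisions, not settled here.

```python
def aggregate_workflow_sample(step_samples: list[dict]) -> dict:
    """Collapse the per-step samples of one workflow iteration into a
    single workflow-level sample (hypothetical aggregation rules)."""
    return {
        # Workflow service time: time spent in Elasticsearch across all
        # steps of this iteration; think time is deliberately excluded.
        "service_time": sum(s["service_time"] for s in step_samples),
        # A workflow iteration is an error if any of its steps errored.
        "error": any(s["error"] for s in step_samples),
    }
```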
Subtasks
In order to support this scenario we need to extend Rally in the following areas:
- Introduce a new concept of a "workflow" for tracks. In the domain model, a workflow is a special kind of task. It can contain other tasks (the individual steps that a user executes), but a workflow can also be nested in a `parallel` element to allow running multiple workflows concurrently.
- Introduce a concept of user think time. At this point it is not certain whether we need to treat this as a separate concept or whether we just inject `sleep` tasks between steps with a track processor, based on the specified user think time.
- Enhance the load generator to handle workflows. Implementation note: if we enhance `ScheduleHandle` to handle workflows, it can probably just emit the individual workflow steps and `AsyncExecutor` can stay mostly untouched, which would avoid a lot of complexity.
- Add workflow information to metrics documents.
- Enhance reporting to include workflows. See below for some possibilities of how that might look.
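If we go the track-processor route, the injection itself is straightforward. A minimal sketch, assuming steps are plain task dictionaries shaped like those in the example schedule (the function name and signature are hypothetical, not an existing Rally API):

```python
def inject_think_time(steps: list[dict], think_time_seconds: float) -> list[dict]:
    """Return a copy of the step list with a sleep task inserted between
    every pair of consecutive steps (sketch of a possible track processor)."""
    think_task = {
        "name": "think",
        "operation": {"operation-type": "sleep", "amount": think_time_seconds},
    }
    result: list[dict] = []
    for i, step in enumerate(steps):
        if i > 0:
            # Each injected task gets its own copy so later mutation of one
            # sleep task cannot affect the others.
            result.append(dict(think_task))
        result.append(step)
    return result
```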
Example schedule
The following schedule executes a workflow named `discover` with the two steps `open-discover` and `filter-discover`. Between the two steps we include a user think time of two seconds. In this example it is modelled explicitly as a task, but this could be an implementation detail.
```json
{
  "schedule": [
    {
      "workflow": {
        "name": "discover",
        "target-interval": 10,
        "steps": [
          {
            "name": "open-discover",
            "operation-type": "composite"
          },
          {
            "name": "think",
            "operation": {
              "operation-type": "sleep",
              "amount": 2
            }
          },
          {
            "name": "filter-discover",
            "operation-type": "composite"
          }
        ]
      }
    }
  ]
}
```
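Reading `target-interval: 10` as "start a new workflow iteration every 10 seconds" (an assumption; the exact semantics are still open), workflow-level throttling could look roughly like this:

```python
import time


def run_paced(run_iteration, target_interval: float, iterations: int) -> None:
    """Start a workflow iteration every target_interval seconds. Because
    throttling happens at the workflow level, per-step throughput falls
    out of each step's service time, not out of this loop."""
    next_start = time.monotonic()
    for _ in range(iterations):
        delay = next_start - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        run_iteration()
        # Schedule relative to the planned start, not the actual one, so a
        # slow iteration does not permanently shift the whole schedule.
        next_start += target_interval
```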
Reporting - Possibility 1: Workflow not mentioned
In this example there is no mention of the workflow whatsoever. This should still be the default if no workflows are defined on a track.
Metric | Task | Value | Unit |
---|---|---|---|
Min Throughput | open-discover | 32.03 | ops/s |
Mean Throughput | open-discover | 32.03 | ops/s |
Median Throughput | open-discover | 32.03 | ops/s |
Max Throughput | open-discover | 32.03 | ops/s |
100th percentile latency | open-discover | 62.5124 | ms |
100th percentile service time | open-discover | 5.56526 | ms |
100th percentile processing time | open-discover | 6.02558 | ms |
error rate | open-discover | 0 | % |
Min Throughput | filter-discover | 129.62 | ops/s |
Mean Throughput | filter-discover | 129.62 | ops/s |
Median Throughput | filter-discover | 129.62 | ops/s |
Max Throughput | filter-discover | 129.62 | ops/s |
100th percentile latency | filter-discover | 15.382 | ms |
100th percentile service time | filter-discover | 3.93185 | ms |
100th percentile processing time | filter-discover | 4.85909 | ms |
error rate | filter-discover | 0 | % |
Reporting - Possibility 2: Workflow is mentioned implicitly
Here we add the workflow name before the task name.
Metric | Task | Value | Unit |
---|---|---|---|
Min Throughput | discover - open-discover | 32.03 | ops/s |
Mean Throughput | discover - open-discover | 32.03 | ops/s |
Median Throughput | discover - open-discover | 32.03 | ops/s |
Max Throughput | discover - open-discover | 32.03 | ops/s |
100th percentile latency | discover - open-discover | 62.5124 | ms |
100th percentile service time | discover - open-discover | 5.56526 | ms |
100th percentile processing time | discover - open-discover | 6.02558 | ms |
error rate | discover - open-discover | 0 | % |
Min Throughput | discover - filter-discover | 129.62 | ops/s |
Mean Throughput | discover - filter-discover | 129.62 | ops/s |
Median Throughput | discover - filter-discover | 129.62 | ops/s |
Max Throughput | discover - filter-discover | 129.62 | ops/s |
100th percentile latency | discover - filter-discover | 15.382 | ms |
100th percentile service time | discover - filter-discover | 3.93185 | ms |
100th percentile processing time | discover - filter-discover | 4.85909 | ms |
error rate | discover - filter-discover | 0 | % |
Reporting - Possibility 3: Workflow is mentioned explicitly
We add a specific column for the workflow but don't report any top-level metrics.
Metric | Workflow | Task | Value | Unit |
---|---|---|---|---|
Min Throughput | discover | open-discover | 32.03 | ops/s |
Mean Throughput | discover | open-discover | 32.03 | ops/s |
Median Throughput | discover | open-discover | 32.03 | ops/s |
Max Throughput | discover | open-discover | 32.03 | ops/s |
100th percentile latency | discover | open-discover | 62.5124 | ms |
100th percentile service time | discover | open-discover | 5.56526 | ms |
100th percentile processing time | discover | open-discover | 6.02558 | ms |
error rate | discover | open-discover | 0 | % |
Min Throughput | discover | filter-discover | 129.62 | ops/s |
Mean Throughput | discover | filter-discover | 129.62 | ops/s |
Median Throughput | discover | filter-discover | 129.62 | ops/s |
Max Throughput | discover | filter-discover | 129.62 | ops/s |
100th percentile latency | discover | filter-discover | 15.382 | ms |
100th percentile service time | discover | filter-discover | 3.93185 | ms |
100th percentile processing time | discover | filter-discover | 4.85909 | ms |
error rate | discover | filter-discover | 0 | % |
Reporting - Possibility 4: Workflow is mentioned explicitly and includes metrics
Here we report top-level metrics (throughput, latency) only for the workflow as a whole, and only service time and related metrics for the individual steps.
Metric | Workflow | Task | Value | Unit |
---|---|---|---|---|
Min Throughput | discover | | 32.03 | ops/s |
Mean Throughput | discover | | 32.03 | ops/s |
Median Throughput | discover | | 32.03 | ops/s |
Max Throughput | discover | | 32.03 | ops/s |
100th percentile latency | discover | | 62.5124 | ms |
100th percentile service time | discover | | 59.56526 | ms |
100th percentile processing time | discover | | 60.02558 | ms |
error rate | discover | | 0 | % |
100th percentile service time | discover | open-discover | 5.56526 | ms |
100th percentile processing time | discover | open-discover | 6.02558 | ms |
error rate | discover | open-discover | 0 | % |
100th percentile service time | discover | filter-discover | 3.93185 | ms |
100th percentile processing time | discover | filter-discover | 4.85909 | ms |
error rate | discover | filter-discover | 0 | % |