apm Ability to specify custom rules to set `event.outcome`

Is your feature request related to a problem? Please describe. A transaction representing a service invocation success or failure can't always be defined by a return status. Sometimes, HTTP response code 2xx is not a sufficient indication of success and there is a need to explore additional attributes to be able to confidently set event.outcome correctly. The same applies for designated a transaction as a failure.

Describe the solution you'd like As a service developer, I want to be able to define more complex rules to set event.outcome to success or failure based on the additional information either available as transaction attributes, or through an API.

Describe alternatives you've considered Using public APIs is possible, but negates the value of autodetection completely.

Additional context Without this ability, the error rate indication in the APM UI is not showing the correct picture of failures.

Aug 11 '21 06:08 michaelhyatt

Our tracing specs, both for spans and transactions, already encourage (but not mandate) agents to expose such an API.

The Java agent, for example, offers the setOutcome, which I believe is what you are looking for. If this is required for a specific agent that doesn't offer such API, it's better to open an issue in the corresponding agent repo.

Using public APIs is possible, but negates the value of autodetection completely.

This is right, but since you can do that wherever you want in your code, most likely you have the relevant info (e.g. status code), right?

Aug 11 '21 07:08 eyalkoren

Thanks @eyalkoren

This and other use cases can be solved by allowing callback functions to be executed at different stages of a Transaction and Span lifecycles. This feature should help enable some additional useful use cases such as defining event.outcome and collecting additional custom metadata (attributes/labels) while not changing the original application code relying on auto instrumentation.

Aug 11 '21 22:08 michaelhyatt

OK, I read it as if a public API for setting the outcome is a valid alternative that you may not be aware of.

Aug 12 '21 03:08 eyalkoren

I just created https://github.com/elastic/apm/issues/489 and then stumbled upon this which sounds like exactly the same. I was recently talking to @felixbarny, @axw and @Mpdreamz about the ability to customize the logic for setting event.outcome=failure without requiring manual instrumentation.

My proposal is to have a config option for configuring this. This could be in the form of a regex that matches against http.response.status_code. Eg ^([4-9]\d{2})$

Aug 17 '21 10:08 sorenlouv

allowing callback functions to be executed at different stages of a Transaction and Span lifecycles

@michaelhyatt several agents have the concept of filters or processors which sounds like that may solve the use case.

Could you describe some of the use cases for that? Why is the built-in logic not good enough in some cases? What are some concrete custom rules you'd want to add?

@sqren while closely related, the use case you're mentioning is a little different in that it's not about adding arbitrarily complex rules. Instead, it's about whether or not client-side errors should be added to the error rate calculation. A config for HTTP status codes is a bit too limiting, I think because it wouldn't work for protocols like gRPC, for example. But we could have a boolean flag that determines whether client-side errors should be considered a success or failure. However, for that use case, it may be enough to re-introduce the throughput by result (HTTP 2xx, HTTP 4xx, HTTP 5xx).

Aug 17 '21 12:08 felixbarny

we could have a boolean flag that determines whether client-side errors should be considered a success or failure.

I think this would be a great start.

Aug 17 '21 13:08 sorenlouv

I guess my PoV is really rooted in seeing a lot of the implementations that show error rates as blanks despite having errors. So, I am thinking about fixing the way we capture event.outcome in a declarative and easy way with good sensible defaults is the first step. Then, we can probably enhance it with more complex logic to determine whether the event is an error and determining the type of the error.

Examples:

Aug 18 '21 04:08 michaelhyatt

The empty error rate chart could mean that you're using an old version of the agent that doesn't set the outcome yet. I assume these services are talking over gRPC?

More generally, in some cases, it can be a valid situation that the agent captures errors but that the transaction error rate is 0. In other words, even if an error is captured within a transaction, the whole transaction may still be successful. If the server just logs an error and takes compensating actions, the transaction is not necessarily a failure. For example, if you're requesting a product page in an online shop and the call to the recommendation service fails, the page may still render fine with a 200.

Aug 18 '21 07:08 felixbarny

@felixbarny yes, some of them are gRPC and the agents can be a bit old.

Aug 18 '21 07:08 michaelhyatt

Assuming that updating the agents fix the issue, would you still need custom rules to set the outcome? Do you have specific use cases in mind? Do the processor/filter APIs and the set outcome APIs fulfill your needs? If not, why and what would we need to change in your view?

Aug 18 '21 07:08 felixbarny

Hi folks! @sqren dropped me a link to this because Dream Machine is also interested in having additional response codes be marked as failures, and he and Casper were so kind as to hop on a Zoom with me recently to watch me stumble around APM, teach me a few things, and collect feedback/observations, and this was something that came up.

I think if I wanted to do this on my own, I'd be looking at writing up some new Express middleware function that I could find a way get inserted near the end of the request chain (after the work is done, but before APM sends the transaction data) so that I could examine the response and decide to manually override the event outcome.

I could see Soren's regex concept being a great bandage to get some form of this capability out there, but I think if you can let the user define their own evaluation function that APM coordinates running at an appropriate timing during the request chain, then it gets really nice.

I could see it working by optionally defining a evaluateTransactionOutcome or finalizeTransaction function that I can pass in as a configuration field when I call apm.start() so that instead of me having to plumb this into some middleware, APM provides the hook for me to just define that middleware function, and it takes care of when to run that middleware.

The power of letting the user have a function instead of just defining it by codes, is that the developer can now consider the status code AND something else about the state of the response in deciding success or failure. That said, for now my use case feels strictly based on the code value, but I could imagine having a more complex decision tree at some point.

Aug 20 '21 13:08 andymjames

so that I could examine the response and decide to manually override the event outcome.

Would you need the web framework's response object for that or are the properties that APM collects about the response enough? If you need the response object, I think writing a custom Express middleware function is actually the way to go. If you just need the APM transaction, you could either also use a middleware function or a transaction filter.

That said, for now my use case feels strictly based on the code value

For you current use case, how would you like the outcome to be mapped? Is it just that 4xx should count as an error for the server-side transaction (not just 5xx)? Just being curious, what's the specific cause of the errors in your app? Are you using the RUM agent? If so, are the RUM transactions marked as failure?

One of the reasons why we didn't map 4xx to failure is that it's not the server's "fault" and a rogue client could just artificially increase the error rate by calling a non-existing API (404). Also, the transaction rate graph used to show the throughput grouped by HTTP 2xx/4xx/5xx which lets you also analyze 4xx specifically. If the client app is traced, it would probably show an increased error rate, unless it takes a compensating action. For example, first tries to call a new version of an endpoint, such as /orders/v2 and fall back to /orders/v1 if v2 is not yet available.

I'm not trying to dismiss your use case, just trying to understand it better, explain the reasoning of the current state, and see if the existing APIs are already covering your use case 🙂

Aug 24 '21 15:08 felixbarny