apm-agent-ruby
Feature request - logging of slow requests
I've been using elastic-apm for several weeks now, and it's great at giving me high-level metrics, but it's not very useful for troubleshooting pathological cases.
I'd like elastic-apm to have a configurable runtime threshold, and "sample" any request which runs for longer than that threshold. I realize that this is not how the project is currently designed, architecturally, but samples are way more important for the slow, infrequent cases than the common fast case. When I have slow actions, I want to be able to figure out why they were slow.
Is there anything like that on the horizon?
Hi @cheald. Glad to hear that you are finding APM useful!
We already do a few things to help with what you are describing: First, there's the Impact of the request. This is the default sorting value of requests (or transactions) and it should help with finding the endpoints that are most popular and the slowest. If you just want slowest, you can sort by avg. resp. time.
Second, if what you want is to sample only the requests taking longer than x ms, then you are right, we don't have that. There's `span_frames_min_duration`, but that works on the individual span and not on the transaction as a whole. ("Sample" in this context means: include stack traces and source code.)
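For reference, `span_frames_min_duration` is set like any other agent option, e.g. when starting the agent manually (the value shown here is just an example, not a recommendation):

```ruby
require 'elastic_apm'

# Only collect stack frames for spans that run longer than the given
# duration; '10ms' is an arbitrary example value.
ElasticAPM.start(
  span_frames_min_duration: '10ms'
)
```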
I haven't thought about adding a (kind of) `transaction_min_duration` – is this what you are after?
I might be misunderstanding you – if so, please do tell 😄
Yeah, finding the highest impact transactions in aggregate works great - no complaints there. My desired use case here is to get telemetry on individual pathological transactions. For example, we had an issue today where a query plan got messed up, and a low-traffic endpoint ended up with transactions taking several orders of magnitude longer than usual. However, it was low-traffic enough that we never got lucky with a sample of one of the slow transactions, so I couldn't use APM to figure out which query was causing the problems.
> `span_frames_min_duration`

This is in the right direction, but it's not exactly what I want. In this case, I want to see all spans for any transaction whose total time is greater than X ms.
> I haven't thought about adding a (kind of) `transaction_min_duration` if this is what you are after?
Exactly. My desired use case would be something like NewRelic's `transaction_tracer.transaction_threshold` setting, which causes transactions slower than a particular threshold to always be logged.
I realize that the agent doesn't collect span data unless it's sampling, so it's not possible to just measure the transaction time and flip that flag after the fact. I think the simplest path forward would be to sample all transactions, but to only send the usual x% (plus anything slower than Y ms) upstream. This may impose an unacceptable performance penalty, though.
I see your point. And you're right that there's not much to do as long as we are using the sample rate.
We could add an `always_sample_endpoint_patterns` option, letting you add exceptions to the sample rate based on `Transaction#name` or whatever makes sense (`/endpoint`?). That'd take a few manual steps and isn't automatic, but at least, if you know the endpoint you want to investigate, it could help with the unlucky sample collection?
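Roughly, such an option might be matched against transaction names like this (the option name, the pattern list, and the helper below are all invented for illustration – nothing here is actual agent code):

```ruby
# Hypothetical: force sampling for transactions whose name matches a
# configured pattern. The constant and method names are made up.
ALWAYS_SAMPLE_ENDPOINT_PATTERNS = [
  %r{\AReportsController#}, # every action on a known-slow controller
  %r{/admin/exports}        # or match on a request path instead
].freeze

def force_sample?(transaction_name)
  ALWAYS_SAMPLE_ENDPOINT_PATTERNS.any? { |pattern| pattern.match?(transaction_name) }
end
```

The agent would then sample a transaction whenever `force_sample?` returns true, regardless of the configured sample rate.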
This might be a little overcomplicated, but what about something like:
- Track the execution times for all transactions
- When a transaction takes longer than a configured threshold, add it to a temporary "high interest" list, perhaps for a configured duration
- "Sample" all transactions for that controller/action for as long as it's in the "high interest" list
- Log all high-interest transactions that exceed the configured threshold (or that pass the natural sample rate).
That would permit the APM client to adaptively become more aggressive for slow actions without having to incur the overhead of sampling everything all the time. It would miss the first slow transaction, but if it's a repeated pattern, this would make it easier to capture.
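The steps above could be sketched roughly like this (all names, the TTL mechanism, and the numbers are invented for illustration; this is not agent code):

```ruby
# Hypothetical adaptive sampler: actions that produce a slow transaction
# get added to a "high interest" list for a while, during which every
# transaction for that action is fully sampled.
class AdaptiveSampler
  def initialize(threshold_ms:, interest_ttl: 300)
    @threshold_ms = threshold_ms
    @interest_ttl = interest_ttl # seconds an action stays "high interest"
    @high_interest = {}          # transaction name => expiry time
  end

  # Called after every transaction finishes, with its name and duration.
  def record(name, duration_ms)
    return unless duration_ms > @threshold_ms

    @high_interest[name] = Time.now + @interest_ttl
  end

  # Should the next transaction for this action be fully sampled?
  def sample?(name)
    expiry = @high_interest[name]
    return false unless expiry

    if Time.now > expiry
      @high_interest.delete(name) # interest window elapsed, forget it
      false
    else
      true
    end
  end
end

sampler = AdaptiveSampler.new(threshold_ms: 500)
sampler.record('UsersController#index', 1200)     # slow, flags the action
sampler.sample?('UsersController#index')          # => true for the next TTL
```

As noted, the first slow transaction is only recorded, not sampled; it's the repeat offenders this would catch.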
That's exactly what I suggested in our company Slack 😄 This would be nice. But something I'd postpone to at least 1.1.
If you're feeling adventurous, PRs are very welcome!