Scaling transaction trace storage with the number of endpoints

Open itsderek23 opened this issue 7 years ago • 4 comments

Currently we store a maximum of 10 transaction traces per app, per minute.

This has issues in the following scenarios:

  1. An app has a large number of web endpoints. For example, if an app has 1k unique endpoints and we collect 10 traces, one per endpoint, we cover at most 1% of endpoints in a given minute. If an app has 100 endpoints, we cover at most 10%.

  2. Zooming into a small time slice (i.e. 5 minutes) to examine an outlier. There are fewer traces to examine in a small period of time. This is more obvious when an app has a large number of endpoints.

Look at increasing the number of transaction traces we store, basing it on a percentage of the number of uniquely named endpoints in an app, with a reasonable max. Initially this can just be a server-side change.
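A minimal sketch of how such a cap could be computed, to make the arithmetic concrete. The function names, the 10% figure, and the ceiling of 100 are hypothetical and not taken from the actual implementation. Only the floor of 10 reflects the current fixed limit.

```python
def trace_limit(unique_endpoints: int, percent: int = 10,
                floor: int = 10, ceiling: int = 100) -> int:
    """Per-minute trace cap: a percentage of unique endpoints, clamped.

    percent and ceiling are illustrative assumptions; floor mirrors the
    current fixed limit of 10 traces per app, per minute.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    scaled = (unique_endpoints * percent + 99) // 100
    return max(floor, min(ceiling, scaled))


def best_case_coverage(unique_endpoints: int, limit: int) -> float:
    """Share of endpoints that can receive one trace in a minute, at best."""
    if unique_endpoints <= 0:
        return 0.0
    return min(1.0, limit / unique_endpoints)


if __name__ == "__main__":
    for endpoints in (100, 300, 1000, 5000):
        limit = trace_limit(endpoints)
        print(endpoints, limit, best_case_coverage(endpoints, limit))
```

With these assumed values, small apps keep the existing limit of 10, and the ceiling keeps storage bounded for the few apps with thousands of endpoints.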

itsderek23 avatar Jun 15 '18 19:06 itsderek23

Do we have any data about the number of endpoints across apps?

dlanderson avatar Jun 18 '18 17:06 dlanderson

The histogram peak is at ~300 endpoints, trailing off pretty fast after that with only a very small handful of customers having more than 1000 endpoints.

cschneid avatar Jun 18 '18 17:06 cschneid

This change has been deployed for a couple of accounts. Additionally, we're tracking analytics on how often we return zero traces at key interactions.

From spot-checking data, I'm not seeing a significant improvement, esp. on the database query list. This may be caused by not including a "% time consumed" dimension in our trace scoring algorithm (we include the response time). For example, in one app I found zero traces collected over a 1-hour period for the top 8 most time-consuming endpoints (likely the endpoints you would most want to access).

Expensive queries are more likely to be called from expensive endpoints.

Two thoughts:

  1. Incorporate a "time consumed" dimension in our algorithm
  2. If zero traces are found, fetch over a longer period with a warning to the user. Return something if we can. It's common for behavior to repeat itself.
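The second idea could be sketched roughly as follows. This is an illustration only: `fetch_with_fallback`, the `fetch` callback, and the widening factors are hypothetical names and values, not the production code.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Sequence, Tuple


def fetch_with_fallback(
    fetch: Callable[[datetime, datetime], List],
    start: datetime,
    end: datetime,
    widen_factors: Sequence[int] = (1, 6, 24),  # assumed values
) -> Tuple[List, bool]:
    """Return (traces, widened).

    Tries the selected window first, then progressively wider windows
    centered on it. `widened` is True when the traces came from a wider
    window, so the caller can show a warning to the user.
    """
    span = end - start
    for factor in widen_factors:
        # Grow the window symmetrically to `factor` times its original size.
        extra = (span * factor - span) / 2
        traces = fetch(start - extra, end + extra)
        if traces:
            return traces, factor > 1
    # Nothing found even in the widest window.
    return [], False
```

The caller decides how to phrase the warning. The function only reports whether the result came from outside the selected timeframe.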

Generally: when a transaction is collected from a low-volume endpoint and the response time is fast / moderate, it's less likely to be acted upon. It's just not that significant. Very slow requests are still interesting (and we account for that).
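To illustrate how a "time consumed" dimension might change which traces are kept, here is a rough sketch. The scoring formula, the weights, the `slow_ms` threshold, and all names are assumptions made for illustration. The real scoring algorithm is not described in this thread beyond the fact that it includes response time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    endpoint: str
    response_time_ms: float   # duration of this individual request
    endpoint_total_ms: float  # total time spent in this endpoint in the window


def score(c: Candidate, app_total_ms: float,
          slow_ms: float = 2000.0, time_weight: float = 1.0) -> float:
    """Hypothetical score: request slowness plus the endpoint's share of app time."""
    # Slowness saturates at 1.0 so very slow requests always rank high.
    slowness = min(c.response_time_ms / slow_ms, 1.0)
    # Share of total app time consumed by this request's endpoint.
    share = c.endpoint_total_ms / app_total_ms if app_total_ms > 0 else 0.0
    return slowness + time_weight * share


def pick_traces(candidates: List[Candidate], app_total_ms: float,
                limit: int) -> List[Candidate]:
    """Keep the `limit` highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda c: score(c, app_total_ms), reverse=True)
    return ranked[:limit]
```

Under this sketch, a moderate request from a busy endpoint outranks a moderate request from a rarely hit one, while a very slow request still ranks first regardless of volume.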

itsderek23 avatar Jun 19 '18 21:06 itsderek23

We've deployed an update to address:

Zooming into a small time slice (i.e. 5 minutes) to examine an outlier. There are fewer traces to examine in a small period of time. This is more obvious when an app has a large number of endpoints.

2 areas:

  1. When clicking on a db query, this increases the timeframe if no traces are found in the selected timeframe:

image

  2. When viewing traces on an endpoint or background job, when not zoomed in, the timeframe is also increased if no traces are found:

image

itsderek23 avatar Jul 18 '18 19:07 itsderek23