wp-calypso icon indicating copy to clipboard operation
wp-calypso copied to clipboard

Add more data to Calypso's statsd response_time metric

Open noahtallen opened this issue 2 years ago • 2 comments

Proposed Changes

I'd like to get some more data into our Calypso server response time dashboard. (Search for "calypso node response times" in Grafana to find it.) In particular, I'd like to visualize response time of SSR sections vs non-SSR sections and logged-in vs logged-out requests.

This is my first time looking at statsd, and my understanding is that you filter things with periods. So the current dashboard uses calypso.production.*.response_time, and * represents all the sections.

To add more "metadata", I think we can just add more things with periods. So this would extend the key to calypso.production.*.loggedin_${true|false}.ssr_${true|false}.response_time:$value

If you want to see all response times, I think the original query still works. If you want to see SSR response times, you could use calypso.production.*.ssr_true.response_time. Or for logged-in requests, calypso.production.*.loggedin_true.*.response_time.

Of course, I might be wrong about this, so feedback would be appreciated! We might need to tweak the dashboard when we deploy this. I'd also love any more suggestions for improving the dashboard.

I also took the opportunity to copy over some changes from middleware/logger:

  • use hrtime for more precise timings.
  • use on( 'close' ) to capture stale requests. (Let me know if this wouldn't be helpful to add. The current dashboard just accounts for "finished" requests, so it doesn't track anything that times out.)

Testing Instructions

TBD, looking for general feedback right now. Will update unit tests tomorrow

noahtallen avatar Sep 21 '22 00:09 noahtallen

This PR does not affect the size of JS and CSS bundles shipped to the user's browser.

Generated by performance advisor bot at iscalypsofastyet.com.

matticbot avatar Sep 21 '22 00:09 matticbot

If you want to see all response times, I think the original query still works.

I tried to verify this in the Grafana dashboard and the query doesn't work. I tried to use a single star calypso.*.response_time pattern to catch two levels, like in calypso.production.domains.response_time, and it doesn't work, there are "No data":

Screenshot 2022-09-23 at 13 43 56

I need to use two stars, calypso.*.*.response_time, to get a chart for aggregated data. Seems that adding levels inside the tree is not backward-compatible, the existing queries will break.

jsnajdr avatar Sep 23 '22 11:09 jsnajdr

Seems that adding levels inside the tree is not backward-compatible, the existing queries will break.

That's unfortunate. I was hoping at first that we could just update the main dashboard which uses this stat, but I think changing the query would break historical data on the graph.

So I guess the best route would be to create a new timer. I started exploring how to send multiple events simultaneously so that we can do this without impacting the number of events we collect. It seems we could add another "beacon" here:

https://github.com/Automattic/wp-calypso/blob/73772fb3b10d6e9e69503cb9df9f8425ed32f6c7/client/lib/analytics/statsd-utils.ts#L7-L11

But I haven't been able to find any documentation about boom.gif. In particular, do we really need the u query parameter since the same thing is already encoded into the statsd event? Currently, the entire event slug is passed to the u parameter, which doesn't make sense if we're sending multiple events at the same time, each with different slugs. Curious if you know more @jsnajdr or @griffbrad.

noahtallen avatar Sep 23 '22 21:09 noahtallen

But I haven't been able to find any documentation about boom.gif.

I see I forgot to submit this comment on Friday 🙂 https://github.com/Automattic/wp-calypso/pull/68068#discussion_r978540561 @josephscott should be the go-to person for boom.gif. The code is not in the main wpcom repository and I haven't found any docs either.

jsnajdr avatar Sep 26 '22 07:09 jsnajdr

It seems we could add another "beacon" here

Yes, more beacons should be fine.

josephscott avatar Sep 26 '22 16:09 josephscott

Added a small improvement which fixes those types and also looks at removing server analytics.statsd in favor of a small util function in client/lib/analytics/statsd-utils. It just doesn't seem like that server-side wrapper function provides much value when we can just import from the underlying library. (Though, the main reason I tried this was because if we try to introduce typescript to server/pages/analytics.js, we would have to do a lot of extra work to get things like the express Request object typed correctly. Which definitely seems worth doing in the future, but is out of scope for this PR)

I'd also like to continue testing the event name before merging. We might want to change it from loggedin_${ loggedIn }.ssr_${ usedSSRHandler }.response_time to response_time.x.x (with the "filters" after the event name.) I also wonder if we should move the calypso section to the "filter" part of the event, instead of having it come before the event name. My impression is that it's more difficult to write queries in Grafana if the static data (like the event name) does not come first. But I need to read up on best-practices for event names :)

noahtallen avatar Oct 07 '22 09:10 noahtallen