wp-calypso
wp-calypso copied to clipboard
Add more data to Calypso's statsd response_time metric
Proposed Changes
I'd like to get some more data into our Calypso server response time dashboard. (Search for "calypso node response times" in Grafana to find it.) In particular, I'd like to visualize response time of SSR sections vs non-SSR sections and logged-in vs logged-out requests.
This is my first time looking at statsd, and my understanding is that you filter things with periods. So the current dashboard uses calypso.production.*.response_time
, and *
represents all the sections.
To add more "metadata", I think we can just add more things with periods. So this would extend the key to calypso.production.*.loggedin_${true|false}.ssr_${true|false}.response_time:$value
If you want to see all response times, I think the original query still works. If you want to see SSR response times, you could use calypso.production.*.ssr_true.response_time
. Or for logged-in requests, calypso.production.*.loggedin_true.*.response_time
.
Of course, I might be wrong about this, so feedback would be appreciated! We might need to tweak the dashboard when we deploy this. I'd also love any more suggestions for improving the dashboard.
I also took the opportunity to copy over some changes from middleware/logger
:
- use hrtime for more precise timings.
- use
on( 'close' )
to capture stale requests. (Let me know if this wouldn't be helpful to add. The current dashboard just accounts for "finished" requests, so it doesn't track anything that times out.)
Testing Instructions
TBD, looking for general feedback right now. Will update unit tests tomorrow
This PR does not affect the size of JS and CSS bundles shipped to the user's browser.
Generated by performance advisor bot at iscalypsofastyet.com.
If you want to see all response times, I think the original query still works.
I tried to verify this in the Grafana dashboard and the query doesn't work. I tried to use a single star calypso.*.response_time
pattern to catch two levels, like in calypso.production.domains.response_time
, and it doesn't work, there are "No data":
data:image/s3,"s3://crabby-images/e4e7c/e4e7c1aefe775efa9b9584b27df9dd8cb2ef15ce" alt="Screenshot 2022-09-23 at 13 43 56"
I need to use two stars, calypso.*.*.response_time
, to get a chart for aggregated data. Seems that adding levels inside the tree is not backward-compatible, the existing queries will break.
Seems that adding levels inside the tree is not backward-compatible, the existing queries will break.
That's unfortunate. I was hoping at first that we could just update the main dashboard which uses this stat, but I think changing the query would break historical data on the graph.
So I guess the best route would be to create a new timer. I started exploring how to send multiple events simultaneously so that we can do this without impacting the number of events we collect. It seems we could add another "beacon" here:
https://github.com/Automattic/wp-calypso/blob/73772fb3b10d6e9e69503cb9df9f8425ed32f6c7/client/lib/analytics/statsd-utils.ts#L7-L11
But I haven't been able to find any documentation about boom.gif
. In particular, do we really need the u
query parameter since the same thing is already encoded into the statsd event? Currently, the entire event slug is passed to the u
parameter, which doesn't make sense if we're sending multiple events at the same time, each with different slugs. Curious if you know more @jsnajdr or @griffbrad.
But I haven't been able to find any documentation about boom.gif.
I see I forgot to submit this comment on Friday 🙂 https://github.com/Automattic/wp-calypso/pull/68068#discussion_r978540561 @josephscott should be the go-to person for boom.gif
. The code is not in the main wpcom
repository and I haven't found any docs either.
It seems we could add another "beacon" here
Yes, more beacons should be fine.
Added a small improvement which fixes those types and also looks at removing server analytics.statsd
in favor of a small util function in client/lib/analytics/statsd-utils. It just doesn't seem like that server-side wrapper function provides much value when we can just import from the underlying library. (Though, the main reason I tried this was because if we try to introduce typescript to server/pages/analytics.js, we would have to do a lot of extra work to get things like the express Request object typed correctly. Which definitely seems worth doing in the future, but is out of scope for this PR)
I'd also like to continue testing the event name before merging. We might want to change it from loggedin_${ loggedIn }.ssr_${ usedSSRHandler }.response_time
to response_time.x.x
(with the "filters" after the event name.) I also wonder if we should move the calypso section to the "filter" part of the event, instead of having it come before the event name. My impression is that it's more difficult to write queries in Grafana if the static data (like the event name) does not come first. But I need to read up on best-practices for event names :)