addons
Investigate taar and taar-lite powered API endpoints performance
This is probably out of our control, but taar and taar-lite powered API endpoints are considerably slower than they used to be.

This is probably a combination of their migration to GCP and/or changes on their side, but we should investigate to find out if there is something we did that caused this, and what we could do to improve performance regardless of the cause.
I am wondering if those endpoints are slower because they "fallback" more often than before, WDYT? We've seen lots of errors coming from the TAAR service in Sentry over the last few months.
~~Yes, that's my theory as well. It's likely we're erroring more or just waiting for a response until the timeout more than before, and that causes the slowdown.~~ See also https://bugzilla.mozilla.org/show_bug.cgi?id=1668614 which could be related.
We could reduce the timeout to improve our API performance? (at the expense of getting even more timeouts)
Shouldn't we increase the timeout instead so that we give more time to TAAR to reply? (which should still be faster than our fallback code?)
Looking more closely at Sentry, the timeout is rarely reached now - it used to be worse. It's set to one second so it doesn't really matter here I think - the slowness is probably for all requests, even successful ones. It might be because of the fallback but I suspect this is more on the taar side, partly because taar is on GCP and we're not.
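To make the timeout/fallback trade-off discussed above concrete, here is a minimal sketch of the pattern. It is not the actual addons-server code: `recommend_with_fallback`, `fetch` and `fallback` are hypothetical names, and the sketch assumes the HTTP client signals a timeout by raising `TimeoutError` (a real client such as `requests` raises its own exception types).

```python
import time


def recommend_with_fallback(fetch, fallback, timeout=1.0):
    """Call fetch(timeout); if it times out or fails to connect, use fallback().

    Returns (recommendations, used_fallback, elapsed_seconds).
    All names here are hypothetical, for illustration only.
    """
    start = time.monotonic()
    try:
        result = fetch(timeout)
        used_fallback = False
    except (TimeoutError, ConnectionError):
        # The time already spent waiting is still paid by the caller,
        # on top of whatever the fallback itself costs.
        result = fallback()
        used_fallback = True
    return result, used_fallback, time.monotonic() - start


def failing_fetch(timeout):
    # Stand-in for a request that never gets an answer.
    raise TimeoutError("no answer from the recommendation service")


recs, used_fallback, elapsed = recommend_with_fallback(
    failing_fetch, lambda: ["curated-addon"]
)
print(recs, used_fallback)
```

With this shape, lowering the timeout only helps requests that would have fallen back anyway. If successful responses are themselves slow, as suspected here, the timeout value has little effect.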
I created a dashboard to monitor performance of all "external" services in AMO. It doesn't distinguish between taar and taar-lite (we use the same statsd timer) but it clearly shows perf getting worse at the end of September.
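If we wanted the dashboard to tell taar and taar-lite apart, one option would be a separate timer key per service. The sketch below uses a small in-memory stand-in for a statsd timer so that it runs on its own. The key names `services.taar` and `services.taar_lite` are assumptions, not the keys AMO actually uses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-in for a statsd client: key -> list of durations in seconds.
timings = defaultdict(list)


@contextmanager
def timer(key):
    """Record how long the wrapped block takes under the given key."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[key].append(time.monotonic() - start)


def mean_ms(key):
    """Average recorded duration for a key, in milliseconds."""
    values = timings[key]
    return 1000 * sum(values) / len(values)


# Hypothetical call sites: one key per upstream service.
with timer("services.taar"):
    pass  # the request to TAAR would go here

with timer("services.taar_lite"):
    pass  # the request to TAAR lite would go here

print(sorted(timings))
```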

The perf hit does coincide with the date we switched TAAR to GCP (09/24). That cost us ~200ms per call, which is not great, but I still can't explain the second spike in the DiscoveryViewSet graph. It doesn't show up on the graph that monitors the requests to TAAR, and it doesn't show up on the TAAR graphs themselves.
No changes to DiscoveryViewSet (and no other obvious changes either) in https://github.com/mozilla/addons-server/compare/2020.10.22-1...2020.10.29 - assuming that's the date on the chart for the spike.
Revisiting this issue with some profiling data on dev.
We can see that for both the call to taar-lite and taar... it's happening twice! That's because we're calling https://<host>/<prefix>/<guid-or-client-id> - no trailing slash - and it redirects to https://<host>/<prefix>/<guid-or-client-id>/ with the trailing slash... So the time it takes to get the answer is doubled every time...
Actually that's wrong - in both cases, we are ending with a /. Sometimes taar does take a while to answer...
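For future reference, a quick way to rule a trailing-slash redirect in or out is to make a single request without following redirects and look at the status code and `Location` header. This is a sketch using only the standard library; `redirect_target` is a hypothetical helper, not something in addons-server.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirect_target(url, timeout=1.0):
    """Return the Location header if `url` answers with a redirect, else None.

    http.client never follows redirects, so we see the first response as-is.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        if response.status in REDIRECT_STATUSES:
            return response.getheader("Location")
        return None
    finally:
        conn.close()
```

Calling it on the exact URL we send, once with and once without the trailing slash, would show whether the extra round trip exists. `None` for the URL we actually use means no redirect is involved.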
Sentry issue: ADDONS-SERVER-PROD-7B
See also https://github.com/mozilla/addons/issues/8118
See also https://github.com/mozilla/taar/issues/113
Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDSRV-48