Investigate taar and taar-lite powered API endpoints performance

Open diox opened this issue 4 years ago • 14 comments
This is probably out of our control, but taar and taar-lite powered API endpoints are considerably slower than they used to be:

Screenshot_2020-11-30 AMO Prod frontend APIs usage performance - Grafana(2)

Screenshot_2020-11-30 AMO Prod frontend APIs usage performance - Grafana(1)

This is probably a combination of their migration to GCP and/or changes on their side, but we should investigate to find out if there is something we did that caused this, and what we could do to improve performance regardless of the cause.

┆Issue is synchronized with this Jira Task

diox avatar Nov 30 '20 11:11 diox

I am wondering if those endpoints are slower because they fall back to our own code more often than before, WDYT? We've seen lots of errors coming from the TAAR service in Sentry over the last few months.

willdurand avatar Dec 01 '20 10:12 willdurand

~~Yes, that's my theory as well. It's likely we're erroring more or just waiting for a response until the timeout more than before, and that causes the slowdown.~~ See also https://bugzilla.mozilla.org/show_bug.cgi?id=1668614 which could be related.

diox avatar Dec 01 '20 10:12 diox

We could reduce the timeout to improve our API performance (at the expense of getting even more timeouts)?

eviljeff avatar Dec 02 '20 13:12 eviljeff

Shouldn't we increase the timeout instead so that we give more time to TAAR to reply? (which should still be faster than our fallback code?)

willdurand avatar Dec 02 '20 13:12 willdurand

Looking more closely at Sentry, the timeout is rarely reached now - it used to be worse. It's set to one second so it doesn't really matter here I think - the slowness is probably for all requests, even successful ones. It might be because of the fallback but I suspect this is more on the taar side, partly because taar is on GCP and we're not.
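To tell "slow because we fall back" apart from "slow even when taar answers", the call can be timed and tagged with its outcome. This is only a minimal sketch, not the addons-server code: `recommend_with_fallback`, `fetch` and `fallback` are hypothetical names, and `fetch` is assumed to raise an exception when the one-second timeout is hit.

```python
import time


def recommend_with_fallback(fetch, fallback, timeout=1.0):
    """Return (result, outcome, elapsed_seconds) for one recommendation call.

    `fetch` is a hypothetical callable that queries the external service and
    raises on timeout or error; `fallback` is our own slower code path.
    """
    start = time.monotonic()
    try:
        result = fetch(timeout)
        outcome = "success"
    except Exception:
        # Any failure of the external call lands in the fallback path.
        result = fallback()
        outcome = "fallback"
    return result, outcome, time.monotonic() - start


def fake_taar(timeout):
    # Stand-in for a successful call to the external service.
    return ["guid-1", "guid-2"]


result, outcome, elapsed = recommend_with_fallback(fake_taar, lambda: [])
print(outcome, len(result))
```

Recording `outcome` next to `elapsed` would show whether successful calls alone account for the slowdown.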

diox avatar Dec 02 '20 14:12 diox

I created a dashboard to monitor performance of all "external" services in AMO. It doesn't distinguish between taar and taar lite (we use the same statsd timer) but it clearly shows perf getting worse at the end of September: Screenshot_2020-12-03 AMO External Services Perf - Grafana
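For illustration, separating the two services would only need one timer key per service. A rough sketch, assuming an in-memory `timings` dict as a stand-in for the statsd client, with hypothetical metric names (`services.taar`, `services.taar_lite`) that are not the ones AMO uses:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-in for a statsd client, so the sketch runs on its own.
timings = defaultdict(list)


@contextmanager
def timer(name):
    """Record the wall-clock duration of the wrapped block under `name`."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name].append(time.monotonic() - start)


def call_service(name, fetch):
    # Hypothetical metric names: one key per service instead of a shared one.
    with timer(f"services.{name}"):
        return fetch()


call_service("taar", lambda: ["guid-1"])
call_service("taar_lite", lambda: ["guid-2"])
print(sorted(timings))
```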

diox avatar Dec 03 '20 12:12 diox

The perf hit does coincide with the date we switched TAAR to GCP (09/24). That cost us ~200ms per call, which is not great, but I still can't explain the second spike in the DiscoveryViewSet graph. It doesn't show up on the graph that monitors the requests to TAAR, and doesn't show up on the TAAR graphs themselves.

diox avatar Dec 03 '20 13:12 diox

No changes to DiscoveryViewSet (and no other obvious changes either) in https://github.com/mozilla/addons-server/compare/2020.10.22-1...2020.10.29 - assuming that's the date on the chart for the spike.

eviljeff avatar Dec 03 '20 14:12 eviljeff

Revisiting this issue with some profiling data on dev.

We can see that for both the call to taar-lite and taar... it's happening twice! That's because we're calling https://<host>/<prefix>/<guid-or-client-id> - no trailing slash - and it redirects to https://<host>/<prefix>/<guid-or-client-id>/ with the trailing slash... So the time it takes to get the answer is doubled every time...

diox avatar Nov 28 '22 16:11 diox

Actually that's wrong - in both cases, we are ending with a /. Sometimes taar does take a while to answer...
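One way to double-check whether a request really goes through a redirect is to look at the response's redirect history. A minimal sketch, assuming a response object shaped like the one returned by the `requests` library (`.history` is the list of intermediate responses, `.url` the final URL); the fake response and the `example.invalid` URLs are placeholders, not real endpoints.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return the URLs visited, in order, ending with the final URL."""
    return [hop.url for hop in response.history] + [response.url]


def was_redirected(response):
    """True if at least one redirect happened before the final response."""
    return len(response.history) > 0


# Fake response standing in for a real HTTP call; placeholder URLs only.
fake = SimpleNamespace(
    url="https://example.invalid/api/some-guid/",
    history=[SimpleNamespace(url="https://example.invalid/api/some-guid")],
)
print(was_redirected(fake), redirect_chain(fake))
```

An empty history means the URL was already in its final form, so a doubled call would have to come from somewhere else.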

diox avatar Nov 28 '22 17:11 diox

Sentry issue: ADDONS-SERVER-PROD-7B

sentry[bot] avatar Jan 17 '23 10:01 sentry[bot]

See also https://github.com/mozilla/addons/issues/8118

diox avatar Jan 17 '23 10:01 diox

See also https://github.com/mozilla/taar/issues/113

diox avatar Feb 27 '23 11:02 diox

Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDSRV-48

KevinMind avatar May 03 '24 16:05 KevinMind