openverse
Investigate and decrease image proxy upstream request timeout
Problem
Image proxy requests time out at 15 seconds. Similar to #4507, thumbnail requests are highly susceptible to upstream provider problems. If Flickr has a bad day, we'll have a bad day because we can't generate thumbnails (or validate dead links, to tie it back to that example).
Description
We should perform an analysis similar to the one I did for #4507. Add logging to the thumbnail requests to time how long they take so we can analyse them. Log the time, status, and URL, and maybe go ahead and log the provider slug specifically so we can more easily segment the analysis by provider (I wish we had done so for the dead link validation).
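As a rough illustration of the logging described above, here is a minimal sketch using only the Python standard library. The function name `log_upstream_timing`, the logger name, the field names, and the `fetch(url, timeout)` callable are all hypothetical and not taken from the Openverse codebase; the real thumbnail view would need to be adapted.

```python
import json
import logging
import time

# Hypothetical logger name; the real project may use a different one.
logger = logging.getLogger("thumbnail_timing")


def log_upstream_timing(fetch, url, provider, timeout=15.0):
    """Call fetch(url, timeout) and log how long the upstream request took.

    `fetch` is an assumed callable that returns an HTTP status code and
    raises TimeoutError when the upstream does not answer in time.
    """
    start = time.perf_counter()
    status = None
    try:
        status = fetch(url, timeout)
        return status
    except TimeoutError:
        status = "timeout"
        raise
    except Exception:
        status = "error"
        raise
    finally:
        # One structured line per request: time, status, url, provider slug.
        logger.info(
            json.dumps(
                {
                    "elapsed_seconds": time.perf_counter() - start,
                    "status": status,
                    "url": url,
                    "provider": provider,
                }
            )
        )
```

Emitting one JSON object per request keeps the later analysis simple, because each line can be parsed and grouped by `provider` without regex work.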
Deploy the logs for a week and follow the same approach to analysing the times as I took for the link validation. Find out how long (average, p95, p99) it takes for successful thumbnails to come back. Find out whether there are meaningful differences between providers that would justify modulating the timeout per provider, and so on.
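The per-provider summary could be computed along these lines once the logs are collected. This is only a sketch: it assumes each log entry is a dict with hypothetical `provider`, `status`, and `elapsed_seconds` fields, and it treats a 200 status as "successful", which may not match how the real analysis defines success.

```python
import statistics
from collections import defaultdict


def summarize_by_provider(entries):
    """Return {provider: {"avg", "p95", "p99"}} for successful requests only."""
    timings = defaultdict(list)
    for entry in entries:
        if entry["status"] == 200:  # assumption: 200 means a successful thumbnail
            timings[entry["provider"]].append(entry["elapsed_seconds"])

    summary = {}
    for provider, values in timings.items():
        if len(values) < 2:
            # Too few samples to compute percentiles meaningfully; skip.
            continue
        # quantiles(n=100) returns 99 cut points; index 94 is p95, 98 is p99.
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        summary[provider] = {
            "avg": statistics.mean(values),
            "p95": cuts[94],
            "p99": cuts[98],
        }
    return summary
```

Comparing the `p99` values across providers is one way to judge whether a single lower timeout would do, or whether some providers would need their own.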
The goal is to figure out whether we can lower the 15-second timeout without creating significant negative consequences. Some failed thumbnails must be accepted, but something to consider: if a request times out (specifically times out), we could try increasing the timeout the next time that thumbnail is requested, and do so progressively until we hit some maximum (whether that's 15 seconds or otherwise).
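The progressive timeout idea might look something like the sketch below. The base timeout of 3 seconds and the doubling factor are placeholders, not values derived from any analysis, and the in-memory dict stands in for whatever shared store would actually track prior timeouts per thumbnail.

```python
def timeout_for_attempt(prior_timeouts, base=3.0, factor=2.0, maximum=15.0):
    """Timeout in seconds, growing with each prior timeout up to `maximum`."""
    return min(base * factor ** prior_timeouts, maximum)


class ProgressiveTimeouts:
    """Tracks how many times each thumbnail URL has timed out (hypothetical)."""

    def __init__(self, base=3.0, factor=2.0, maximum=15.0):
        self.base = base
        self.factor = factor
        self.maximum = maximum
        self._timeouts = {}  # url -> number of prior timeouts

    def current(self, url):
        """Timeout to use for the next request to this URL."""
        count = self._timeouts.get(url, 0)
        return timeout_for_attempt(count, self.base, self.factor, self.maximum)

    def record_timeout(self, url):
        """Call only when the request specifically timed out."""
        self._timeouts[url] = self._timeouts.get(url, 0) + 1

    def record_success(self, url):
        """Reset so the URL goes back to the base timeout."""
        self._timeouts.pop(url, None)
```

With these placeholder values, a thumbnail that keeps timing out would be tried at 3, 6, 12, and then 15 seconds, while one that succeeds drops back to the base timeout.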
Additional context
This is mostly a way to investigate whether thumbnail timeouts really need to be so high. If they do, that's fine, but it'd be good to know.
It is not, however, likely to be an alternative to segmenting our API request timing monitors. Thumbnail requests will probably always take a little longer than our other requests, and along with search, they will always be the most important to monitor, without the speedy outliers or volume of one affecting the visibility of problems with the other.