
Improve Open Library uptime

cdrini opened this issue 3 weeks ago

Feature Request

Open Library has had a lot of downtime over the past year. This past month, daily downtime ranged from ~3min to ~130min -- that's 0.2% to 9% downtime (with one outlier full day of downtime due to physical hardware issues, which are rare and outside of our control). Excluding the outlier, that's an average of ~40min (median 32min) of downtime per day.

[Image: chart of daily downtime over the past month]

We would like this to go down. A reasonable first goal might be an average of ~15min per day (~1% downtime).

Note we define "downtime" or "low availability" as a high rate of 503 responses, or minutes where the monitoring service is disconnected.
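
As a rough sketch of that definition (the sample shape and the 50% threshold here are hypothetical, not our actual monitoring config), a minute could be counted as "down" when the monitor loses its connection or when 503s dominate the responses:

```python
from dataclasses import dataclass

@dataclass
class MinuteSample:
    """Hypothetical per-minute sample from a monitoring service."""
    total: int        # total requests observed this minute
    errors_503: int   # how many of them returned HTTP 503
    heartbeat: bool   # False if the monitor couldn't reach the site at all

def is_down(sample: MinuteSample, threshold: float = 0.5) -> bool:
    """A minute counts as downtime if the monitor was disconnected,
    or if at least `threshold` of its responses were 503s."""
    if not sample.heartbeat:
        return True
    if sample.total == 0:
        return False
    return sample.errors_503 / sample.total >= threshold

def downtime_minutes(samples: list[MinuteSample]) -> int:
    # Summing booleans counts the True (down) minutes.
    return sum(is_down(s) for s in samples)
```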

We have already taken some steps to address this issue:

These changes collectively (excluding the solr timeouts, which were done earlier) brought average downtime from 49min/day to 33min/day between [11-03, 11-17] and [11-18, 12-02] -- a 32% improvement.
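
As a quick sanity check, the percentages in this issue follow directly from minutes-per-day over a 1440-minute day:

```python
DAY = 24 * 60  # 1440 minutes in a day

print(3 / DAY)         # ~0.002 -> 0.2% on the best day
print(130 / DAY)       # ~0.090 -> ~9% on the worst non-outlier day
print(15 / DAY)        # ~0.010 -> the ~1% goal
print((49 - 33) / 49)  # ~0.327 -> the ~32% improvement above
```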

The current outstanding causes of our downtime are:

  • Frequent but short-lived downtime due to worker saturation -- unclear what is saturating the workers; possibly IA requests.
  • Rare but longer-lasting downtime due to solr saturation from expensive queries via the API.

You can see the two cases here. Cases with high 503s plus haproxy queueing, but no slow solr, are the first case; cases that also show the slow-solr red band are the second.

[Image: dashboard showing 503s, haproxy queueing, and the slow-solr red band]
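
To make the distinction concrete, here's a hypothetical sketch (the flag names are invented for illustration; they correspond to the dashboard signals described above) of how an incident might be bucketed:

```python
def classify_incident(high_503s: bool, haproxy_queueing: bool, slow_solr: bool) -> str:
    """Bucket a downtime incident using the two signatures described above."""
    if high_503s and slow_solr:
        # The slow-solr red band also appears: expensive queries are the cause.
        return "solr saturation from expensive API queries (rare, longer-lasting)"
    if high_503s and haproxy_queueing:
        # 503s and queueing without slow solr: workers are saturated.
        return "worker saturation (frequent, short-lived)"
    return "unclassified"
```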

Proposal

Breakdown

Related files

Refer to this map of common endpoints: *

Requirements Checklist

Checklist of requirements that need to be satisfied in order for this issue to be closed:

  • [ ]

Stakeholders


Instructions for Contributors

cdrini · Dec 03 '25 17:12

Hey @cdrini, here are my proposed changes to combat the longer-lasting downtimes:

  1. Set _pass_time_allowed to True in languages.py -- I believe setting it to False is letting larger queries clog up the system.
  2. In subjects.py, the current publish_year facet limit is -1, i.e. unlimited. For bad data like typos, that list could grow infinitely large and crash the system. Setting the publish_year limit to 2000 would capture all legitimate years while keeping the list bounded. Both changes are sketched below.
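
For illustration only -- I don't know the exact Open Library code paths, and the endpoint URL and query below are assumptions -- here is roughly how both proposals map onto standard Solr query parameters: timeAllowed caps how many milliseconds Solr spends on a query, and facet.limit bounds how many facet values come back (-1 means unlimited):

```python
import requests

# Hypothetical Solr endpoint; the real Open Library host/core will differ.
SOLR_URL = "http://solr:8983/solr/openlibrary/select"

params = {
    "q": "subject:fantasy",
    # Proposal 1: pass Solr's timeAllowed so an expensive query is cut off
    # after 10s instead of occupying a worker indefinitely.
    "timeAllowed": 10_000,  # milliseconds
    "facet": "true",
    "facet.field": "publish_year",
    # Proposal 2: bound the facet list. -1 means "unlimited", which lets bad
    # data (e.g. typo'd years) blow up the response; 2000 distinct values
    # comfortably covers every legitimate publication year.
    "facet.limit": 2000,
}

response = requests.get(SOLR_URL, params=params, timeout=15)
print(response.json()["response"]["numFound"])
```

Note that timeAllowed is a soft limit: Solr may return partial results rather than an error when a query trips it.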

Let me know what you think about this

saifxyzyz · Dec 05 '25 08:12

@saifxyzyz please kindly wait for an issue to be triaged before submitting a PR. Otherwise our automations don't assign Assignees, and we can't vouch that an issue is something we're planning to work on.

mekarpeles · Dec 15 '25 21:12

@mekarpeles apologies, will keep in mind

saifxyzyz · Dec 15 '25 21:12

@cdrini we may want to move this into a 2026 goals doc, as it's difficult to act on here as-is. Making you the lead so we can figure out next steps.

mekarpeles · Dec 17 '25 20:12