Improve Open Library uptime
Feature Request
Open Library has had a lot of downtime over roughly the past year. This past month, downtime ranged from ~3min to ~130min per day, i.e. 0.2% to 9% downtime (with one outlier: a full day of downtime due to physical hardware issues, which are rare and outside of our control). Excluding the outlier, that's an average of 40min (median 32min) of downtime per day.
We would like this to go down. Perhaps a reasonable first goal is an average of ~15min per day (~1% downtime).
Note we define "downtime" or "low availability" as a high level of 503s returned, or minutes where the monitoring service is disconnected.
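As a rough illustration of the definition and percentages above, here is a minimal sketch of how daily downtime could be computed. The per-minute sample format, the function names, and the 50% threshold for "a high level of 503s" are assumptions made for this example, not what our monitoring actually uses.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_percent(minutes_down: float) -> float:
    """Convert minutes of downtime in one day to a percentage of that day."""
    return 100 * minutes_down / MINUTES_PER_DAY


def count_downtime_minutes(samples, threshold: float = 0.5) -> int:
    """Count the minutes that qualify as downtime.

    `samples` holds one entry per minute: a (total_requests, count_503) tuple,
    or None when the monitoring service was disconnected for that minute.
    `threshold` is a hypothetical cutoff for "a high level of 503s".
    """
    down = 0
    for sample in samples:
        if sample is None:
            # Monitoring disconnected: counted as downtime per the definition.
            down += 1
            continue
        total, errors = sample
        if total > 0 and errors / total >= threshold:
            down += 1
    return down
```

With this conversion, 3min, 130min, and 15min per day come out to roughly 0.2%, 9%, and 1% downtime respectively, matching the figures quoted above.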
We have already taken some steps to address this issue:
- The primary cause of downtime earlier in the month was distributed, non-identifying crawlers sidestepping our rate limits and causing performance degradation. We have put mitigations in place to help catch such cases, and have seen downtime improve.
- The second cause of downtime was solr saturation due to expensive requests resulting in haproxy queueing. We added solr-side enforced timeouts to many of our queries, blocked certain unintentional expensive queries, and tuned solr's caching.
These changes collectively (excluding the solr timeouts, which were done earlier) brought the average downtime from 49min/day over [11-03, 11-17] to 33min/day over [11-18, 12-02] (32% improvement).
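For context on the solr-side timeouts mentioned in the list above, here is a minimal sketch assuming they rely on Solr's standard `timeAllowed` request parameter (in milliseconds). The helper names, the 2000ms default, and the other query parameters are hypothetical and chosen only for illustration; they are not taken from the Open Library codebase.

```python
from urllib.parse import urlencode


def build_solr_params(query: str, time_allowed_ms: int = 2000, rows: int = 20) -> dict:
    """Build Solr query parameters with a server-side time budget.

    `timeAllowed` asks Solr to stop searching once the budget is spent and
    return whatever it has found so far. The default here is an assumption.
    """
    return {
        "q": query,
        "rows": rows,
        "wt": "json",
        "timeAllowed": time_allowed_ms,
    }


def is_partial(response: dict) -> bool:
    """Return True if Solr flagged the response as cut short by the time budget."""
    return bool(response.get("responseHeader", {}).get("partialResults", False))


# Example: the query string that would be appended to a Solr select URL.
query_string = urlencode(build_solr_params("title:example"))
```

A caller would typically check `is_partial` on the decoded JSON response and decide whether a truncated result is acceptable or should be treated as an error.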
The current outstanding causes of our downtime are:
- Frequent but short-lived downtime due to work saturation; it is unclear with what, possibly IA requests.
- Rare but longer-lasting downtime from solr saturation due to other expensive queries via the API.
You can see the two cases here. Cases where we see high 503s and haproxy queueing, but no slow solr, are the first case. Cases where we also see the slow solr red band are the second case.
Hey @cdrini, here are my proposed changes to combat the longer-lasting downtimes:
- Set _pass_time_allowed to True in languages.py. I believe leaving it set to False lets larger queries clog up the system.
- In subjects.py, the current publish_year limit for a facet is -1. With bad data such as typos, this list could grow without bound and crash the system. I think setting the publish_year limit to 2000 would capture all legitimate years while keeping the list from growing indefinitely.
Let me know what you think about this
@saifxyzyz please kindly wait for an issue to be triaged before submitting a PR. Otherwise, our automations do not assign Assignees, and we can't even vouch that an issue is something we're planning to work on.
@mekarpeles apologies, will keep in mind
@cdrini we may want to move this into a 2026 goals doc, as it's difficult to act on here as is. Making you the lead so we can figure out next steps.