fix: prevent caching carousel errors when Solr fails
Fixes #11373
Problem
When Solr fails, carousel error responses were cached for 5 minutes, causing all users to see empty carousels even after Solr recovered.
Solution
Add conditional caching logic to detect and skip caching error responses. Failed requests now trigger fresh retries instead of serving cached errors.
Files Changed
openlibrary/core/cache.py
Added optional cacheable parameter to memcache_memoize:
- Callback function that determines if a value should be cached
- Returns
Trueto cache,Falseto skip - Wrapped in try/except for fail-safe behavior
openlibrary/plugins/upstream/utils.py
Added carousel validation logic:
-
is_cacheable_carousel_content()- checks rendered HTML for error indicators, missing carousel structure, or empty content -
is_cacheable_macro()- applies carousel validation only to carousel macros, default caching for others - Connected validation to
memcache_memoizeviacacheableparameter
openlibrary/plugins/worksearch/code.py
Added Solr error handling:
- Check for
response.errororresponse.raw_resp is Nonebefore processing - Return structured error dict instead of crashing with TypeError
- Log errors for monitoring
openlibrary/tests/core/test_cache.py
Added comprehensive tests:
-
test_cacheable_callback()- verifies normal values cached, errors not cached -
test_cacheable_callback_edge_cases()- tests None, empty, partial data handling -
test_cacheable_callback_exception_handling()- verifies fail-safe behavior
Testing
pytest openlibrary/tests/core/test_cache.py -v
All tests pass. No breaking changes to existing functionality.
Checklist
- [x] Fixes reported issue
- [x] Prevents caching on errors
- [x] Adds tests to verify behavior
- [x] No breaking changes
- [x] Fail-safe error handling
Stakeholders
@jimchamp
PTAL @jimchamp, I’ve tried my best to fix the issue. Let me know if anything else needs attention.
The first step to solving #11373 is to reproduce the bug. Please explain how you reproduced it.
@jimchamp I reproduced the bug by targeting /partials.json (the actual carousel endpoint that internally calls render_cached_macro(), not the homepage HTML which only contains placeholders).
Reproduction: Stop Solr → Clear cache → Fetch /partials.json?macro=ReadingLogCarousel → Empty response gets cached via render_cached_macro() → Restart Solr → Bug: Empty carousel persists for 5 minutes despite Solr being healthy.
Local verification: I created reproduction scripts (reproduce_issue.py and test_carousel_cache_bug.sh) that demonstrate the issue. The scripts confirm that render_cached_macro() caches the error response, aligning with the issue description.
Production impact: When Solr experiences issues, all users see broken carousels even after recovery. My fix prevents caching error responses by adding conditional caching logic to memcache_memoize() and validating carousel content in render_cached_macro().
The reproduction scripts are local testing tools and not included in the PR. Would you like me to share them for verification, or would you prefer to review the fix implementation directly?
If my reproduction approach isn't good, please let me know.
The issue here is that our caching system is currently undergoing a major refactor. These changes will likely conflict with that effort.
This can be done without modifying cache.py. I've added some notes about this here.
We ask that contributors request to be assigned to issues before working on them. This gives contributors an opportunity to ask questions about issue before writing code that may not solve the issue in a satisfactory way. Please start doing this before working on future issues.