bedrock icon indicating copy to clipboard operation
bedrock copied to clipboard

Can we optimise our 404 handling now we're in a Wagtail-backed era?

Open stevejalim opened this issue 7 months ago • 0 comments

Description

www-dev is now running against a postgres database replica of the main CMS DB. However, we're seeing some load-related issues where drive-by scans of the site are triggering a lot of 404s, which is is having DB impact.

One aspect of this (but quite possibly not the entire reason for the DB load) is when scans include URLs that 404.

Because of the way Wagtail has to be integrated into a Django project, Wagtail's URL routes are the very last ones checked.

So, a request for /some/path/that/will/never/match/ is checked against all of the Django-defined URLs and then goes into wagtail.views.serve which tries to find a page whose chain of parent slugs match some <- path <- that <- will <- never <- match. This hits the database, which - under heavy load - can cause connection saturation and pod cycling.

Looking at the data, we could (if we're not already) enable RFC-compliant 404 cacheing, but many of the 404s can't be cached at the CDN because many of them are unique enough to miss or expire from the cache before next time.

Proposed solution

  1. Every time the published page tree changes (observed via Django Signals or Wagtail Hooks), get that tree from Wagtail
  2. Extract all the paths, including definitely-in-use-for-that-path lang-code prefixes
  3. Park these in a lookup table (likely stored in Redis or Memcache, because an in-memory set() would need each pod to calculate itself, which will get more expensive over time)
  4. Slot in a check before wagtail.views.serve is called: if request.path is NOT in the lookup table, raise a 404 there, before we go into Wagtail's code

As for how to slot in the check: we add a decorator around include('wagtail_urls') - something like

from .utils.decorators import pre_check_cms_paths

decorated_wagtail_urls = pre_check_cms_paths(wagtail_urls.urlpatterns, log_request)

urlpatterns = i18n_patterns(
    path("", include((decorated_wagtail_urls, 'wagtail'), namespace='wagtail')),
)

Challenges

This list may grow as we explore more.

  • RouteablePageMixin is, according to a Wagtail Core dev, likely to spoil this plan a bit. Need to explore exactly why.

Success Criteria

  • [ ] We have a stable and performant way to pre-empt a Wagtail-raised 404.
  • [ ] It has robust test coverage

Stretch Criteria

  • [ ] The solution is easy for other teams to use, too
  • [ ] It's available a Python package

stevejalim avatar Jun 26 '24 13:06 stevejalim