openverse Prevent Django Admin default queries on primary media tables in production

Description

On several occasions now, we've run into production resource issues with our database due to Django Admin performing some default query against our (massive) primary media tables (e.g. #970).

Recent changes to the reporting admin (specifically #4254) have added new views that are also selecting the primary media tables in order to generate the UI. These are incredibly useful for local testing, but are actively harmful for production. ~~We should alter the logic for these fields so that the automatic completion only occurs locally and does not occur in production. This will help prevent issues from occurring from simply visiting pages in Django Admin in production.~~

Going forward (per @sarayourfriend's comment https://github.com/WordPress/openverse/issues/4344#issuecomment-2146265927), we should create a custom object manager for the media models that ensures the following things:

ORDER BY is not used on the base media tables
A LIMIT is always defined for queries on the media tables

This would prevent us from erroneously adding naive queries that could impact production in the future, even if we add the functionality for the query in a naive way.

May 16 '24 18:05 AetherUnbound

Reopening to signify my request for clarification about the status of this issue and the solution implemented in https://github.com/WordPress/openverse/pull/4349#issuecomment-2121456363

May 21 '24 00:05 sarayourfriend

@sarayourfriend can we close this issue, or did you want to keep it open until we can remove the functions that the associated issue created? I think #4386 maybe have also helped accomplish this.

Jun 03 '24 19:06 AetherUnbound

I want to prevent select * from <media> without a limit. There is literally no use case for it in the API and it is trivial to cause a problem in production by doing that, even without knowing (e.g., Django admin forms).

I think it could be accomplished by creating a custom object manager for the media models that prevents selecting more than the maximum number of rows we need for search.

To me the problem isn't with the implementation of the Django admin forms we have now. The problem is that it's even possible. I want to make it impossible to cause this problem in production, not just something we try to cautiously avoid.

Jun 03 '24 23:06 sarayourfriend

FWIW I think what's more destructive, even if a limit is provided, is select * from <media> order by <column>, since the order by clause is what causes very heavy operations on the database. Just noting that for the eventual work on this ticket. Agreed that a system-level block for that would be ideal!

Jun 04 '24 18:06 AetherUnbound

Sounds good! Whatever the features we shouldn't use on those tables, the object manager can block them. Can you update the issue body with what you think those features that need to be blocked are? Is it just order by?

Jun 04 '24 22:06 sarayourfriend

I was looking into this and although I found that creating a custom object manager for limiting the number of rows returned works for the initial query, and even though it still returns a QuerySet, we cannot used it.

class MediaManager(models.Manager):
    def get_queryset(self):
        return super().get_queryset()[:5]  # example number

The documentation says:

Further filtering or ordering of a sliced queryset is prohibited due to the ambiguous nature of how that might work.

We have to resort to being alert to possible queries that can get out of control. We also can't remove the sort statements because they are used in admin views. If anyone has any alternatives I'd be happy to hear them. For now, I'm closing this issue.

Nov 18 '24 23:11 krysal

@krysal the only alternative I can think of that would accomplish this is to subclass QuerySet and enforce a limit in _fetch_all... or use it to set a custom subclass of ModelIterable onto QuerySet._iterable_class and check that the queryset.query has limits set in ModelIterable.__init__. Neither seem optimal, and I'm not sure what the side effects would be. Without interfering with existing library code, you could also log an error message when no limit is set on those models rather than prevent it altogether, so maintainers can easily identify when it happens. Otherwise, unless there are plans to implement query monitoring to alert on long queries (and maybe auto-kill them), it's probably prudent to put some kind of check in place to prevent it.

Nov 19 '24 02:11 sarayourfriend

openverse openverse copied to clipboard

Prevent Django Admin default queries on primary media tables in production

Description

openverse
openverse copied to clipboard