warehouse Avoid n+1 queries on Release.description

Avoid n+1 queries on Release.description

Open dstufft opened this issue 1 year ago • 1 comments

@miketheman noticed that the file upload view was executing a query for every single existing version that the uploading project has ever released. For projects that have been around awhile, that can mean thousands of queries.

To make matters worse, the query it was loading was loading the Release.description column, which is a relation that points to a very large table that contains the long description of each release (both in it's raw form and in the rendered form), so it was loading our largest table, sometimes thousands of times in a single request.

Attempting to use raiseload or lazy="raise" on Release.description had no effect on the number of queries being executed, and reading through the SQLAlchemy docs, this appears to be expected if SQLAlchemy needs to execute that query to implement it's Unit of Work pattern.

Since it was triggering on every release, I guessed that it must be related to something causing SQLAlchemy to think every release was modified, so I started looking at this code:

releases = (
    request.db.query(Release)
    .filter(Release.project == project)
    .options(
        orm.load_only(Release.project_id, Release.version, Release._pypi_ordering)
    )
    .all()
)
for i, r in enumerate(
    sorted(releases, key=lambda x: packaging_legacy.version.parse(x.version))
):
    r._pypi_ordering = i

And indeed, when I removed this code completely, the N+1 queries went away.

It appears that SQLAlchemy is treating setting a column as a mutating action, even if we're setting the column to the same value that it currently holds. Since in the majority of cases we expect the r._pypi_ordering to stay the same, we can just check if the value needs to be updated, and only set the value if so.

To make matters worse, the ordering of releases can only change when a new release is created, but we were executing this sorting algorithm on every upload, so the N+1 queries were happening for every file upload rather than just the ones that could change the outcome of this query.

Technically, the N+1 queries can still occur, because it's not really N+1 queries, it's 1 query for every modified release, and the fix in this PR relies on the fact that typically people are releasing new, higher versions so the previous versions don't need their ordering number to be modified. However, if they do release a lower numbered version, then any version after that will be "shifted", and ultimately end up getting modified and trigger this behavior again.

That should hopefully be pretty rare, but if we wanted to protect against it, we could move this query out of the ORM completely and just build up a mapping of Release.id to Release._pypi_ordering and use the bulk update mechanism in SQLAlchemy "core".

Aug 09 '24 16:08 dstufft

Oh, and I have like 0 idea why updating Release._pypi_ordering, and integer column on Release makes SQLAlchemy think it needs to lazy load Release.description, a relationship to another table. I'm sure there's some reason, but I'm completely stumped as to why it would need to do that.

Aug 09 '24 16:08 dstufft

warehouse warehouse copied to clipboard

Avoid n+1 queries on Release.description

warehouse
warehouse copied to clipboard