crates.io icon indicating copy to clipboard operation
crates.io copied to clipboard

Improve performance of the reverse dependencies endpoint

Open pietroalbini opened this issue 4 years ago • 4 comments

Me and Justin spent some time a couple days ago trying to figure out why the reverse dependencies endpoint is so slow for crates with a large number of reverse dependencies. This is the EXPLAIN ANALYZE of that query.

The goal of the query is to return all crates whose last version depends on the current crate. As the latest version is not stored anywhere the query is doing a subquery that sorts all the versions of each crate by their semver, which is really slow.

We should denormalize the latest version ID (semver-wise) in the crates table, and change the query to remove the subquery. After removing the subquery the query should be small enough to be converted to Diesel. Ideally denormalizing the latest version ID and updating the query should be done in separate PRs.

pietroalbini avatar May 27 '21 16:05 pietroalbini

Note that this is practically reversing what we did in #592, but this is now starting to actually impact our performance.

pietroalbini avatar May 27 '21 17:05 pietroalbini

Is it possible to use an index instead of a column? That would avoid issues with the column getting out of sync.

jyn514 avatar Jun 27 '21 21:06 jyn514

I'm pretty sure that query is too complex to go on an index. A materialized view might work depending on how long it takes for it to update.

pietroalbini avatar Jun 28 '21 08:06 pietroalbini

Yeah, played around with it and indexes don't seem to help much. This query though takes around 3.2s (explain) on the primary db:

SELECT DISTINCT ON (crate_id) crate_id, id
FROM versions
WHERE NOT yanked
ORDER BY crate_id, to_semver_no_prerelease(num) DESC NULLS LAST;

We could turn it into a materialized view we refresh every time a new crate is published/yanked/unyanked/deleted, but I'm worried its speed will continue to decrease as time goes by. Another option to investigate would be to store the parsed semver in the database, without calling to_semver_no_prerelease in the database which is just slow.

pietroalbini avatar Jun 28 '21 09:06 pietroalbini

I'm happy to report that we've made significant progress on this issue. After applying #8737 , the average query execution time for serde has been reduced from 2.5s to 500ms in production.

eth3lbert avatar May 29 '24 17:05 eth3lbert