Tackling the Big, Bad `post_view` query
After some conversation back and forth, I was asked to continue the discussion here
Context links: https://lemmy.ml/post/27659153/17534521 https://matrix.to/#/!tUnhsBRCsiePXfsIGe:matrix.org/$OufmGMnF3ndZ2gP4Xo52qZyktt2kk8wE0hA5y1prQ10?via=matrix.org&via=mozilla.org&via=tchncs.de
====== Lemmy comment text ======
Good evening Dessalines, I have started looking at the posts query.
The lowest-hanging fruit, I think, would be replacing some of the joins with WHERE EXISTS, which can have a huge impact on query time. It seems this is supported in Diesel: https://stackoverflow.com/a/74300447
This is my first time looking at the codebase, so I can't tell yet which joins are purely for filtering (in which case they can be replaced by WHERE EXISTS) and which joins need to be left in because some of their columns end up in the final SELECT.
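To make this concrete, here's the kind of rewrite I have in mind - a sketch only, with column names guessed rather than taken from the actual post_view code:

```sql
-- Hypothetical "before": a join whose only purpose is filtering. No
-- community_follower columns appear in the SELECT list.
SELECT p.id, p.name, p.published
FROM post p
JOIN community_follower cf ON cf.community_id = p.community_id
WHERE cf.person_id = 123;

-- Hypothetical "after": the same filter as a semi-join. EXISTS can never
-- duplicate post rows (a plain join could, if multiple follower rows matched)
-- and makes it clearer to the planner that the table is only a filter.
SELECT p.id, p.name, p.published
FROM post p
WHERE EXISTS (
  SELECT 1
  FROM community_follower cf
  WHERE cf.community_id = p.community_id
    AND cf.person_id = 123
);
```

Which joins actually fall into that bucket is exactly what I still need to figure out.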
I can't tell for sure yet, but it also looks like this might be using `LIMIT ... OFFSET` pagination? That can be a real drag on performance, but it isn't as easy to fix.
EDIT:
Looking some more, and reading some linked GitHub discussion, I think really getting this out of the performance pits will require some denormalization, like a materialized view or manual cache tables populated by triggers. I really like the ranking algorithm, but so far I'm finding it difficult to optimize from a query perspective.
====== Matrix message text ======
Good day Lemmy dev team,
Following up on this thread: https://lemmy.ml/post/27659153/17538040
I want to help out and I think there are huge performance gains to be had but I want to get a pulse check from the maintainers before I go off making big changes.
Some of my preliminary questions:
- Would there be openness to schema changes? What if the migrations have to be written in raw SQL? Are there ideological objections to triggers / cache tables / materialized views or denormalization in general?
- Would there be openness to breaking this query up? Right now it seems to handle every possible permutation of inputs, and it's difficult to follow and optimize. (I'm thinking the admin/moderator view would be the first subset of inputs to break off into a fully or partially separate code path.)
- If we had to change the FE <=> BE API around pagination for posts to get rid of `LIMIT ... OFFSET` pagination, can we do that breaking change as part of the push to 1.0?
- I see some other people poking around in the slow queries too, do they need/want to be involved?
I didn't even mention the question of testing; I think ultimately the project should have a way to bulk-seed a local postgres with semi-realistic fake data. I would be open to working on that as well.
cc: @Nutomic @dessalines @dullbananas
I think there are a bunch of "nibble around the edges" changes that could be made, especially around WHERE EXISTS, but there's only so much you can do up against limit/offset and the Triumvirate of the ranking algorithm, federation, and community subscriptions.
Since the term (Time + 2)^Gravity is always going to dominate as Time -> infinity, I believe we can get a rough estimate of Rank from the time alone. Therefore, I think the way to get speed is to denormalize every post in the database into hourly "tranches" based on original post time, then, when paging, only evaluate one tranche at a time until the page request is satisfied. Not only should this massively cut down on the number of Post records the planner has to consider at once, but I also think it brings us to the brink of "keyset pagination".
Clearly there will be difficulties around the boundaries of the tranches - maybe daily tranches are better? Any time restriction that excludes most of the Post records in a fast way is the key idea.
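To sketch what I mean (purely illustrative - the gravity constant, column names, and window handling are placeholders, not the real ranking code): the application would probe one window at a time and only widen or advance it when the page isn't full.

```sql
-- Hypothetical: evaluate one hourly "tranche" of posts at a time.
-- :tranche_start / :tranche_end bound the window; the caller keeps moving
-- the window until it has collected a full page of results.
SELECT
  id,
  score / power(extract(epoch FROM now() - published) / 3600 + 2, 1.8) AS approx_rank
FROM post
WHERE published >= :tranche_start
  AND published <  :tranche_end
ORDER BY approx_rank DESC
LIMIT 20;
```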
EDIT:
I now see and understand that the rank scores are already being stored on the Post and that those columns are indexed. This should already address, to a large extent, some of the things I was saying in the previous paragraphs about querying by Rank.
The things I was saying about pagination and joins are, I think, more relevant now, and I'm going to look there first.
Now is definitely the perfect time to make breaking changes. And database optimization is something we can use a lot of help with, so thanks for looking into this!
Would there be openness to schema changes? What if the migrations have to be written in raw SQL? Are there ideological objections to triggers / cache tables / materialized views or denormalization in general?
Schema changes are fine. Migrations are already written in raw SQL (see the migrations folder). There are also a bunch of triggers. We used materialized views in the past but got rid of them (https://github.com/LemmyNet/lemmy/pull/908). Tables are already denormalized in some cases to improve performance, so that's okay.
Would there be openness to breaking this query up? Right now it seems to handle every possible permutation of inputs, and it's difficult to follow and optimize. (I'm thinking the admin/moderator view would be the first subset of inputs to break off into a fully or partially separate code path.)
Makes sense to me, post_view is very complex so it would be easier to maintain as several smaller files.
If we had to change the FE <=> BE API around pagination for posts to get rid of `LIMIT ... OFFSET` pagination, can we do that breaking change as part of the push to 1.0?
@dessalines is already working on removing all the leftover limit/offset pagination in favor of cursors (https://github.com/LemmyNet/lemmy/pull/5424, https://github.com/LemmyNet/lemmy/pull/5429).
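For context, the difference in a nutshell (a generic sketch with illustrative column names, not the actual Lemmy queries):

```sql
-- Offset pagination: to serve page 501 the database still has to produce
-- and throw away the first 10,000 rows.
SELECT id, hot_rank, published
FROM post
ORDER BY hot_rank DESC, published DESC, id DESC
LIMIT 20 OFFSET 10000;

-- Keyset (cursor) pagination: the client echoes back the sort key of the
-- last row it saw, and an index on (hot_rank DESC, published DESC, id DESC)
-- can seek straight to that point.
SELECT id, hot_rank, published
FROM post
WHERE (hot_rank, published, id) < (:last_hot_rank, :last_published, :last_id)
ORDER BY hot_rank DESC, published DESC, id DESC
LIMIT 20;
```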
I didn't even mention the question of testing; I think ultimately the project should have a way to bulk-seed a local postgres with semi-realistic fake data. I would be open to working on that as well.
This definitely sounds useful. @dullbananas already created a small performance test: https://github.com/LemmyNet/lemmy/tree/main/crates/db_perf
Lemmy uses keyset pagination already (except for legacy support). The ranking functions are all indexed where possible (and denormalized). We already replaced most of the joins with subqueries as far as I remember.
Afaik it is not possible to optimize the post query much further, so I wouldn't get my hopes up too much.
The problem is not solvable: a user has subscribed to an arbitrary subset of all communities. Depending on the communities subscribed to, the front page may contain posts from widely varying time spans. If a user subscribes to many popular communities, the time frame will only be a few hours; a user who subscribes to few communities may see a time frame of multiple weeks.
The time frame to query differs for every community based on its respective popularity.
PostgreSQL sometimes chooses to filter communities first and apply ranking second; sometimes it chooses to filter by ranking first and communities second. You can see this in the query plans. The first option is to basically loop through every community the user subscribes to individually, get the corresponding page (based on ranking), then join. The other is to get posts from all communities, then filter out the ones not subscribed to. PG does all this internally, based on global statistics.
This decision should be made on a per-user basis though, which the PG planner does not really do - this is something we could do manually in theory or maybe by tweaking the PG statistics. It would require multivariate statistics - the PG planner would need to at least know for every user ID the number of communities to expect, and to be good it needs to know the size of each of those communities as well (to estimate number of posts to expect per ranking). I doubt it's possible honestly.
Making the decision manually could work, but it would make the code a lot more complicated, potentially incomprehensible, and I don't think the benefit would be large.
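For reference, the first strategy can be spelled out manually with a LATERAL join - a rough sketch with illustrative column names, not something I'm proposing to ship as-is:

```sql
-- For each subscribed community, take its own top-N by rank (each inner
-- query can use an index on (community_id, hot_rank)), then merge the small
-- per-community result sets and take the global top 20.
SELECT top_posts.*
FROM community_follower cf
CROSS JOIN LATERAL (
  SELECT p.id, p.community_id, p.hot_rank
  FROM post p
  WHERE p.community_id = cf.community_id
  ORDER BY p.hot_rank DESC
  LIMIT 20
) AS top_posts
WHERE cf.person_id = :person_id
ORDER BY top_posts.hot_rank DESC
LIMIT 20;
```

Choosing between this and the global-sort-then-filter plan per user is exactly the manual decision described above, which is why I don't think it's worth the complexity.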
The only "real" improvement afaik would be to cache the front page based on the user asking for it. That's what reddit does (did), cache the front page view of each user for an hour or so, including the next 10-20 pages.
For joins, the important part (afair) is to reduce the number of joins below the geqo limit (default 12 joins). Afaik this is the case currently. Below that limit, I don't think there's a performance difference between a JOIN and a subquery, since the planner will treat them the same (as long as the statistics don't tell it to treat the join as an early filter).
I really like the ranking algorithm, but so far I'm finding it difficult to optimize from a query perspective.
The ranking-type scores are updated periodically and cached on the `post` table directly. So we can (and do) use indexes directly for those. But there are a LOT of filters and sorting fields, so it's really difficult to know how well our indexes are working.
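Roughly, the kind of index this relies on looks like the following (illustrative only - the real definitions and partial-index conditions live in the migrations):

```sql
-- Because the rank score is a plain, periodically-updated column on post,
-- an ordinary btree can serve ORDER BY ... LIMIT without evaluating the
-- ranking function at query time.
CREATE INDEX idx_post_hot_rank
  ON post (hot_rank DESC, published DESC)
  WHERE NOT deleted AND NOT removed;
```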
I think the best place to start would be to expand upon @dullbananas's db_perf tests. Predictably measuring performance of all our queries is the best first step, then making improvements / tweaks and comparing the before / after.
Or if there are some postgres analyzer visualizers that can help us identify the join / query bottlenecks.
I see the most potential gains coming from tweaking indices, and possibly moving some things to WHERE EXISTS like you mentioned.
post_view could be better broken up, but I don't know that it'd help us that much. Having comments detailing index ordering (after we figure out what's best from testing) might be more helpful.
The only "real" improvement afaik would be to cache the front page based on the user asking for it. That's what reddit does (did), cache the front page view of each user for an hour or so, including the next 10-20 pages.
If this ends up being our only choice (for the subscribed query at least), then I wouldn't be opposed to adding a postgres table for this (similar to our combined tables), i.e. a `local_user_subscribed_post {local_user_id, post_id, ...maybe sorting columns too?}`, built by either triggers or scheduled jobs.
Should be a last resort though.
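If we did go there, a minimal sketch of what it could look like (names, types, and columns are guesses, not a worked-out design):

```sql
-- Hypothetical per-user cache of the subscribed feed, filled by triggers or
-- a scheduled job, and read with a single index range scan per user.
CREATE TABLE local_user_subscribed_post (
  local_user_id int  NOT NULL REFERENCES local_user ON DELETE CASCADE,
  post_id       int  NOT NULL REFERENCES post ON DELETE CASCADE,
  hot_rank      real NOT NULL,          -- copied sort keys, so the feed can be
  published     timestamptz NOT NULL,   -- paged without joining back to post
  PRIMARY KEY (local_user_id, post_id)
);

CREATE INDEX idx_lusp_feed
  ON local_user_subscribed_post (local_user_id, hot_rank DESC, published DESC);
```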
@Nutomic @phiresky @dessalines
Thank you for your responses. I still have a trick up my sleeve that I think will work but I will have to prove it out in code.
I agree in principle that having the performance tests as the top priority is the way to go, but my brain just can't let go of this query optimization idea.
My goal is to have a (semi-?)functional draft by Monday, could be wildly optimistic.
For joins, the important part (afair) is to reduce the number of joins below the geqo limit (default 12 joins). Afaik this is the case currently. Below that limit, I don't think there's a performance difference between a JOIN and a subquery, since the planner will treat them the same (as long as the statistics don't tell it to treat the join as an early filter).
There are also `from_collapse_limit` and `join_collapse_limit`, which are 8 by default.
https://www.postgresql.org/docs/17/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
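For reference, these are ordinary settings that can be inspected and tweaked per session while experimenting:

```sql
-- Current PostgreSQL defaults: geqo_threshold = 12,
-- from_collapse_limit = 8, join_collapse_limit = 8.
SHOW geqo_threshold;
SHOW from_collapse_limit;
SHOW join_collapse_limit;

-- Raising the collapse limits lets the planner consider reordering larger
-- join trees (at the cost of more planning time) - handy when testing
-- whether a query is being hurt by the limits.
SET join_collapse_limit = 12;
SET from_collapse_limit = 12;
```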
#5556 adds a few more joins to post_view and comment_view to detect bans from different instances. This brings the number of joins up from 9 to 11 for each of them. These views are critical for performance, so we will have to optimize that, e.g. by adding post.creator_banned and comment.creator_banned fields, populated by triggers, which can reduce the number of joins to 7 each.
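Roughly, such a trigger-maintained flag could look like the sketch below (hypothetical - the real ban tables differ, and the unban/expiry direction would need handling too):

```sql
-- Denormalize the ban state onto post so post_view doesn't need the extra joins.
ALTER TABLE post ADD COLUMN creator_banned boolean NOT NULL DEFAULT false;

CREATE FUNCTION post_creator_banned_update() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  -- Mark every post by the banned person in the affected community.
  UPDATE post
  SET creator_banned = true
  WHERE creator_id = NEW.person_id
    AND community_id = NEW.community_id;
  RETURN NEW;
END;
$$;

CREATE TRIGGER post_creator_banned
AFTER INSERT ON community_person_ban
FOR EACH ROW EXECUTE FUNCTION post_creator_banned_update();
```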
Just a random example: this is what YouTube shows when you have the home page open for a while and then try to load more elements (scroll down). It indicates they also generate the feed, then cache it for a while.
Two questions:
- If we had a `local_user_subscribed_post {local_user_id, post_id}` table, it would need to contain every post in those subscribed communities, no? Otherwise we wouldn't be able to sort by, say, `Old`.
  - Either that, or we'd have to restrict the sorts for subscribed, and possibly move that to its own endpoint to enforce its own sorts.
- We'd also need to include all the sorting columns on the above table, correct? Otherwise we wouldn't be able to use @dullbananas's key pagination library.
Fixing this must include index changes to match the changes made to post_view in #5429
Let me post this here as well since I think it's no longer in the minds of the people currently thinking about posts performance. This was my previous PR that implemented cursor pagination and optimized the subscribed post view using a two-bounded prefetch:
PR: https://github.com/LemmyNet/lemmy/pull/3872
Important Comment 1: https://github.com/LemmyNet/lemmy/issues/2877#issuecomment-1673597190 Important Comment 2: https://github.com/LemmyNet/lemmy/issues/2877#issuecomment-1674869898
Comment 2 contains a standalone SQL example that shows how PostgreSQL cannot possibly fetch a "subscribed" front page optimally due to limitations of the query planner - even without a single join. The workaround is implemented as described in the PR.
Here is a diagram showing how the ALL fetches should work:
Here's a diagram showing how the Subscribed post fetch works now and why it previously was slow:
I'm of course not against re-adding this for pagination reasons (#5621), but I think having a front-page cache, with a local_user_subscribed_post table, and possibly a limited set of subscribed sorts, would be a better long-term solution.
I still have some questions above too.
I think a front-page cache is a reasonable solution (though kind of unrelated to the prefetch question - even with caching, we want the query to be optimal)
But I think it needs some data in order to be useful: statistics on what we actually need and how many queries it would save.
For example, here's what I think might be reasonable:
- We create a unique identifier for most post queries a user can make (e.g. Subscribed vs. Local vs. All, Hot sort vs. Top time-frame sort, ...) - basically a hash of the query.
- We have a table like `post_views_cache (query_hash, cache_timestamp, post_rank, post_id)` (see the sketch after the notes below).
- When a user requests a post view:
  - we look into the cache, and if there are values less than an hour old, we return those;
  - if not, the real query is run for the first 10 pages, and the result is stored in the cache table.
- In a regular refresh task, all rows older than an hour are deleted from the cache table.
Notes:
- I don't think it makes sense to split this cache primarily by user IDs, because we'd want to have a common cache for at least the "local" page and the "all" page.
- This could be reduced by only storing in cache for certain queries (e.g. only the "hot" sort), but I think it makes sense to leave the structure generic
- It likely makes sense to cache more than the first page, and rather fetch 10 pages or so at once, simply due to the overhead of one query
- Pagination key would likely be different then
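Here's a minimal sketch of the table and read path I mean (everything illustrative - names, types, and the exact cache key are open questions):

```sql
-- Hypothetical generic cache keyed by a hash of the query parameters.
CREATE TABLE post_views_cache (
  query_hash      bigint      NOT NULL,
  cache_timestamp timestamptz NOT NULL DEFAULT now(),
  post_rank       int         NOT NULL,  -- position within the cached result
  post_id         int         NOT NULL,
  PRIMARY KEY (query_hash, post_rank)
);

-- Read path: serve cached rows if they are fresher than one hour.
SELECT post_id
FROM post_views_cache
WHERE query_hash = :query_hash
  AND cache_timestamp > now() - interval '1 hour'
ORDER BY post_rank
LIMIT 20 OFFSET :page_offset;

-- Cleanup in a regular refresh task.
DELETE FROM post_views_cache
WHERE cache_timestamp < now() - interval '1 hour';
```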
Now the question is: What is actually the cache hit rate for this kind of cache for a normal instance? Because if the cache hit rate is < 50%, it's kinda worthless. So it's a question of how people actually interact with an instance.
I don't know that we need a cache for most of the local / all / specific-community post queries, because they can use indexes and filters to get down to < 50 ms. Everything necessary is already on the post table, so we can just use proper indexes there for everything we need. In those cases a cache would just overcomplicate things.
But with the subscribed query, as you discovered when doing paging, it needs to join to the community_follower table (which is unique for every user), so postgres has to do everything in memory, and it can't take advantage of simple table indexes (the necessary info is split out into two tables after all).
A lot of the local / all caching for logged-out users can be handled by nginx, and it doesn't really need more DB caching.
In general I hate caching because of the complication it adds, and we should always prefer source data when possible. But I think subscribed merits its own cached table because of these issues.
EDIT:
The prefetch also has to do with paging, but a cached table for local_user_subscribed could even just be periodically filled with the last X rows (maybe 1000), so it wouldn't need any prefetches.
Sure, the subscribed query is the one where it's most clear that a single index scan can't really reduce the row count down to ±2x the resulting row count. There are many variants there though, depending on the sorting flag, the time-frame filter, the "saved posts only" thing, etc.
"could just be periodically filled"
Do you mean to actually pre-fill this table for all users, regardless of whether the user is currently active? That sounds like a really bad idea - just imagine how that would scale: 100k user accounts times 1000 posts times the number of queries you want to cache, and most of those user accounts are likely dead.
Agree, thinking about it more, it'd be really intensive in terms of both disk space and the cost of filling the join.
Even if the table were only (post_id, local_user_id, time), that'd be a minimum of 4 + 4 + 8 = 16 bytes per row.
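As a rough worked example of the scaling concern above (assuming 16-byte payloads and ignoring PostgreSQL's ~24-byte per-row header and index overhead, which would multiply this several-fold): 100,000 users × 1,000 cached posts each is 100 million rows, or about 1.6 GB per cached query variant - most of it spent on inactive accounts.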
Do you have timings, or even better, query plans (auto analyze / pg_stat_statements) for slow post view queries? Because when I added the two-sided boundary, query times in artificial tests were < 100 ms, and I thought results were similar in the production environment as well. Were there maybe some other regressions?
I haven't, outside of one speed test I added in post_view, but part of this issue should be adding more.
If it's < 100 ms that should be fine, but it'd be nice to have this all in one table, or done in some way where postgres could do a better job of the ORDER BY / LIMIT.
I'll start to re-add the prefetch in #5621
Just random info I found: https://www.reddit.com/r/TheoryOfReddit/comments/8yg740/does_reddit_still_select_at_random_only_50_of/
To date, we haven't been able to make more than 250 run fast enough at our scale. Each time the feed is refreshed for someone we load up to 1K posts per subscription or 250K total posts.
So basically, they have (had) a cache duration of 30 minutes, and the fetch used to take 50 random subscribed communities, fetch 1k posts for each of them, mergesort, and save that in cache.
I think we could definitely benefit from a local_user_subscribed_post_cache table that fills periodically, for active users, and has some given time limits. I don't see any other way around the problem.