Make vote timestamps optional (ref #5016)
Turns out timestamps and `comment_like.post_id` aren't used at all in the backend. So this PR adds a setting `database.store_vote_timestamps: bool` to avoid storing them (default false). Also `comment_like.post_id` is removed. If timestamps are enabled, they are now returned by the API in like list endpoints. With the setting disabled, storage usage changes like this:
Previous per-row size of the `post_like` columns: 4 + 4 + 2 + 8 = 18 bytes. Previous per-row size of `comment_like`: 4 + 4 + 2 + 8 + 4 = 22 bytes.
After this change both are 10 bytes, so the table size should also be roughly halved; in the case of lemmy.ml, from 15 GB to ~7.5 GB. With a 40 GB total DB size this is a reduction of about 19%.
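Roughly, the schema side of this looks like the following (a sketch, not the exact migration in this PR; it assumes the timestamp columns are named `published`):

```sql
-- Drop the unused column and allow NULL vote timestamps.
ALTER TABLE comment_like DROP COLUMN post_id;
ALTER TABLE post_like ALTER COLUMN published DROP NOT NULL;
ALTER TABLE comment_like ALTER COLUMN published DROP NOT NULL;
```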
If this isn't used, how does Lemmy deal with receiving vote activities out of order? For example, someone upvotes a post, then downvotes a post, but due to parallel sending and a random connection failure the downvote gets sent first, then the upvote gets sent. The result should still show the user having downvoted the post on remote instances.
I mentioned this in the other issue already: would it save a meaningful amount of space if the column was nullable and a scheduled task set `published` to NULL for all votes older than, e.g., a week?
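Something like this as a scheduled task (just a sketch, assuming nullable `published` columns):

```sql
-- Hypothetical cleanup job: drop vote timestamps older than a week.
UPDATE post_like SET published = NULL
WHERE published < now() - interval '1 week';

UPDATE comment_like SET published = NULL
WHERE published < now() - interval '1 week';
```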
The timestamp for outgoing activities is generated in a different place, and doesn't use these columns at all.
I'm talking about the receiving instance. For example, if lemmy.ml sends these activities out of order to lemm.ee, how would lemm.ee know to drop the older but later-received vote? Even if there are some safeguards within Lemmy on the sending side, the same issue would exist with other apub software that sends activities asynchronously and in parallel.
At the moment there is no logic on the receiving side to reorder incoming activities. So we trust that sending happens in the correct order (which is true for Lemmy).
I attempted to implement this in https://github.com/LemmyNet/lemmy/pull/4559 but it was too complicated.
It's used in vote abuse tracking.
Just to mention, AFAIK deleting a column doesn't actually free any space for existing databases, and the methods to do so require a full exclusive lock (VACUUM FULL): https://dba.stackexchange.com/questions/117510/reclaim-disk-space-from-dropped-column-without-downtime . That doesn't mean this is a bad idea though, and I'm not sure how new rows are affected.
We can mention in the release notes that running a vacuum reclaims disk space. Or how about changing the columns to nullable and setting all values to NULL before dropping them? That should theoretically reclaim most of the space.
Edit: Added this now.
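For reference, actually returning the space to the OS for existing rows would need something like this (which takes an exclusive lock, as noted above):

```sql
-- VACUUM FULL rewrites the table and returns freed space to the OS,
-- but holds an exclusive lock on the table for the duration.
VACUUM FULL post_like;
VACUUM FULL comment_like;
```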
Space is reclaimed by Postgres, but without a full vacuum the OS will not be notified about it. In general that's not an issue, as Postgres will reuse that space if and when necessary for other things.
> It's used in vote abuse tracking.
Can concur. There was a vote manipulation campaign recently, and having the timestamps available for analysis proved very helpful in identifying participating accounts.
I was actually hoping for the opposite of this PR: the vote timestamps being added to the API response so I didn't have to always run analysis against the database.
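For context, the kind of query I mean is along these lines (illustrative only, not the actual analysis):

```sql
-- Illustrative sketch: accounts casting an unusually large number of votes
-- within a single hour, one possible signal of coordinated voting.
SELECT person_id,
       date_trunc('hour', published) AS hour_bucket,
       count(*) AS votes_in_hour
FROM post_like
WHERE published > now() - interval '7 days'
GROUP BY person_id, date_trunc('hour', published)
HAVING count(*) > 100
ORDER BY votes_in_hour DESC;
```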
Updated, see op for details.
In that case, maybe let's just keep the timestamp columns and only remove the post_id one. We'll have to think of other ways to save space, like making the score columns boolean instead of smallint (2 B -> 1 B). But that could be done later.
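The score change would be something like this (a sketch, assuming upvote/downvote maps to true/false):

```sql
-- Store the vote as a boolean (upvote = true, downvote = false)
-- instead of a smallint, saving one byte per row.
ALTER TABLE post_like ALTER COLUMN score TYPE boolean USING (score = 1);
ALTER TABLE comment_like ALTER COLUMN score TYPE boolean USING (score = 1);
```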
> Turns out timestamps and `comment_like.post_id` aren't used at all in the backend. So this PR adds a setting `database.store_vote_timestamps: bool`
I'm not a big fan of that solution, because timestamps affect every table equally, and I think we should just accept that it's a space hit we can't avoid. It could get increasingly complicated if we start adding special rules for specific tables. There's gotta be other ways to deal with the space issue.
There is nothing complicated: it just checks the setting before writing a vote, and either sets the timestamp to the current time or to NULL. And the column isn't used anywhere in the backend, so the actual value doesn't matter for us. Most instances won't enable the setting, and will benefit from a 20% lower DB size, which is very significant.
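Concretely, the write ends up looking something like this (a sketch only; the actual code goes through Diesel, and `$4` stands in for the setting value):

```sql
-- Sketch of the vote insert with the setting applied
-- ($4 = database.store_vote_timestamps).
INSERT INTO post_like (post_id, person_id, score, published)
VALUES ($1, $2, $3, CASE WHEN $4 THEN now() ELSE NULL END);
```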
I won't be a blocker for adding it then, but I'd like @phiresky @dullbananas and @SleeplessOne1917 to chime in too.
Timestamp shouldn't be optional.
There are some possible ways to reduce the storage space of the timestamp, all of which will be more effective if it's replaced with the duration between the post or comment timestamp and the vote (see the sketch after this list):
- Split into two 32 bit integers, and make one of them optional
- Use an extension that provides column compression (not timescaledb because that's proprietary)
- Compression in the file system
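A minimal sketch of the duration idea, using a hypothetical `published_offset_secs` column:

```sql
-- Store the vote time as seconds since the post was published
-- (4-byte nullable integer) instead of an 8-byte timestamptz.
ALTER TABLE post_like ADD COLUMN published_offset_secs integer;

-- Reconstructing the absolute vote timestamp when needed:
SELECT pl.post_id,
       pl.person_id,
       p.published + pl.published_offset_secs * interval '1 second' AS vote_published
FROM post_like pl
JOIN post p ON p.id = pl.post_id;
```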
What's wrong with an optional timestamp if it's not required anywhere in the code? We can rename it to `published_opt` to make it completely clear that it's not a normal timestamp. It's really a minor problem considering that it reduces DB size by a fifth.
Using some kind of manual compression would be too complicated, and would not allow SQL queries to find abusive votes. From what I can find, Postgres doesn't have any compression that would help in this case.
There is actually a place where vote timestamps are used in Lemmy: community and site aggregates for active user counting
https://github.com/LemmyNet/lemmy/blob/859dfb3f81235de229945d1d4f41717dc320442b/migrations/2023-12-06-180359_edit_active_users/up.sql#L35-L55
https://github.com/LemmyNet/lemmy/blob/859dfb3f81235de229945d1d4f41717dc320442b/migrations/2023-12-06-180359_edit_active_users/up.sql#L92-L110
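In other words, the active user counts depend on queries roughly like this (a paraphrase, not the exact migration SQL):

```sql
-- Counting voters as active users requires the vote timestamps.
SELECT count(DISTINCT person_id)
FROM post_like
WHERE published > now() - interval '1 month';
```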
@Nothing4You You're right, then this is not really an option. Unless we only delete timestamps for votes that are older than six months, but that's probably not worth the effort.
Can you re-open this one and just do the post_id column removal? The other PR got closed, but its migration was also incorrect.
I can do this if you like.
https://github.com/LemmyNet/lemmy/pull/5134