
Reduce DB / table sizes

dessalines opened this issue 5 months ago • 9 comments

Requirements

  • [x] Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support or the matrix chat.
  • [x] Did you check to see if this issue already exists?
  • [x] Is this only a feature request? Do not put multiple feature requests in one issue.
  • [x] Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.
  • [x] Do you agree to follow the rules in our Code of Conduct?

Is your proposal related to a problem?

Our 1.0 migrations increase the size of our DB by a good amount. Let's use this issue to brainstorm and strategize on how to reduce it.

#5874 will help, but there are other reductions we could come up with.

Here are the larger table + index sizes for a lemmy.ml prod DB after the 1.0 migrations are run, and after #5873 gets merged (a sketch of the query that produces this listing follows the table):

| table_name | pg_size_pretty | pg_total_relation_size |
| --- | --- | --- |
| post_actions | 25 GB | 26420830208 |
| comment_actions | 24 GB | 26144899072 |
| comment | 14 GB | 14712807424 |
| post | 5863 MB | 6148136960 |
| search_combined | 3643 MB | 3819806720 |
| person_liked_combined | 2077 MB | 2177777664 |
| person_content_combined | 1732 MB | 1816043520 |
| person | 1525 MB | 1598619648 |
| received_activity | 1294 MB | 1356996608 |
| person_actions | 836 MB | 876576768 |
| local_image | 727 MB | 762339328 |
| local_user_language | 505 MB | 529186816 |
| image_details | 379 MB | 397426688 |
| community_actions | 358 MB | 375439360 |
| sent_activity | 221 MB | 232005632 |
| remote_image | 177 MB | 185819136 |
| notification | 168 MB | 176668672 |
| login_token | 145 MB | 151666688 |
| community_language | 110 MB | 115015680 |
| community | 69 MB | 71958528 |
| person_saved_combined | 50 MB | 52240384 |
| modlog_combined | 42 MB | 43548672 |
| site_language | 22 MB | 22896640 |
| post_report | 19 MB | 19668992 |
| local_user | 15 MB | 15966208 |
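
A query along these lines produces that listing (a sketch; not necessarily the exact query used):

```sql
-- total size (heap + indexes + TOAST) per ordinary table, largest first
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS pg_size_pretty,
       pg_total_relation_size(c.oid) AS pg_total_relation_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC;
```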

cc @Nutomic @dullbananas

Describe the solution you'd like.

NA

Describe alternatives you've considered.

NA

Additional context

No response

dessalines · Jul 21 '25 17:07

Does VACUUM FULL reduce the sizes?

dullbananas · Jul 21 '25 17:07

Yes, I ran it for 2 hours but didn't wait for it to finish. It decreased the total size from 130 GB to 120 GB.
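
For anyone reproducing this: VACUUM FULL rewrites each table into new files, so it reclaims dead space but holds an ACCESS EXCLUSIVE lock while it runs. A sketch of running it per table to keep the lock scoped:

```sql
-- each statement blocks reads and writes on that one table until it finishes
VACUUM FULL VERBOSE post_actions;
VACUUM FULL VERBOSE comment_actions;
```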

dessalines · Jul 21 '25 18:07

The current production DB on lemmy.ml is 68.7 GB, so it looks like the migrations almost doubled its size. For comparison, here are the current table sizes:

| table_name | pg_size_pretty | pg_total_relation_size |
| --- | --- | --- |
| comment_like | 15 GB | 15926108160 |
| comment | 14 GB | 15187050496 |
| post_like | 9950 MB | 10433191936 |
| post_aggregates | 9902 MB | 10382999552 |
| comment_aggregates | 4816 MB | 5050376192 |
| post | 3992 MB | 4185677824 |
| received_activity | 2643 MB | 2771550208 |
| post_read | 2614 MB | 2741452800 |
| person_post_aggregates | 1208 MB | 1266163712 |
| person | 932 MB | 977215488 |
| local_image | 875 MB | 917651456 |
| local_user_language | 516 MB | 541343744 |
| image_details | 487 MB | 510148608 |
| sent_activity | 329 MB | 344563712 |
| remote_image | 239 MB | 250462208 |
| community_follower | 220 MB | 230285312 |
| comment_reply | 163 MB | 170565632 |
| login_token | 160 MB | 167682048 |
| community_language | 113 MB | 118497280 |
| person_aggregates | 95 MB | 99262464 |
| community | 55 MB | 58171392 |
| post_saved | 28 MB | 29442048 |
| site_language | 23 MB | 23658496 |
| federation_queue_state | 22 MB | 23494656 |
| post_report | 18 MB | 19177472 |
| local_user | 14 MB | 15007744 |
| registration_application | 14 MB | 14589952 |
| private_message | 10 MB | 10534912 |

So the biggest growth is in post_actions, which is 25 GB, up from the previous 10 GB + 2.6 GB + 1.2 GB (post_like, post_read, person_post_aggregates). comment_actions is similar, now 24 GB, up from 15 GB (comment_like). Both of these look suspicious and can probably be optimized. comment is unchanged, and post decreased to 6 GB from 4 GB + 10 GB (post plus post_aggregates). person and community are still very small. Another big increase is in the combined tables. These could be optimized by removing their timestamps and relying on the timestamps from the respective main table. Other columns are unchanged or irrelevant.
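
A sketch of that combined-table idea, assuming person_content_combined keeps nullable post_id / comment_id references and that the content tables' creation column is named published (both are assumptions here):

```sql
-- hypothetical: read published from the underlying post/comment row instead
-- of duplicating it as a column on person_content_combined
SELECT pcc.*, COALESCE(p.published, c.published) AS published
FROM person_content_combined pcc
LEFT JOIN post p ON p.id = pcc.post_id
LEFT JOIN comment c ON c.id = pcc.comment_id;
```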

Try this to see how exactly space is used in the largest tables: https://dba.stackexchange.com/a/23933

And space used for indexes: https://stackoverflow.com/a/71871616

Storage used for null values: https://stackoverflow.com/a/7383198
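
Before digging into those answers, the built-in size functions give a coarse split of where the bytes go; a sketch for post_actions:

```sql
-- heap (incl. TOAST) vs. index share of the total for one table
SELECT pg_size_pretty(pg_table_size('post_actions'))          AS table_bytes,
       pg_size_pretty(pg_indexes_size('post_actions'))        AS index_bytes,
       pg_size_pretty(pg_total_relation_size('post_actions')) AS total_bytes;
```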

Edit: As @dullbananas mentioned in https://github.com/LemmyNet/lemmy/pull/5873#issuecomment-3101099879, keeping the post_like and comment_like tables separate would most likely prevent the storage growth in post_actions and comment_actions, saving around 20 GB. But it would also add more complexity and require more joins for read queries. So it's better if we can optimize the existing schema using the links above.

Nutomic · Jul 22 '25 09:07

@dessalines Please check the db size again now that https://github.com/LemmyNet/lemmy/pull/5874 is merged.

Nutomic · Oct 17 '25 10:10

Had a look at the size of a table row containing a single post upvote. Based on this, the DB size should be identical to 0.19.

For 0.19:

```sql
select pg_column_size(post_like.*) as rowsize_in_bytes
from post_like where post_id = 3 and person_id = 4;
-- result: 48 bytes
```

For 1.0 (dd034ba99):

```sql
select pg_column_size(post_actions.*) as rowsize_in_bytes
from post_actions where person_id = 11 and post_id = 8;
-- result: 50 bytes
```

One thing I noticed, however, is that we always store a full datetime, which requires 8 bytes to represent timestamps from 4713 BC to 294276 AD at a resolution of 1 microsecond. This is completely unnecessary for our use case, as we only need to represent timestamps between 2020 (when Lemmy was created) and 2040 or so. A unix timestamp would only take 4 bytes to represent timestamps between 1970 and 2038. So we could save a lot of space by changing the storage format for action times.
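
The 8-byte vs. 4-byte difference is easy to confirm from psql; a quick sketch:

```sql
-- a full timestamptz is fixed at 8 bytes; an int4 unix timestamp is 4.
-- whether the 4 bytes are actually saved per row also depends on
-- alignment padding from neighbouring columns.
SELECT pg_column_size(now()) AS timestamptz_bytes,                    -- 8
       pg_column_size(extract(epoch from now())::int4) AS unix_bytes; -- 4
```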

Nutomic · Oct 20 '25 11:10

> Please check the db size again now that https://github.com/LemmyNet/lemmy/pull/5874 is merged.

This takes about a day to run with prod data, and is kind of a hassle, so I'd prefer to do it later, once we're closer to finishing the other things we're working on. We can leave this issue open and run it again closer to the 1.0 release, once all our migrations are finished.

I think we should bite the bullet on postgres timestamps, and just use them. I don't want to y2k lemmy with a 2040 end date. Plus the work to move back to unix times, then transform them again into DateTime<Utc> on the way out, is just not worth it. Let's use the native postgres / diesel timestamps.

dessalines · Oct 20 '25 17:10

It would be relatively easy to create a custom type ActionDateTime which manually implements ToSql and FromSql in order to handle conversions between a unix timestamp and a datetime string. To avoid any problems by 2040 we could even have the timestamps start at 2020; then the valid range would extend until around 2090. Anyway, if the DB size is currently similar to 0.19 this won't be necessary. Or it could be done sometime after 1.0, as it doesn't require a breaking change.
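
To make the range concrete: a 4-byte signed offset counted from a 2020 epoch runs out in 2088, which is roughly the 2090 figure above. A quick sketch (the ActionDateTime conversions themselves would live in the Rust ToSql/FromSql impls):

```sql
-- hypothetical on-disk format: int4 seconds elapsed since 2020-01-01 UTC
SELECT TIMESTAMPTZ '2020-01-01 00:00:00+00'
       + make_interval(secs => 2147483647) AS max_representable;
-- => 2088-01-19 03:14:07+00
```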

Nutomic · Oct 21 '25 08:10

I agree with @Nutomic on not wanting to y2k lemmy, even with a 2090 end date. Also, the 2020 begin date can cause problems with federation, since there are things older than Lemmy.

dullbananas · Oct 29 '25 02:10

Instead of a fixed starting time for the timestamp, we could change post_actions and comment_actions to store the time elapsed since the post/comment itself was created. Then we can store the time in 4 bytes, instead of the current 8 bytes for a full timestamp.
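
A sketch of reading such a delta back, assuming a hypothetical int4 column liked_at_offset holding seconds since the post was published (both names are assumptions):

```sql
-- reconstruct the absolute like time from the per-post 4-byte delta
SELECT p.published + make_interval(secs => pa.liked_at_offset) AS liked_at
FROM post_actions pa
JOIN post p ON p.id = pa.post_id;
```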

Another option I came across is compressed array storage for data like this, although it looks much more complicated to implement.

> As long as your table has 8 columns or less, null storage is effectively absolutely free (as far as disk space is concerned).

From this answer. I'm not sure if the mentioned 8 columns refers to the total, or only to nullable columns. post_actions currently has 10 columns, of which 8 are nullable. We should make sure it is really below the limit, and add an assertion so that the limit is not exceeded by any future change.
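
A sketch of such an assertion, counting both interpretations until we confirm which one the 8-column limit refers to:

```sql
-- counts for a schema test to assert on (both readings of the limit)
SELECT count(*) AS total_columns,
       count(*) FILTER (WHERE is_nullable = 'YES') AS nullable_columns
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'post_actions';
```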

Nutomic · Dec 09 '25 14:12