Reduce DB / table sizes
Requirements
- [x] Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support or the matrix chat.
- [x] Did you check to see if this issue already exists?
- [x] Is this only a feature request? Do not put multiple feature requests in one issue.
- [x] Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.
- [x] Do you agree to follow the rules in our Code of Conduct?
Is your proposal related to a problem?
Our 1.0 migrations increase the size of our DB by a good amount. Let's use this issue to brainstorm and strategize on how to reduce it.
#5874 will help, but there are other reductions we could think of.
Here are the larger table + index sizes for a lemmy.ml prod DB after the 1.0 migrations are run, and after #5873 gets merged (see the query sketch after the table):
| table_name | pg_size_pretty | pg_total_relation_size (bytes) |
|---|---|---|
| post_actions | 25 GB | 26420830208 |
| comment_actions | 24 GB | 26144899072 |
| comment | 14 GB | 14712807424 |
| post | 5863 MB | 6148136960 |
| search_combined | 3643 MB | 3819806720 |
| person_liked_combined | 2077 MB | 2177777664 |
| person_content_combined | 1732 MB | 1816043520 |
| person | 1525 MB | 1598619648 |
| received_activity | 1294 MB | 1356996608 |
| person_actions | 836 MB | 876576768 |
| local_image | 727 MB | 762339328 |
| local_user_language | 505 MB | 529186816 |
| image_details | 379 MB | 397426688 |
| community_actions | 358 MB | 375439360 |
| sent_activity | 221 MB | 232005632 |
| remote_image | 177 MB | 185819136 |
| notification | 168 MB | 176668672 |
| login_token | 145 MB | 151666688 |
| community_language | 110 MB | 115015680 |
| community | 69 MB | 71958528 |
| person_saved_combined | 50 MB | 52240384 |
| modlog_combined | 42 MB | 43548672 |
| site_language | 22 MB | 22896640 |
| post_report | 19 MB | 19668992 |
| local_user | 15 MB | 15966208 |
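For reference, a listing like the one above can be produced with a query along these lines (a sketch, not necessarily the exact query used for these numbers):

```sql
-- Total size per table (heap + indexes + TOAST), largest first.
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS pg_size_pretty,
       pg_total_relation_size(relid) AS pg_total_relation_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```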
cc @Nutomic @dullbananas
Does VACUUM FULL reduce the sizes?
Yes, I ran it for 2 hours but didn't wait for it to finish. It decreased the total size from 130 GB to 120 GB.
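For reference, VACUUM FULL rewrites each table and its indexes, but it holds an exclusive lock while it runs, so running it per table on the largest relations is usually safer:

```sql
-- VACUUM FULL rewrites the table and its indexes, reclaiming dead space,
-- but takes an ACCESS EXCLUSIVE lock on the table for the duration.
VACUUM FULL post_actions;
VACUUM FULL comment_actions;
```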
The current production DB on lemmy.ml is 68.7 GB, so it looks like the migrations almost doubled its size. For comparison, here are the current table sizes:
| table_name | pg_size_pretty | pg_total_relation_size (bytes) |
|---|---|---|
| comment_like | 15 GB | 15926108160 |
| comment | 14 GB | 15187050496 |
| post_like | 9950 MB | 10433191936 |
| post_aggregates | 9902 MB | 10382999552 |
| comment_aggregates | 4816 MB | 5050376192 |
| post | 3992 MB | 4185677824 |
| received_activity | 2643 MB | 2771550208 |
| post_read | 2614 MB | 2741452800 |
| person_post_aggregates | 1208 MB | 1266163712 |
| person | 932 MB | 977215488 |
| local_image | 875 MB | 917651456 |
| local_user_language | 516 MB | 541343744 |
| image_details | 487 MB | 510148608 |
| sent_activity | 329 MB | 344563712 |
| remote_image | 239 MB | 250462208 |
| community_follower | 220 MB | 230285312 |
| comment_reply | 163 MB | 170565632 |
| login_token | 160 MB | 167682048 |
| community_language | 113 MB | 118497280 |
| person_aggregates | 95 MB | 99262464 |
| community | 55 MB | 58171392 |
| post_saved | 28 MB | 29442048 |
| site_language | 23 MB | 23658496 |
| federation_queue_state | 22 MB | 23494656 |
| post_report | 18 MB | 19177472 |
| local_user | 14 MB | 15007744 |
| registration_application | 14 MB | 14589952 |
| private_message | 10 MB | 10534912 |
So the biggest growth is in post_actions, which is now 25 GB compared to the previous 10 GB + 2.6 GB + 1 GB (post_like, post_read, person_post_aggregates). comment_actions is similar, now 24 GB from 15 GB (comment_like). Both of these look suspicious and can probably be optimized. comment is unchanged, and post decreased to 6 GB from 4 GB + 10 GB (post + post_aggregates). person and community are still very small. Another big increase is in the combined tables. These could be optimized by removing timestamps and relying on the timestamps from the respective main tables. Other columns are unchanged or irrelevant.
Try this to see how exactly space is used in the largest tables: https://dba.stackexchange.com/a/23933
And space used for indexes: https://stackoverflow.com/a/71871616
Storage used for null values: https://stackoverflow.com/a/7383198
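In the same spirit as those links (not copied from them), per-column and per-index usage for post_actions can be checked roughly like this; post_id and person_id are the only column names assumed here, anything else would need to come from the actual schema:

```sql
-- Approximate on-disk size of whole rows vs. individual columns.
SELECT pg_size_pretty(sum(pg_column_size(post_actions.*))) AS total_row_size,
       pg_size_pretty(sum(pg_column_size(post_id)))        AS post_id_size,
       pg_size_pretty(sum(pg_column_size(person_id)))      AS person_id_size
FROM post_actions;

-- Size of each index on post_actions, largest first.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'post_actions'
ORDER BY pg_relation_size(indexrelid) DESC;
```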
Edit: As @dullbananas mentioned in https://github.com/LemmyNet/lemmy/pull/5873#issuecomment-3101099879, keeping the post_like and comment_like tables separate would most likely prevent the storage growth in post_actions and comment_actions, saving around 20 GB. But it would also add more complexity and require more joins for read queries. So it's better if we can optimize the existing schema using the links above.
@dessalines Please check the db size again now that https://github.com/LemmyNet/lemmy/pull/5874 is merged.
Had a look at the size of a table row containing a single post upvote. Based on this, the DB size should be nearly identical to 0.19.
For 0.19:
`select pg_column_size(post_like.*) as rowsize_in_bytes from post_like where post_id = 3 and person_id = 4;`
48 bytes
For 1.0 (dd034ba99):
`select pg_column_size(post_actions.*) as rowsize_in_bytes from post_actions where person_id = 11 and post_id = 8;`
50 bytes
One thing I noticed, however, is that we always store a full datetime, which requires 8 bytes to represent timestamps from 4713 BC to 294276 AD at a resolution of 1 microsecond. This is completely unnecessary for our use case, as we only need to represent timestamps between 2020 (when Lemmy was created) and 2040 or so. A unix timestamp would take only 4 bytes to represent timestamps between 1970 and 2038. So we could save a lot of space by changing the storage format for action times.
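The per-value sizes are easy to confirm with pg_column_size:

```sql
-- A timestamptz value takes 8 bytes on disk, a plain integer only 4.
SELECT pg_column_size(now()) AS timestamptz_bytes,   -- 8
       pg_column_size(1::integer) AS integer_bytes;  -- 4
```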
> Please check the db size again now that https://github.com/LemmyNet/lemmy/pull/5874 is merged.
This takes like a day to run with prod data, and is kind of a hassle, so I'd prefer to do this later, once we get closer to finishing a lot of the other things we're working on. We can leave this issue open, and run it when we get closer to the 1.0 release, and all our migrations are finished.
I think we should bite the bullet on postgres timestamps, and just use them. I don't want to y2k lemmy with a 2040 end date. Plus the work to move back to unix times, then transform them again when returning DateTime<Utc>, is just not worth it. Let's use the native postgres / diesel timestamps.
It would be relatively easy to create a custom type ActionDateTime which manually implements ToSql and FromSql in order to handle conversions between a unix timestamp and a datetime string. To avoid any problems by 2040 we could even have the timestamps start at 2020, so the valid range would last until around 2090. Anyway, if the db size is currently similar to 0.19 this won't be necessary. Or it could be done sometime after 1.0, as it doesn't require a breaking change.
I agree with @Nutomic on not wanting to y2k lemmy, even with a 2090 end date. Also, the 2020 begin date can cause problems with federation, since there are things older than lemmy.
Instead of a fixed starting time for the timestamp, we could change post_actions and comment_actions to store the time elapsed since the post/comment itself was created. Then we could store the time in 4 bytes, instead of the current 8 bytes for a full timestamp.
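A rough sketch of what that could look like on the SQL side; the liked and published column names are assumptions, not necessarily the actual schema:

```sql
-- Hypothetical: encode the action time as whole seconds elapsed since the
-- post was published; a signed 4-byte integer covers roughly 68 years.
SELECT extract(epoch FROM (pa.liked - p.published))::integer AS liked_offset_secs
FROM post_actions pa
JOIN post p ON p.id = pa.post_id;
```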
Another option I came across is compressed array storage for data like this, although it looks much more complicated to implement.
> As long as your table has 8 columns or less, null storage is effectively absolutely free (as far as disk space is concerned).
From this answer. I'm not sure if the mentioned 8-column limit refers to the total number of columns, or only to nullable ones. post_actions currently has 10 columns, of which 8 are nullable. We should ensure that it is really below the limit, and add an assertion so that the limit is not exceeded by any future change.
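Checking where post_actions currently stands is straightforward, and the same count could back such an assertion:

```sql
-- Total and nullable column counts for post_actions.
SELECT count(*) AS total_columns,
       count(*) FILTER (WHERE is_nullable = 'YES') AS nullable_columns
FROM information_schema.columns
WHERE table_name = 'post_actions';
```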