Split activity table into sent and received parts (fixes #3103)
Received activities are only stored in order to avoid processing the same incoming activity multiple times; for that purpose it is completely unnecessary to keep the activity data itself. So we can split the table into sent_activity and received_activity parts, where only the sent_activity table needs to store the activity data. This should reduce storage use significantly.
Also reduces activity storage duration to three months; we can reduce this further if necessary.
Additionally, the id columns of the activity tables are removed because they are completely unused and risk overflowing (fixes #3560).
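For illustration only, a minimal sketch of the described shape (not the actual migration; column names and types are assumptions taken from this thread, and the update further down re-adds bigserial id columns):

```rust
// Illustrative diesel schema sketch: received activities keep just enough to
// deduplicate incoming activities, while sent activities keep the full data.
diesel::table! {
    sent_activity (ap_id) {
        ap_id -> Text,
        data -> Json,
        sensitive -> Bool,
        published -> Timestamp,
    }
}

diesel::table! {
    received_activity (ap_id) {
        ap_id -> Text,
        published -> Timestamp,
    }
}
```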
I was actually using the id column on sent activities in my (unfinished) experiment with a persistent and robust outgoing activity queue, so it would be helpful to keep it for that purpose. You can easily replace it with a bigserial to avoid the wrapping problem, though.
Since the published column is always initialized with now(), that would be a kind of replacement - but system time isn't really guaranteed to be monotonic (e.g. summer time due to the NaiveDateTime use, NTP synchronization), and a sequential id makes sharding of fetches from the database easy.
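Purely as an illustration of that last point (this is not the experiment mentioned above; the table/column names and the bigserial id are assumptions): a monotonically increasing id gives a simple, stable cursor for batched queue fetches, which a published timestamp set from now() cannot guarantee.

```rust
use diesel::prelude::*;

// Assumed schema fragment: only the bigserial id this sketch relies on.
diesel::table! {
    sent_activity (id) {
        id -> Int8,
    }
}

// Hypothetical queue fetch: resume strictly after the last processed id.
// This works because a bigserial id only ever increases, unlike system time;
// sharding across workers could additionally filter on id modulo worker count.
fn next_outgoing_batch(
    conn: &mut PgConnection,
    last_id: i64,
    batch_size: i64,
) -> QueryResult<Vec<i64>> {
    sent_activity::table
        .filter(sent_activity::id.gt(last_id))
        .order(sent_activity::id.asc())
        .limit(batch_size)
        .select(sent_activity::id)
        .load(conn)
}
```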
If this is only because of disk space concerns, then this is probably an overly-complicated fix. Pictures are the disk-space killer, not the activity table.
This could be simplified by doing one of the fixes here: deleting every activity older than 3 months, or even 1 month, in scheduled jobs, as they mostly function as logging tables anyway.
The activity table is absolutely a disk space concern: on lemmy.ml it currently uses 24 GB, while the next biggest table (comment_like) has only 470 MB. And the vast majority of stored activities are remote, meaning we don't have any reason to store their content. On lemmy.ml, for the last month there are 20 million remote and 4.5 million local activities stored. So the table split alone will reduce lemmy db usage to about 1/5.
Updated:
- added bigserial id columns
- use json instead of jsonb for smaller size
- limit migration to 100k rows for sent and 1m for received (not sure if those numbers are good)
- use exists and a transaction to check that a sent activity wasn't inserted before (see the sketch after this list); `on_conflict_do_nothing()` doesn't work because it doesn't actually return an `Option`, so there is no way to tell whether the value was inserted or whether it conflicted with an existing row
- move `insert_received_activity` from the receive to the verify handlers, so that an activity which fails verification doesn't get verified multiple times
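A rough sketch of that exists-plus-transaction check (table, column and function names are assumptions, not the actual PR code; as pointed out just below, the transaction does not actually close the race between the exists() and the insert):

```rust
use diesel::dsl::exists;
use diesel::prelude::*;

// Assumed schema fragment for this sketch.
diesel::table! {
    sent_activity (ap_id) {
        ap_id -> Text,
    }
}

// Illustrative check-then-insert inside a transaction. Note that Postgres
// cannot lock a row that does not exist yet, so a concurrent task can still
// insert between the exists() check and the insert (see the discussion below).
fn insert_if_new(conn: &mut PgConnection, ap_id: &str) -> QueryResult<bool> {
    conn.transaction(|conn| {
        let already_there: bool = diesel::select(exists(
            sent_activity::table.filter(sent_activity::ap_id.eq(ap_id)),
        ))
        .get_result(conn)?;
        if !already_there {
            diesel::insert_into(sent_activity::table)
                .values(sent_activity::ap_id.eq(ap_id))
                .execute(conn)?;
        }
        Ok(!already_there)
    })
}
```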
Awesome!
> use exists and a transaction to check that sent activity wasn't inserted before. `on_conflict_do_nothing()`
I want to mention that wrapping it in a transaction does nothing here - PostgreSQL has no way of locking a non-existent row. So the `exists()` call does not actually prevent another task from inserting a row between the `exists()` and the INSERT, regardless of whether there is a transaction or not.
I suggested `on_conflict_do_nothing` because I thought the goal was to not log this as an error. But if you want it to actually log the error, then why add the check? The insert will already reliably throw the same error. Maybe rather use `.map_err()`?
> `on_conflict_do_nothing()` doesn't work
In pure SQL, if you do `ON CONFLICT DO NOTHING RETURNING id`, it returns a row if the insert succeeded and no row if it already existed. That should translate to `.on_conflict_do_nothing().returning(received_activity::id).get_result(conn).optional()`, though I haven't tried it.
> I suggested `on_conflict_do_nothing` because I thought the goal was to not log this as an error. But if you want it to actually log the error, then why add the check? The insert will already reliably throw the same error. Maybe rather use `.map_err()`?
The error needs to be returned on the Rust side so that it can stop processing of the duplicate activity. However, it should not go to the Postgres logs, which are currently full of these lines:
```
postgres_1 | 2023-07-13 10:17:23.039 UTC [32101] STATEMENT: INSERT INTO "activity" ("data", "local", "updated", "ap_id", "sensitive") VALUES ($1, $2, DEFAULT, $3, $4) RETURNING "activity"."id", "activity"."data", "activity"."local", "activity"."published", "activity"."updated", "activity"."ap_id", "activity"."sensitive"
postgres_1 | 2023-07-13 10:17:30.145 UTC [32108] ERROR: duplicate key value violates unique constraint "idx_activity_ap_id"
postgres_1 | 2023-07-13 10:17:30.145 UTC [32108] DETAIL: Key (ap_id)=(https://lemmy.ml/activities/dislike/ff0f47ab-896b-4d52-bf49-60ad290bf5cb) already exists.
postgres_1 | 2023-07-13 10:17:30.145 UTC [32108] STATEMENT: INSERT INTO "activity" ("data", "local", "updated", "ap_id", "sensitive") VALUES ($1, $2, DEFAULT, $3, $4) RETURNING "activity"."id", "activity"."data", "activity"."local", "activity"."published", "activity"."updated", "activity"."ap_id", "activity"."sensitive"
postgres_1 | 2023-07-13 10:17:37.193 UTC [32069] ERROR: duplicate key value violates unique constraint "idx_activity_ap_id"
postgres_1 | 2023-07-13 10:17:37.193 UTC [32069] DETAIL: Key (ap_id)=(https://lemmy.ml/activities/dislike/a0077be3-5490-4764-b135-f95af9e57d9c) already exists.
```
Will try your suggestion.
The `returning()` clause is also used when the insert was successful, not only in case of a conflict. So it doesn't tell us whether the activity is a duplicate.
The returning line is only used if the insert was successful. If it wasn't, then it won't return anything. Check this example:
You can maybe add it before the `on_conflict_do_nothing` to make it less confusing.
Looks like returning works differently between Postgres and diesel then. At least I can't find any way to get an `Option` as the result.
Did you try `.optional()` after `.get_result()`, after importing https://docs.diesel.rs/master/diesel/result/trait.OptionalExtension.html? It's an extension trait for some reason.
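For reference, a minimal self-contained sketch of that combination (table, column and function names are assumptions, not the merged code; a synchronous connection is used for brevity). `.optional()` turns the missing RETURNING row into a `None`, so a duplicate can be detected in Rust without a duplicate-key error reaching the Postgres logs:

```rust
use diesel::prelude::*;
use diesel::result::OptionalExtension; // provides .optional()

// Assumed schema fragment; a unique index on ap_id is required for the
// ON CONFLICT clause to apply.
diesel::table! {
    received_activity (id) {
        id -> Int8,
        ap_id -> Text,
    }
}

// Returns Ok(true) if the activity was stored for the first time,
// Ok(false) if a row with the same ap_id already existed.
fn insert_received_activity(conn: &mut PgConnection, ap_id: &str) -> QueryResult<bool> {
    let inserted_id: Option<i64> = diesel::insert_into(received_activity::table)
        .values(received_activity::ap_id.eq(ap_id))
        .on_conflict_do_nothing()
        .returning(received_activity::id)
        .get_result(conn)
        .optional()?;
    Ok(inserted_id.is_some())
}
```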
Alright I got it working now and also added test cases. Ready to merge.