go
go copied to clipboard
horizon: `history_claimable_balances` is not cleared out by the reaper.
We should:
- Make sure the reaper takes care of
history_claimable_balances
- Make sure all the other history tables are included.
- Add a test that ensure all the
history_
tables in the DB are cleared out (in order prevent this problem from happening to tables added in the future).
We have three tables which aren't cleaned up when deleting a range. Those are:
-
history_claimable_balances
-
history_liquidity_pools
-
history_accounts
-
history_assets
They all pair an entity (claimable balance, liquidity pool, account) with an internal id:
# \d history_accounts
id | bigint | | not null | nextval('history_accounts_id_seq'::regclass)
address | character varying(64) | | |
# \d history_liquidity_pools
id | bigint | | not null | nextval('history_liquidity_pools_id_seq'::regclass)
liquidity_pool_id | text | | not null |
# \d history_assets
id | integer | | not null | nextval('history_assets_id_seq'::regclass)
asset_type | character varying(64) | | not null |
asset_code | character varying(12) | | not null |
asset_issuer | character varying(56) | | not null |
# \d history_claimable_balances
id | bigint | | not null | nextval('history_claimable_balances_id_seq'::regclass)
claimable_balance_id | text | | not null |
In theory, every time we clear out a range of data (either for reingesting or for reaping) we should also remove any orphan entries in the tables above (an orphan entry is one whose internal id isn't used anymore, e.g. for history_claimable_balances
, that would be all entries which cannot be joined with history_operation_claimable_balances
or history_transaction_claimable_balances
).
However, the queries required to do so may be too expensive. As I see it we could either:
- clear out any orphan entries (by figuring out which ids are not matched in any other tables)
- before clearing out the other tables, figure out which ids are only used the in the ledger range to clear out (and would end up being orphans after the fact)
I think both (1) and (2) may be too expensive for large enough history_*
tables
maybe @sydneynotthecity may have a better querying suggestion?
Do you know how expensive of a query cost is too expensive?
Another option is that we could create a trigger that tracked any record deletions in the history_*
tables and logged them to an events logging table. We could set it up in a way where the deleted id was logged to the events table. Then, we could delete the orphaned entries based on a join to this events table, which should be significantly cheaper than joining on the original tables.
Once the records were cleared out, we could either update an indicator in the event tables for is_deleted = TRUE
or just wipe the records from the events table so that the table only retains orphaned records to be deleted.
@2opremio the other point is the history_transaction_claimable_balances
and history_operation_claimable_balances
were missing indices for claimable_balance_id
which is what was causing joins/searches on claimable_balance_id
to be super slow. @sreuland added the indices as a test in #4455 and it looks like it has greatly improved query performance.
It appears that the equivalent liquidity_pools
tables are also missing indices on liquidity_pool_id
only. My guess is that if we added those as well, we should see enough query performance improvement to be able to identify orphaned records by query, like you originally proposed.
let's wait for the indices to be added and retake this after that.
https://github.com/stellar/go/pull/4518 clears orphans from history_claimable_balances
and history_liquidity_pools
. The last two remaining tables (history_accounts
and history_assets
) require indexes in other tables like horizon_operation_participants
. @ire-and-curses agreed that new indexes should be add in one of the future releases after further tests. EDIT: I was able to actually use existing indexes to reap history_accounts
in https://github.com/stellar/go/pull/4525. The remaining one is history_assets
which require an index (even multicolumn index would work) in tables where it's used.