Long transactions in PostgreSQL when marking a large number of splits for deletion
When marking a large number of splits for deletion, the metastore (PostgreSQL) experiences long-running transactions that cause database lock contention and performance degradation.
The mark_splits_for_deletion operation processes all splits in a single transaction without batching, which can lock thousands of rows simultaneously.
// retention_policy_execution.rs
let mark_splits_for_deletion_request =
    MarkSplitsForDeletionRequest::new(index_uid, expired_split_ids);
ctx.protect_future(metastore.mark_splits_for_deletion(mark_splits_for_deletion_request))
    .await?;
2025-09-30T00:00:02.393Z INFO quickwit_janitor::retention_policy_execution: Marking 245742 splits for deletion based on retention policy. index_id=log.common.application_log_v1_quickwit split_ids=["01K3W82WGQVVPQBX67JPA51715", "01K3YMY9VYP55QZPX01VHNXJBE", "01K3W6G2FZCEPD6QBM6KHEHE2X", "01K3WDBKTNC3K09KPNV9SVCM4M", "01K3W82VYEN7Q96WFEK6RE5P3Z", and 245737 more]
2025-09-30T00:00:34.807Z ERROR quickwit_janitor::actors::retention_policy_executor: Failed to execute the retention policy on the index. index_id=log.common.application_log_v1_quickwit error=request timed out: client
This problem occurs periodically in the morning hours when retention policies are evaluated and large numbers of splits need to be marked for deletion.
To avoid long-running transactions and database lock contention, these operations should be processed in smaller batches.
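For illustration, here is a minimal sketch of what batching could look like on the janitor side, reusing the request type shown above. The MARK_SPLITS_BATCH_SIZE constant and its value are assumptions made for the sketch, not part of the current code:

// Hypothetical batching sketch (illustrative, not the current implementation):
// chunk the expired split IDs and issue one mark_splits_for_deletion request
// per chunk, so each PostgreSQL transaction touches a bounded number of rows.
const MARK_SPLITS_BATCH_SIZE: usize = 1_000; // assumed batch size

for split_ids_batch in expired_split_ids.chunks(MARK_SPLITS_BATCH_SIZE) {
    let mark_splits_for_deletion_request =
        MarkSplitsForDeletionRequest::new(index_uid.clone(), split_ids_batch.to_vec());
    ctx.protect_future(metastore.mark_splits_for_deletion(mark_splits_for_deletion_request))
        .await?;
}

Each chunk would then be marked in its own, much shorter metastore transaction.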
As a short-term fix, you could change your retention policy evaluation schedule to hourly: docs.
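For reference, a sketch of how that could look in the index config, assuming the retention section accepts a schedule field as described in the linked docs (values are illustrative):

# index config sketch (illustrative values; see the docs above for the exact syntax)
retention:
  period: 30 days    # illustrative retention period
  schedule: hourly   # evaluate the retention policy every hour so fewer splits expire per run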
That would definitely help reduce the load! Thank you!
There already is an index on (index_uid, split_id) on the splits table:
Indexes:
"splits_pkey" PRIMARY KEY, btree (index_uid, split_id)
[...]
but I can attempt to optimize this query. If I give you access to a custom image, would you be able to test it for me? I currently don't have access to an environment with that many splits.
Also, can you paste the output of EXPLAIN... for that query?
@guilload Since this is a production environment, it is difficult to change the image and test it right away.
Below is the output of the EXPLAIN query, but since the number of split_ids is small, the plan might differ from the production case.
Aggregate  (cost=52.14..52.15 rows=1 width=48)
  CTE input_splits
    ->  Nested Loop Left Join  (cost=0.55..8.37 rows=5 width=42)
          Join Filter: (input_splits_1.split_id = (splits.split_id)::text)
          ->  Function Scan on unnest input_splits_1  (cost=0.00..0.05 rows=5 width=32)
          ->  Materialize  (cost=0.55..8.24 rows=1 width=37)
                ->  Subquery Scan on splits  (cost=0.55..8.24 rows=1 width=37)
                      ->  LockRows  (cost=0.55..8.23 rows=1 width=43)
                            ->  Index Scan using ix_splits_indexuid_deleteopstamp on splits splits_1  (cost=0.55..8.22 rows=1 width=43)
                                  Index Cond: ((index_uid)::text = 'log.common.application_log_v1_quickwit:01K180MYSKWTD0ZEANKNFRBMVD'::text)
                                  Filter: ((split_id)::text = ANY ('{01K3W82WGQVVPQBX67JPA51715,01K3YMY9VYP55QZPX01VHNXJBE,01K3W6G2FZCEPD6QBM6KHEHE2X,01K3WDBKTNC3K09KPNV9SVCM4M,01K3W82VYEN7Q96WFEK6RE5P3Z}'::text[]))
  CTE marked_splits
    ->  Update on splits splits_2  (cost=0.68..43.62 rows=3 width=1112)
          ->  Nested Loop  (cost=0.68..43.62 rows=3 width=1112)
                ->  CTE Scan on input_splits input_splits_2  (cost=0.00..0.10 rows=5 width=88)
                ->  Index Scan using splits_pkey on splits splits_2  (cost=0.68..8.70 rows=1 width=970)
                      Index Cond: (((index_uid)::text = 'log.common.application_log_v1_quickwit:01K180MYSKWTD0ZEANKNFRBMVD'::text) AND ((split_id)::text = input_splits_2.split_id))
                      Filter: ((split_state)::text = ANY ('{Staged,Published}'::text[]))
  ->  CTE Scan on input_splits  (cost=0.00..0.10 rows=5 width=110)
Super useful, thank you. The first part of the query (the input_splits CTE) does not use the (index_uid, split_id) primary key index, so... that's no bueno :)