
Long transaction in PostgreSQL when marking large number of splits for deletion

Open earlbread opened this issue 3 months ago • 5 comments

When marking a large number of splits for deletion, the PostgreSQL metastore runs long transactions that cause database lock contention and performance degradation.

The mark_splits_for_deletion operation processes all splits in a single transaction without batching, which can lock hundreds of thousands of rows simultaneously for the duration of the transaction.

    // retention_policy_execution.rs
    let mark_splits_for_deletion_request =
        MarkSplitsForDeletionRequest::new(index_uid, expired_split_ids);
    ctx.protect_future(metastore.mark_splits_for_deletion(mark_splits_for_deletion_request))
        .await?;
    2025-09-30T00:00:02.393Z  INFO quickwit_janitor::retention_policy_execution: Marking 245742 splits for deletion based on retention policy. index_id=log.common.application_log_v1_quickwit split_ids=["01K3W82WGQVVPQBX67JPA51715", "01K3YMY9VYP55QZPX01VHNXJBE", "01K3W6G2FZCEPD6QBM6KHEHE2X", "01K3WDBKTNC3K09KPNV9SVCM4M", "01K3W82VYEN7Q96WFEK6RE5P3Z", and 245737 more]
    2025-09-30T00:00:34.807Z ERROR quickwit_janitor::actors::retention_policy_executor: Failed to execute the retention policy on the index. index_id=log.common.application_log_v1_quickwit error=request timed out: client

This problem occurs periodically in the morning hours when retention policies are evaluated and large numbers of splits need to be marked for deletion.

To avoid long-running transactions and database lock contention, these operations should be processed in smaller batches.
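A minimal sketch of what such batching could look like. This is not Quickwit's actual implementation: `BATCH_SIZE` and `batch_split_ids` are illustrative assumptions; the commented-out loop shows where each chunk would be sent to the metastore in its own transaction.

```rust
/// Illustrative batch size; the real value would need tuning. (assumption)
const BATCH_SIZE: usize = 1_000;

/// Split a large list of split IDs into fixed-size chunks so that each
/// chunk can be marked for deletion in its own, short transaction.
fn batch_split_ids(split_ids: Vec<String>) -> Vec<Vec<String>> {
    split_ids
        .chunks(BATCH_SIZE)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // 245,742 expired splits, as in the log above.
    let split_ids: Vec<String> = (0..245_742).map(|i| format!("split-{i}")).collect();
    let batches = batch_split_ids(split_ids);

    // Each batch would then be marked in a separate metastore call, e.g.:
    // for batch in &batches {
    //     let request = MarkSplitsForDeletionRequest::new(index_uid.clone(), batch.clone());
    //     ctx.protect_future(metastore.mark_splits_for_deletion(request)).await?;
    // }

    println!("{} batches", batches.len()); // → 246 batches
    assert!(batches.iter().all(|b| b.len() <= BATCH_SIZE));
}
```

Smaller transactions hold row locks for less time, so concurrent metastore operations are far less likely to block or time out, at the cost of the retention run no longer being atomic.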

earlbread avatar Sep 30 '25 05:09 earlbread

As a short-term fix, you could change your retention policy evaluation schedule to hourly, so fewer splits expire between runs: docs.

guilload avatar Sep 30 '25 08:09 guilload

That would definitely help reduce the load! Thank you!

earlbread avatar Sep 30 '25 09:09 earlbread

There already is an index on (index_uid, split_id) on the splits table:

Indexes:
    "splits_pkey" PRIMARY KEY, btree (index_uid, split_id)
    [...]

but I can attempt to optimize this query. If I give you access to a custom image, would you be able to test it for me? I currently don't have access to an environment with that many splits.

Also, can you paste the output of `EXPLAIN ...` for that query?

guilload avatar Sep 30 '25 09:09 guilload

@guilload Since this is a production environment, it is difficult to change the image and test it right away.

Below is the output of the EXPLAIN query. Note that it was run with only a handful of split_ids, so the plan at production scale may differ.

    Aggregate  (cost=52.14..52.15 rows=1 width=48)
      CTE input_splits
        ->  Nested Loop Left Join  (cost=0.55..8.37 rows=5 width=42)
              Join Filter: (input_splits_1.split_id = (splits.split_id)::text)
              ->  Function Scan on unnest input_splits_1  (cost=0.00..0.05 rows=5 width=32)
              ->  Materialize  (cost=0.55..8.24 rows=1 width=37)
                    ->  Subquery Scan on splits  (cost=0.55..8.24 rows=1 width=37)
                          ->  LockRows  (cost=0.55..8.23 rows=1 width=43)
                                ->  Index Scan using ix_splits_indexuid_deleteopstamp on splits splits_1  (cost=0.55..8.22 rows=1 width=43)
                                      Index Cond: ((index_uid)::text = 'log.common.application_log_v1_quickwit:01K180MYSKWTD0ZEANKNFRBMVD'::text)
                                      Filter: ((split_id)::text = ANY ('{01K3W82WGQVVPQBX67JPA51715,01K3YMY9VYP55QZPX01VHNXJBE,01K3W6G2FZCEPD6QBM6KHEHE2X,01K3WDBKTNC3K09KPNV9SVCM4M,01K3W82VYEN7Q96WFEK6RE5P3Z}'::text[]))
      CTE marked_splits
        ->  Update on splits splits_2  (cost=0.68..43.62 rows=3 width=1112)
              ->  Nested Loop  (cost=0.68..43.62 rows=3 width=1112)
                    ->  CTE Scan on input_splits input_splits_2  (cost=0.00..0.10 rows=5 width=88)
                    ->  Index Scan using splits_pkey on splits splits_2  (cost=0.68..8.70 rows=1 width=970)
                          Index Cond: (((index_uid)::text = 'log.common.application_log_v1_quickwit:01K180MYSKWTD0ZEANKNFRBMVD'::text) AND ((split_id)::text = input_splits_2.split_id))
                          Filter: ((split_state)::text = ANY ('{Staged,Published}'::text[]))
      ->  CTE Scan on input_splits  (cost=0.00..0.10 rows=5 width=110)

earlbread avatar Sep 30 '25 13:09 earlbread

Super useful, thank you. The first part of the query (the input_splits CTE) filters on split_id instead of using the primary-key index, so... that's no bueno :)

guilload avatar Sep 30 '25 14:09 guilload