
Optimizing query performance during the ingestion process

drnextgis opened this issue • 3 comments

I noticed that this query takes a long time during the ingestion process:

SELECT
    tstzrange(min(datetime), max(datetime),'[]'),
    tstzrange(min(end_datetime), max(end_datetime), '[]')
FROM _items_1

If I'm not mistaken, it originates from here.

As we can see from the output of the EXPLAIN statement, it performs a sequential scan:

postgis=> EXPLAIN SELECT                                                             
    tstzrange(min(datetime), max(datetime),'[]'),
    tstzrange(min(end_datetime), max(end_datetime), '[]')
FROM _items_1;
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1445345.11..1445345.12 rows=1 width=64)
   ->  Gather  (cost=1445344.88..1445345.09 rows=2 width=32)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=1444344.88..1444344.89 rows=1 width=32)
               ->  Parallel Seq Scan on _items_1  (cost=0.00..1399274.94 rows=4506994 width=16)
(5 rows)

My initial assumption was that the end_datetime property is not properly indexed:

postgis=> EXPLAIN SELECT tstzrange(min(datetime), max(datetime), '[]') FROM _items_1;
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.25..1.26 rows=1 width=32)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.62 rows=1 width=8)
           ->  Index Only Scan Backward using _items_1_datetime_end_datetime_idx on _items_1  (cost=0.43..2036102.77 rows=10816785 width=8)
                 Index Cond: (datetime IS NOT NULL)
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.43..0.62 rows=1 width=8)
           ->  Index Only Scan using _items_1_datetime_end_datetime_idx on _items_1 _items_1_1  (cost=0.43..2036102.77 rows=10816785 width=8)
                 Index Cond: (datetime IS NOT NULL)
(9 rows)
postgis=> EXPLAIN SELECT tstzrange(min(end_datetime), max(end_datetime), '[]') FROM _items_1;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1422810.13..1422810.14 rows=1 width=32)
   ->  Gather  (cost=1422809.91..1422810.12 rows=2 width=16)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=1421809.91..1421809.92 rows=1 width=16)
               ->  Parallel Seq Scan on _items_1  (cost=0.00..1399274.94 rows=4506994 width=8)
(5 rows)

However, I later discovered that PgSTAC uses a multicolumn index on datetime and end_datetime here. I believe this was done intentionally. @bitner, do you think it would make sense to build an individual index for end_datetime?
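
For reference, the existing multicolumn index is defined along these lines (a sketch inferred from the index name in the plan above; the exact name and sort order in the pgstac source may differ):

CREATE INDEX "datetime_end_datetime_idx" ON items USING BTREE(datetime, end_datetime);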

Based on what I observe, adding an individual index like the one below lets us avoid the sequential scan:

CREATE INDEX "end_datetime_idx" ON items USING BTREE(end_datetime ASC);

drnextgis · Oct 14 '23 20:10

We definitely sacrifice some ingest performance for the sake of read performance.

In Postgres, adding more indexes is not always a win: every extra index adds write overhead, since it must be updated on each insert/update, and it increases contention over which indexes stay in memory rather than having to be read from disk.

The multicolumn index on (datetime, end_datetime) is optimal, and better than two separate indexes, for any search that uses the datetime filter. It works well for anything that needs an index on the datetime column alone or on a mix of datetime/end_datetime, but it does not work well for queries solely on end_datetime, which essentially never happens via the search endpoint.
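
To illustrate (hypothetical queries against the same _items_1 table; a B-tree index is only used efficiently when the predicate constrains its leading column):

-- constrains the leading column, so the (datetime, end_datetime) index applies
EXPLAIN SELECT * FROM _items_1 WHERE datetime >= '2023-01-01' AND datetime < '2023-02-01';

-- constrains only the second column, so this falls back to a sequential scan
EXPLAIN SELECT * FROM _items_1 WHERE end_datetime < '2020-01-01';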

This index case really only comes up when updating partition statistics, which is generally done as part of an ingest process. The other option would be to defer the statistics and indexing work by using the query queue. As I get things ready for the workshop I'm giving next week, I'm going to be beefing up the docs for the query queue, but basically it allows deferring certain calls (like updating statistics) to a job that gets run from a cron rather than within the transaction itself.

bitner · Oct 16 '23 14:10

Thank you for sharing this information! Based on what I've observed, it seems that the loader is already utilizing the update_partition_stats_q function here. To run it asynchronously, it appears that I need to install pg_cron and then execute something similar to the code below:

postgis=> SET pgstac.use_queue = 't';
SET

postgis=> SELECT * from query_queue;
 query | added 
-------+-------
(0 rows)

-- this is what happens in 'load_partition'
postgis=> SELECT update_partition_stats_q('_items_1');
 update_partition_stats_q 
--------------------------
 
(1 row)

-- checking that the query is in the queue
postgis=> SELECT * from query_queue;
                      query                      |             added             
-------------------------------------------------+-------------------------------
 SELECT update_partition_stats('_items_1', 'f'); | 2023-10-16 16:09:19.585958+00
(1 row)


postgis=> SELECT cron.schedule('0 0 * * *', 'CALL run_queued_queries();');

Is my understanding correct?

drnextgis · Oct 16 '23 16:10

Yep, that is correct.

I would note that, as with all settings, if you set it using SET pgstac.use_queue = 't';, only queries that were queued during that session will be deferred. In order to always use the queue, you would need to set it at the role or database level, i.e. ALTER DATABASE pgstac SET pgstac.use_queue = 't';, or use the pgstac_settings table (pgstac will preferentially use the pgstac.* settings over anything in the pgstac_settings table, but not all DBaaS providers allow the use of arbitrary settings).
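
For example (a sketch; the database name and the pgstac_settings columns, assumed here to be name and value, should be checked against your deployment):

-- persist the setting for all new sessions on the database
ALTER DATABASE pgstac SET pgstac.use_queue = 't';

-- or, where custom settings are not allowed, store it in the settings table
INSERT INTO pgstac_settings (name, value) VALUES ('use_queue', 'true')
ON CONFLICT (name) DO UPDATE SET value = EXCLUDED.value;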

You can also run anything that is in the queue manually using either SELECT run_queued_queries_intransaction(); or CALL run_queued_queries();. The difference is that the "function" variant, run_queued_queries_intransaction, will run all the queued queries within a single transaction (and if any of them errors, all of them will be rolled back), whereas the "procedure" variant will commit each query separately as it goes through them. Procedures are slightly different in Postgres in that they can have transactional control, but they must be invoked with CALL rather than SELECT.
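
Side by side (both calls are the ones named above; only the invocation style and commit behavior differ):

-- function: all queued queries run in one transaction; any error rolls back all of them
SELECT run_queued_queries_intransaction();

-- procedure: commits after each queued query; must be invoked with CALL
CALL run_queued_queries();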

If you didn't want to use cron, another option, if you were doing say a large batch of ingests, would be to set use_queue before doing the ingests, ingest the data, run CALL run_queued_queries();, and then reset use_queue to false.
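
As a sketch (note that SET is session-local, so this only affects ingests run in the same session):

SET pgstac.use_queue = 't';   -- defer stats/index maintenance
-- ... run the batch of ingests here ...
CALL run_queued_queries();    -- process everything that was queued
SET pgstac.use_queue = 'f';   -- restore inline behavior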

bitner · Nov 07 '23 19:11