pgstac

Add partition_update_enabled option

Open drnextgis opened this issue 1 year ago • 3 comments

We ingest images into the catalog daily and have noticed that the check_partition function significantly slows down ingestion. Since we know the temporal distribution of the data in advance, we can pre-create the necessary partitions with

SELECT check_partition(
    'collectionxxx',
    tstzrange('2023-10-01', '2023-11-01', '[)'),
    tstzrange('2023-10-01', '2023-11-01', '[)')
);
...

instead of checking them during every ingestion. This pull request adds an option to disable partition checking, which improves ingestion performance. If a required partition has not been created, the loader raises an exception with an appropriate message.
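As an illustration of the pre-creation step, here is a minimal Python sketch that generates one check_partition statement per calendar month. It assumes the collection is partitioned by month and that both range arguments are identical, as in the example above. The helper names (next_month, monthly_partition_statements) are hypothetical and not part of pgstac. The script only builds SQL text; run the statements with whatever client you already use.

```python
from datetime import date


def next_month(day: date) -> date:
    """Return the first day of the month following `day`."""
    if day.month == 12:
        return date(day.year + 1, 1, 1)
    return date(day.year, day.month + 1, 1)


def monthly_partition_statements(collection: str, start: date, end: date) -> list:
    """Build one check_partition call per calendar month overlapping [start, end).

    Assumption: monthly partitions, with the same half-open range passed for
    both range arguments, mirroring the example in this thread.
    """
    statements = []
    lower = date(start.year, start.month, 1)
    while lower < end:
        upper = next_month(lower)
        rng = f"tstzrange('{lower.isoformat()}', '{upper.isoformat()}', '[)')"
        # Plain string formatting is acceptable for trusted, hard-coded input only.
        statements.append(f"SELECT check_partition('{collection}', {rng}, {rng});")
        lower = upper
    return statements


if __name__ == "__main__":
    for stmt in monthly_partition_statements(
        "collectionxxx", date(2023, 10, 1), date(2024, 1, 1)
    ):
        print(stmt)
```

The collection name is interpolated directly into the SQL, so this is only suitable for names you control. With untrusted input, use parameterized queries instead.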

drnextgis avatar Oct 17 '23 20:10 drnextgis

Hey @drnextgis, just a heads-up that I have seen this. I was out of the office for a while. I'll be looking at this and a few other things in the check_partitions script this week.

bitner avatar Nov 06 '23 15:11 bitner

@drnextgis I've added an option to the pypgstac loader and a new command in #226 that I think should ameliorate the issue that you are having, while still making sure that you can go back and update all your constraints, indexes, and the statistics that are kept up to date by the check_partitions and update_partition_stats calls.

I've added an option, "--usequeue", and a command, pypgstac runqueue.

My thinking is that when you are doing a large data loading session, you can use the following workflow regardless of whether you are also running the query queue from a cron job.

pypgstac load items src/pgstac/tests/testdata/items1.ndjson --debug --usequeue
pypgstac load items src/pgstac/tests/testdata/items2.ndjson --debug --usequeue
pypgstac load items src/pgstac/tests/testdata/items3.ndjson --debug --usequeue
pypgstac runqueue --debug
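For a larger batch of files, the same sequence can be generated rather than typed by hand. The following is a rough Python sketch that uses only the commands and flags shown above; the helper name build_load_commands is hypothetical. It only assembles the command lines and does not run them.

```python
import shlex


def build_load_commands(item_files, debug=True):
    """Return argv lists: one queued load per file, then a single runqueue."""
    commands = []
    for path in item_files:
        cmd = ["pypgstac", "load", "items", path]
        if debug:
            cmd.append("--debug")
        # Queue the deferred maintenance queries instead of running them inline.
        cmd.append("--usequeue")
        commands.append(cmd)

    # Run the queued queries once, after every file has been loaded.
    runqueue = ["pypgstac", "runqueue"]
    if debug:
        runqueue.append("--debug")
    commands.append(runqueue)
    return commands


if __name__ == "__main__":
    files = [
        "src/pgstac/tests/testdata/items1.ndjson",
        "src/pgstac/tests/testdata/items2.ndjson",
        "src/pgstac/tests/testdata/items3.ndjson",
    ]
    for cmd in build_load_commands(files):
        print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list can be passed to subprocess.run(cmd, check=True), which stops the batch at the first failed load.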

This PR should be my last round of work before kicking off a 0.8.2 release, so would be curious for your review.

bitner avatar Nov 07 '23 22:11 bitner

@bitner I finally got around to testing your changes. The ability to enforce the query queue for loading is quite useful! However, it only partially resolves my initial concern, because it queues only the SELECT update_partition_stats query; the heavy SELECT check_partition still runs on every load. I believe the original PR still holds value.

drnextgis avatar Feb 05 '24 07:02 drnextgis