pgstac
Add partition_update_enabled option
We ingest images into the catalog daily and have noticed that the check_partition function significantly slows down ingestion. Since we know the temporal distribution of the data in advance, we can pre-create the necessary partitions with
SELECT check_partition(
'collectionxxx',
tstzrange('2023-10-01', '2023-11-01', '[)'),
tstzrange('2023-10-01', '2023-11-01', '[)')
);
...
instead of checking them during every ingestion. This pull request adds an option to disable partition checking, which improves ingestion performance. If a required partition has not been created, the loader raises an exception with an appropriate message.
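For illustration, the pre-creation step could be scripted roughly as follows. This is only a sketch: it assumes monthly partitions and the three-argument check_partition call shown above, and month_ranges and precreate_statements are hypothetical helper names, not part of pgstac or pypgstac.

```python
from datetime import date


def month_ranges(start: date, months: int):
    """Yield (first_day, first_day_of_next_month) pairs for consecutive months."""
    year, month = start.year, start.month
    for _ in range(months):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield date(year, month, 1), date(next_year, next_month, 1)
        year, month = next_year, next_month


def precreate_statements(collection: str, start: date, months: int):
    """Build one check_partition call per month, mirroring the call above.

    Values are interpolated directly into the SQL text for readability;
    this is an illustration, so only use trusted collection ids.
    """
    statements = []
    for lower, upper in month_ranges(start, months):
        rng = f"tstzrange('{lower.isoformat()}', '{upper.isoformat()}', '[)')"
        statements.append(f"SELECT check_partition('{collection}', {rng}, {rng});")
    return statements


if __name__ == "__main__":
    # Print the statements so they can be reviewed before being run with psql.
    for statement in precreate_statements("collectionxxx", date(2023, 10, 1), 3):
        print(statement)
```

The generated statements can be run once ahead of time, so that ingestion with partition checking disabled finds every partition it needs.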
Hey @drnextgis, just a heads-up that I have seen this. I was out of the office for a while. I'll be looking at this and a few other things in the check_partitions script this week.
@drnextgis I've added an option to the pypgstac loader and a new command in #226 that I think should ameliorate the issue that you are having, while still making sure that you can go back and update all your constraints, indexes, and the statistics that are kept up to date by the check_partitions and update_partition_stats calls.
I've added an option, --usequeue, and a command, pypgstac runqueue.
My thinking is that during a large data loading session you can use the following workflow, regardless of whether you are also running the query queue from a cron job:
pypgstac load items src/pgstac/tests/testdata/items1.ndjson --debug --usequeue
pypgstac load items src/pgstac/tests/testdata/items2.ndjson --debug --usequeue
pypgstac load items src/pgstac/tests/testdata/items3.ndjson --debug --usequeue
pypgstac runqueue --debug
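If the load step is driven from a script, the same workflow could be assembled along these lines. This is a minimal sketch that only builds and prints the command lines, using the flags exactly as shown above. queued_load_commands is a hypothetical helper name, and the database connection settings are assumed to come from the environment.

```python
import shlex


def queued_load_commands(paths):
    """Return one queued `pypgstac load items` command per file,
    followed by a single `pypgstac runqueue` to process the deferred work."""
    commands = [
        ["pypgstac", "load", "items", path, "--debug", "--usequeue"]
        for path in paths
    ]
    commands.append(["pypgstac", "runqueue", "--debug"])
    return commands


if __name__ == "__main__":
    files = [
        "src/pgstac/tests/testdata/items1.ndjson",
        "src/pgstac/tests/testdata/items2.ndjson",
        "src/pgstac/tests/testdata/items3.ndjson",
    ]
    # Print rather than execute; each list could instead be passed to
    # subprocess.run(cmd, check=True) to stop at the first failing load.
    for cmd in queued_load_commands(files):
        print(shlex.join(cmd))
```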
This PR should be my last round of work before kicking off a 0.8.2 release, so would be curious for your review.
@bitner finally got around to testing your changes. The ability to enforce the query queue for loading is quite useful! However, it only partially resolves my initial concern: only the SELECT update_partition_stats query is queued, while the heavy SELECT check_partition still runs. I believe the original PR still holds value.