datacube-core icon indicating copy to clipboard operation
datacube-core copied to clipboard

Extremely Slow Adding New Products to Data Cube

Open omad opened this issue 7 years ago • 7 comments

Expected behaviour

I should be able to add a new product to a Data Cube index and not have to wait.

Actual behaviour

It's very slow (tens of minutes) to add each new product to the production DEA Database.

Steps to reproduce

As an admin on a large Data Cube Index, run:

datacube product add <product-defn.yaml>

Environment information

  • Which datacube --version are you using? Open Data Cube core, version 1.6rc2+108.g096a26d5

  • What datacube deployment/enviornment are you running against? Database Server: agdc-db.nci.org.au NCI Module: dea/20180815

omad avatar Sep 13 '18 05:09 omad

Version:       1.7+0.g98cf9ba3.dirty
Config files:  <ODC config>
Host:          <OWS-DEV-RDS>:5432
Database:     ows
User:          ows
Environment:   None
Index Driver:  default

Valid connection:       YES

Still present in DEA-dev environment. Adding new products is slow. Killing the datacube add process and retrying makes the operation instant.

whatnick avatar Dec 23 '19 01:12 whatnick

@whatnick what you describe simply skips index generation step making future queries really slow, particularly on larger databases.

Kirill888 avatar Dec 23 '19 01:12 Kirill888

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 08 '20 06:08 stale[bot]

Bump!!

whatnick avatar Sep 03 '20 04:09 whatnick

Here is a benchmark

time datacube product add https://raw.githubusercontent.com/GeoscienceAustralia/digitalearthau/develop/digitalearthau/config/products/cemp_insar_radarsat2_displacement.yaml
Adding "cemp_insar_radarsat2_displacement" (this might take a while) DONE

real    13m50.068s
user    0m1.393s
sys     0m0.254s

whatnick avatar Sep 03 '20 08:09 whatnick

In my case, adding a product to an already populated with $200\times10^6$ Datasets takes well over 20-30 minutes to complete. Exploring the running SQL statement in pgAdmin shows that Postgres is creating an INDEX for that product. The effects of this INDEX are impressive. If querying by dataset label takes about 1 min to return the unique result, querying by product and label takes only 2-3 sec!

sotosoul avatar May 12 '23 10:05 sotosoul

This is due to design decisions made with the current PostgreSQL database driver, where it creates multiple partial indexes when adding a new product, which scales in line with the number of indexed datasets.

This will be greatly improved, hopefully to the point of being a non issue with the new PostGIS driver coming in v1.9 or v2.

omad avatar May 18 '23 03:05 omad