
Quickly Determine Extents of a Product

Open omad opened this issue 6 years ago • 12 comments

It is extremely slow to query the full available extents for a large Product in a Data Cube Index.

This has been frustrating for several developers, and has led to multiple implementations of precomputing and caching this information. Details in the code of each of the following projects:

This slowness also affects pre-flight job tasking.

A simple solution would be an extra database table containing extents and stats about each product, with an offline job which is run regularly to keep it up to date.

The easiest way to implement this would be to take the code from one of the above projects, and include it in datacube-core, along with a command to update the table.
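To make the proposal concrete, here is a minimal sketch of the summary-table idea using sqlite3 from the standard library. All table and column names (dataset, product_extent, etc.) are illustrative, not the actual datacube-core schema; the real implementation would target PostgreSQL and the ODC dataset tables.

```python
import sqlite3

# Hypothetical schema sketch: a per-product extents summary table,
# refreshed by an offline job, so extent queries become a single-row
# lookup instead of a scan over every dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    product TEXT, lat_min REAL, lat_max REAL,
    lon_min REAL, lon_max REAL, time_min TEXT, time_max TEXT
);
CREATE TABLE product_extent (
    product TEXT PRIMARY KEY, lat_min REAL, lat_max REAL,
    lon_min REAL, lon_max REAL, time_min TEXT, time_max TEXT
);
""")

conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("ls8_nbar", -35.0, -34.0, 148.0, 149.0, "2018-01-01", "2018-01-01"),
        ("ls8_nbar", -36.0, -35.5, 149.0, 150.0, "2018-02-01", "2018-02-01"),
    ],
)

def refresh_extents(conn):
    """The 'offline job': recompute per-product bounding box and time range."""
    conn.execute("DELETE FROM product_extent")
    conn.execute("""
        INSERT INTO product_extent
        SELECT product, MIN(lat_min), MAX(lat_max),
               MIN(lon_min), MAX(lon_max), MIN(time_min), MAX(time_max)
        FROM dataset GROUP BY product
    """)

refresh_extents(conn)
row = conn.execute("SELECT * FROM product_extent").fetchone()
print(row)
```

The proposed CLI command would essentially wrap refresh_extents() so it can be scheduled as a regular job.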

Outstanding Questions

  • Can the code from one of these projects be used almost directly, or will new code be required?
  • Should the new table store a simple bounding box, or use PostGIS powers to calculate a more accurate vector extent?
  • We don't have a robust migrations system for upgrading the database schema; it's handled by some bespoke code. Is it worth improving this at the same time, hacking in something which runs automatically, or adding a manually run command to enable this Extents extension?
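On the second question, a tiny self-contained illustration (plain Python, no PostGIS) of why a simple bounding box can badly overstate coverage compared to a true vector union of footprints. In PostGIS terms the cheap option corresponds roughly to ST_Extent and the accurate one to ST_Union of the dataset footprints; the numbers here are invented.

```python
# Footprints are axis-aligned tiles: (lon_min, lat_min, lon_max, lat_max).

def bbox_of(footprints):
    """Single bounding box enclosing all footprints (the cheap option)."""
    lon0 = min(f[0] for f in footprints)
    lat0 = min(f[1] for f in footprints)
    lon1 = max(f[2] for f in footprints)
    lat1 = max(f[3] for f in footprints)
    return (lon0, lat0, lon1, lat1)

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

# Two far-apart tiles, e.g. acquisitions over different regions.
tiles = [(0, 0, 1, 1), (99, 99, 100, 100)]

covered = sum(area(t) for t in tiles)  # true coverage: 2 sq. degrees
box = bbox_of(tiles)
print(area(box), covered)  # prints "10000 2": the bbox is a huge overestimate
```

For sparsely acquired products the gap between the two can be this dramatic, which is the argument for using PostGIS powers rather than a plain box.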

omad avatar Feb 26 '18 00:02 omad

Great initiative. Thank you! Happy to help and/or test.

My perspective as answers to your questions:

  1. Hopefully the code can be borrowed. Each is (presumably) an open-source / open-enough-licence ODC-related project, so legalities are probably not an issue. The main question is fitness for purpose, which will require an assessment of each implementation against the current ODC DB schema and interface code. I suspect this is the point of raising these questions in this issue.

  2. PostGIS would be the generic solution, although it may be satisfactory to provide an approximate region boundary. PostGIS will at least allow a non-square boundary, which I think is an appropriate starting point.

  3. Effort vs return. I'd be happy with a "manually run command to enable this Extents extension" in the first instance (presuming this is easier than a full-blown migration or refactor effort).

mpaget avatar Feb 26 '18 01:02 mpaget

Thanks Matt for any future assistance. The solution I am currently looking at involves an extra table holding extent summaries, though I did also look for a solution entirely within the database. There are two aspects that need performance improvement: the large size of the queries, and the fat-object issue.

For the first, we need to rely on concurrent execution. The parallelism involved is essentially map-reduce, but I am looking for a lightweight solution involving SQLAlchemy connection pools and thread/multiprocess pools. If we are to use datacube's higher-level interfaces, such as lazy loads, we have to de-linearise generator objects, which tend to linearise computation. To counter the large size of the results, we will need to chop the SQL queries into chunks along the time dimension.

This leads to the second problem: dealing with the fat objects. Ideally, we would shed the extra data within the database itself, which requires the lazy-load interface to be able to specify SQL functions. I still have to verify to what extent this can be done with lazy loads. Are there any other issues or approaches that I should know about or investigate?
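The chunk-then-reduce idea described above can be sketched as follows. This is a minimal illustration, assuming a 90-day chunk size and a stand-in query_chunk function; a real implementation would issue a per-window SQL query through an SQLAlchemy connection pool and merge all extent fields, not just time.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

def time_chunks(start, end, step):
    """Yield (chunk_start, chunk_end) pairs covering [start, end)."""
    while start < end:
        yield start, min(start + step, end)
        start = start + step

def query_chunk(chunk):
    # Stand-in for the map step: a real implementation would run a SQL
    # query restricted to this time window and return its extent summary.
    start, end = chunk
    return {"time_min": start, "time_max": end}

def merge(partials):
    """Reduce step: combine per-chunk results into one summary."""
    return {
        "time_min": min(p["time_min"] for p in partials),
        "time_max": max(p["time_max"] for p in partials),
    }

chunks = list(time_chunks(datetime(2017, 1, 1), datetime(2018, 1, 1),
                          timedelta(days=90)))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(query_chunk, chunks))

summary = merge(partials)
print(summary["time_min"], summary["time_max"])
```

Because each chunk is an independent query, the pool keeps several database connections busy at once, sidestepping the linearisation that a single generator-based load imposes.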

ashoka1234 avatar Mar 12 '18 03:03 ashoka1234

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 08 '20 06:08 stale[bot]

Is this functionality under development? I really miss a simple table with the spatial and temporal range the data covers. It could be updated automatically during the datacube dataset add step, since the dataset YAML already makes both the date and the extent clear.

clausmichele avatar Sep 10 '20 07:09 clausmichele

@clausmichele there is no plan to add this functionality within the 1.8 series; it's a 2.0 thing. The reason is simple: the database layer does not model time or space directly. There are just search fields within JSON blobs. It happens that there might be a key called time that one can extract from some JSON path, or keys called lat and lon, but it's all the same to the db layer (not quite true, since the indexes for the time, lon and lat keys are created in a special way). It is not viable to implement this efficiently with the current DB design.
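A rough illustration of the point above: the index stores each dataset's metadata as a JSON document, and "time"/"lat"/"lon" are merely search fields extracted from that blob via a configured path, not first-class columns the database could aggregate over. The document layout below is simplified and invented, not the real ODC metadata schema.

```python
dataset_doc = {
    "extent": {
        "from_dt": "2018-01-01T00:00:00",
        "to_dt": "2018-01-01T23:59:59",
    },
    "grid_spatial": {"lat": {"begin": -35.0, "end": -34.0}},
}

def extract(doc, path):
    """Follow a list of keys into a nested JSON document."""
    for key in path:
        doc = doc[key]
    return doc

# Each "search field" is only a path into the blob, configured per
# metadata type; the database layer itself has no model of space or time.
search_fields = {
    "time_begin": ["extent", "from_dt"],
    "lat_begin": ["grid_spatial", "lat", "begin"],
}

values = {name: extract(dataset_doc, path)
          for name, path in search_fields.items()}
print(values)
```

Computing a whole-product extent therefore means extracting and aggregating these values across every dataset document, which is exactly what is slow today.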

Kirill888 avatar Sep 10 '20 08:09 Kirill888

Though if you run datacube-explorer you can get the extents from its summary tables.

woodcockr avatar Sep 14 '20 07:09 woodcockr

Thanks for the suggestion; however, I'm currently unable to test it due to a library incompatibility in my environment. As soon as I manage to run datacube-explorer I'll see if it helps with my use case.

clausmichele avatar Sep 14 '20 14:09 clausmichele

@woodcockr I managed to run datacube-explorer. Where do I find the summary tables? And how do I update them when I add new datasets?

clausmichele avatar Sep 16 '20 09:09 clausmichele

I think @SpacemanPaul is working on this.

uchchwhash avatar Nov 07 '22 23:11 uchchwhash

@uchchwhash @SpacemanPaul Is this a thing now? If it's possible to obtain product extents directly from the datacube, I'd love to update dc.list_products() to include this info in its output table.

robbibt avatar Nov 08 '22 00:11 robbibt

I'm not aware of anything particularly solving this yet, so I'm going to reopen.

omad avatar Nov 09 '22 01:11 omad

The ODCv2 work and the postgis index driver will make this a lot easier, but yes, still a work in progress.

SpacemanPaul avatar Feb 10 '23 04:02 SpacemanPaul