
Indexing of ASTER LP DAAC Datasets to datacube

Open ashoka1234 opened this issue 5 years ago • 13 comments

Reason for this PR

ASTER data consist of visible and near-infrared (VNIR) frequencies with three bands at 15-meter resolution, short-wave infrared (SWIR) frequencies with six bands at 30-meter resolution, and thermal infrared (TIR) wavelengths with five bands at 90-meter resolution. ASTER SWIR data acquired since April 2008 are not usable, showing saturation of values and severe striping. CRS definitions are not available directly from the datasets but can be computed from the information they contain. This PR attempts to compute CRS definitions and index ASTER datasets.
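As a rough illustration of the kind of computation involved (not the PR's actual code), a UTM CRS can be derived from a zone number and hemisphere; the metadata key names below are assumptions for illustration only:

```python
from osgeo import gdal, osr

def compute_crs(hdf_path):
    """Sketch: derive a WGS 84 / UTM CRS from dataset metadata.

    The metadata key names used here are illustrative assumptions;
    the actual ASTER L1T keys may differ.
    """
    ds = gdal.Open(hdf_path)
    meta = ds.GetMetadata()
    zone = int(meta['UTMZONENUMBER'])                      # assumed key name
    north = float(meta['NORTHBOUNDINGCOORDINATE']) >= 0.0  # assumed key name
    # WGS 84 / UTM is EPSG 326xx in the northern hemisphere, 327xx in the southern.
    srs = osr.SpatialReference()
    srs.ImportFromEPSG((32600 if north else 32700) + zone)
    return srs.ExportToWkt()
```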

Proposed Solution

  • Three products are derived from the datasets, corresponding to the VNIR, SWIR, and TIR sensors, so that each derived product has a single resolution, simplifying geotransform computations. The CRS definition is computed per original dataset, while the geotransform is computed per sensor, i.e. per datacube product.
  • Each indexed dataset points to a corresponding VRT file, which defines the bands of the product and references the original .hdf file.
  • The workflow is similar to the MODIS indexing scripts, with additional commands for VRT file generation. Thus there are separate commands for deriving and adding a product definition, generating a VRT file, and generating and indexing a dataset, each for a particular .hdf file and product. A separate bash script is provided to index multiple files.
  • The VRT references to .hdf files use absolute path names, but indexed datasets use relative path names to the VRT files, due to a GDAL/rasterio library restriction.
  • The mapping of bands to each product is defined by a global constant PRODUCTS (a sketch is given after this list).
  • The additional metadata to include or exclude are defined by means of common prefixes in a global constant EXTRA_METADATA_PREFIXES.
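A minimal sketch of what the PRODUCTS constant and per-sensor VRT generation might look like; the band names, subdataset matching, and output naming are illustrative assumptions, not the PR's actual code:

```python
from osgeo import gdal

# Illustrative band grouping per sensor; the real PRODUCTS constant may differ.
PRODUCTS = {
    'aster_l1t_vnir': ['ImageData1', 'ImageData2', 'ImageData3N'],
    'aster_l1t_swir': ['ImageData4', 'ImageData5', 'ImageData6',
                       'ImageData7', 'ImageData8', 'ImageData9'],
    'aster_l1t_tir': ['ImageData10', 'ImageData11', 'ImageData12',
                      'ImageData13', 'ImageData14'],
}

def generate_vrt(hdf_path, product):
    """Build a single-resolution VRT over one sensor's subdatasets."""
    ds = gdal.Open(hdf_path)
    # GetSubDatasets() returns (name, description) pairs for the HDF4 file.
    subdatasets = [name for name, _ in ds.GetSubDatasets()]
    bands = [sd for sd in subdatasets
             if any(sd.endswith(band) for band in PRODUCTS[product])]
    # separate=True places each input in its own band of the output VRT.
    gdal.BuildVRT(f'{product}.vrt', bands, separate=True)
```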

Tests

  • A datacube named aster_lpdaac was created for this work (use psql -h agdcdev-db.nci.org.au -d aster_lpdaac), and a number of datasets have been indexed for the three products: aster_l1t_vnir, aster_l1t_swir, and aster_l1t_tir.
  • Testing scripts are written (see the scripts_tests module); a basic form of the VRT schema was derived, and tests are provided for in-memory validation of VRT file content (a sketch of such a check follows).
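For illustration, an in-memory check of VRT content might look like the following self-contained sketch; the sample VRT string is illustrative, not an actual file from the PR:

```python
import xml.etree.ElementTree as ET

# Illustrative VRT snippet; real files reference ASTER .hdf subdatasets.
SAMPLE_VRT = """\
<VRTDataset rasterXSize="4200" rasterYSize="4200">
  <VRTRasterBand dataType="UInt8" band="1">
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/absolute/path/AST_L1T_example.hdf</SourceFilename>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>
"""

def test_vrt_content():
    root = ET.fromstring(SAMPLE_VRT)
    assert root.tag == 'VRTDataset'
    bands = root.findall('VRTRasterBand')
    assert len(bands) == 1
    # Each band must reference the original .hdf file by an absolute path.
    for band in bands:
        src = band.find('.//SourceFilename')
        assert src is not None and src.text.endswith('.hdf')
```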

ashoka1234 avatar Mar 22 '19 05:03 ashoka1234

Codecov Report

Merging #213 into develop will decrease coverage by 0.07%. The diff coverage is 33.33%.


@@             Coverage Diff             @@
##           develop     #213      +/-   ##
===========================================
- Coverage     66.8%   66.72%   -0.08%     
===========================================
  Files           42       42              
  Lines         3190     3189       -1     
===========================================
- Hits          2131     2128       -3     
- Misses        1059     1061       +2
Impacted Files                     Coverage Δ
digitalearthau/testing/plugin.py   80% <33.33%> (-7.1%) ⬇️

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 7425e06...f4a5dc7.

codecov[bot] avatar Mar 22 '19 05:03 codecov[bot]

@ashoka1234, @omad I have a few questions/comments:

  1. Do we really want 3 separate products?

    • What is the reason for grouping bands by resolution into separate products? I'm pretty sure that a single product with bands of different resolutions would work just fine.
  2. Why a custom indexing step instead of just writing a YAML file next to the VRT file and using datacube dataset add?

  3. Probably doesn't matter for a once-off tool like this, but still: I noticed you use a file path as an API argument, and as a result the same netcdf file is opened and closed many times. I think it would be more efficient, and also clearer from a data-flow perspective, if those methods operated on data loaded from the file once; in this case the tags and whatever other file metadata you need to generate the VRT and YAML docs. This would also make writing tests easier, since it is easier to mock a data structure than a whole netcdf file (see the sketch after this list).
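A sketch of the suggested shape, with hypothetical names (HdfInfo, load_hdf_info) invented for illustration:

```python
from dataclasses import dataclass
from osgeo import gdal

@dataclass
class HdfInfo:
    """Plain data extracted once from the HDF file (illustrative fields)."""
    path: str
    tags: dict
    subdatasets: list

def load_hdf_info(path: str) -> HdfInfo:
    """Open the file once and pull out everything the later steps need."""
    ds = gdal.Open(path)
    return HdfInfo(path=path,
                   tags=ds.GetMetadata(),
                   subdatasets=[name for name, _ in ds.GetSubDatasets()])

# Downstream steps operate on the plain structure, so tests can build an
# HdfInfo directly instead of mocking a whole file on disk.
def generate_vrt(info: HdfInfo, product: str) -> str: ...
def generate_yaml(info: HdfInfo, product: str) -> str: ...
```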

Kirill888 avatar Mar 24 '19 23:03 Kirill888

Thanks @Kirill888 for the review comments. Answers to the questions, respectively:

  1. The decision to have three separate products was driven by limitations I saw in datacube and VRT files regarding the specification of the geotransform. The reference given for computing the CRS adjusts the bounds used for the geotransform values by half a pixel, and this adjustment depends on the sensor's resolution. VRTRasterBand doesn't seem to have geotransform elements, so I had to specify the geotransform at the VRTDataset level, and I did not see a way to specify multiple VRTDatasets per VRT file. I also didn't see a way for datacube to specify geotransforms per measurement. If the geotransform is approximate enough without the resolution-dependent adjustment, or if VRT/datacube could specify a geotransform per band, then we could ideally use a single product (a sketch of the half-pixel adjustment follows this list).
  2. Hm, for this one I followed the style of the MODIS indexing script. I like that style for its simplicity, but do you see any additional benefits in having YAMLs?
  3. Yes, you are right on this one, and I had the same concern. I had earlier extracted the data and passed it to those functions, ensuring as far as possible that each file was read only once, but I felt the interface was messy and resorted to passing file paths. I mitigated multiple reads of the band names by means of functools.lru_cache. Some multiple file opens remain (I think multiple opens for subdatasets are unavoidable). There is an additional file open to extract metadata, but since this is done only once I thought that was OK. This is probably just my opinion, but I felt passing internal GDAL representations to functions was too low-level, so I tried to avoid it. I will have another look and see if we can simplify this.
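To make the resolution-dependent adjustment concrete, here is a minimal sketch, assuming the metadata supplies upper-left coordinates at pixel centres (an assumption; the convention in the actual reference may differ):

```python
def compute_geotransform(ul_x, ul_y, res):
    """Shift centre coordinates by half a pixel to obtain a GDAL geotransform,
    which references the outer corner of the top-left pixel. Because `res` is
    the sensor resolution (15 m, 30 m or 90 m), the result is necessarily
    per sensor, i.e. per datacube product.
    """
    return (ul_x - res / 2, res, 0.0,
            ul_y + res / 2, 0.0, -res)
```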

ashoka1234 avatar Mar 26 '19 22:03 ashoka1234

@ashoka1234

  1. VRT generation is a separate problem from product definition; a product can reference different VRT files for different bands. You are right that the dataset.transform property is confusing and suggests that all bands should cover the same area at the same resolution. In reality, we already have products whose bands have slightly different footprints and different resolutions. The dataset footprint is supposed to be the union of the individual band footprints, not "THE footprint for every band in the Dataset".

  2. Having YAML files written out next to the VRTs would simplify the indexing step. Users will want to index this product into their own custom databases, and "the more standard" way of just calling datacube dataset add would be preferable to learning how to run this custom tool.

Kirill888 avatar Mar 26 '19 23:03 Kirill888

  1. I hadn't realised we had products with different-resolution bands. Since we do, that is definitely what we should use for these ASTER products.

  2. We had originally specified not to write out YAMLs. However, in this case, since we have to write out and manage VRTs, we should also write out YAML files.

omad avatar Mar 27 '19 04:03 omad

Scanning about 100 ASTER dataset directories, the statistics are as follows:

  • Total datasets: 2505
  • vnir bands present: 1434
  • swir bands present: 682
  • tir bands present: 2505

It seems tir is present in all datasets checked. If this behaviour generalises to the whole collection, then tir could potentially provide the minimal set of measurements to include in the product definition. But I personally feel the product definition should provide maximal measurement info.

ashoka1234 avatar Mar 27 '19 06:03 ashoka1234

@ashoka1234 that's another datacube limitation.

I think the sanest approach is to have exactly the same set of bands across both product and dataset definitions.

  • Having extra bands in a dataset (those not defined in the product) is completely useless, as they won't be loadable by dc.load.

  • Listing all possible bands in a product definition and only including available bands in a given dataset kinda makes sense, but is not supported by datacube. First of all, the indexing step will reject such a dataset. Secondly, there is probably a lot of code in datacube that assumes any band name listed in a product definition will also appear in every dataset of that product.

I see two possible workarounds here:

  1. (this is my preferred choice) We generate a VRT that replaces missing bands with whatever construct is available in the VRT toolkit; the bands section of the dataset yaml is then straightforward, and there is a consistent mapping from band name to band index within the VRT.

  2. The yaml includes all bands but has a "broken path" for those bands that are missing. That way you can index and load data so long as you use the ignore_missing option.

The advantage of option (1) is that it is simple to verify the correctness of the yaml, as you don't need to keep track of which netcdf variable ended up with which band index in the VRT.

Option (2) is basically a workaround for datacube not having a way to indicate that a particular band is missing from a particular dataset.
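To illustrate option (1): as far as I know, GDAL reads VRT regions not covered by any source as the band's nodata value when one is declared, so a source-less VRTRasterBand can stand in for a missing band. A hedged sketch of emitting such a band (data type and nodata value are illustrative):

```python
def fill_band_xml(band_index: int, nodata: int = 0) -> str:
    # A VRTRasterBand with no <SimpleSource>: reads return the declared
    # NoDataValue, standing in for a band missing from this dataset.
    return (f'<VRTRasterBand dataType="UInt8" band="{band_index}">'
            f'<NoDataValue>{nodata}</NoDataValue>'
            f'</VRTRasterBand>')
```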

Kirill888 avatar Mar 27 '19 22:03 Kirill888

@Kirill888 As per the discussion we had, we will retain three products corresponding to the three sensors, since these sensors can be switched on and off, and depending on that, the corresponding bands may or may not be present in a dataset. I will also make sure the band mappings are consistent.

ashoka1234 avatar Mar 27 '19 23:03 ashoka1234

As for the consistency of band mappings, the code currently relies on the consistent ordering of subdatasets returned by the corresponding GDAL query. All the dataset files I checked seemed to have the same order, but maybe we should not rely on this GDAL behaviour.
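One way to avoid depending on that ordering is to select subdatasets by name; a sketch, where the assumption is that each subdataset name ends with its band identifier:

```python
from osgeo import gdal

def subdataset_for_band(hdf_path: str, band_name: str) -> str:
    """Find a subdataset by band name instead of relying on list order."""
    ds = gdal.Open(hdf_path)
    for name, _description in ds.GetSubDatasets():
        # Assumes names end with the band identifier, e.g. ...:ImageData1
        if name.endswith(band_name):
            return name
    raise KeyError(f'{band_name} not found in {hdf_path}')
```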

ashoka1234 avatar Mar 28 '19 00:03 ashoka1234

Yes, for consistency I sort the bands defined in the global constant PRODUCTS and assign an integer value, as a string, in that sorted order.
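In other words, something along these lines (a sketch; the band list is illustrative):

```python
# Deterministic band numbering: sort the band names defined for a product
# and assign "1", "2", ... in that sorted order.
vnir_bands = ['ImageData1', 'ImageData2', 'ImageData3N']  # illustrative
band_index = {name: str(i)
              for i, name in enumerate(sorted(vnir_bands), start=1)}
# {'ImageData1': '1', 'ImageData2': '2', 'ImageData3N': '3'}
```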

ashoka1234 avatar May 01 '19 07:05 ashoka1234

Looking pretty good. I don't think the tests are being run though. I can see that test_index_modis_lpdaac is run, but nothing for aster.

(screenshot of CI test output)

Try moving the modis test script into scripts_tests/ and adding that path at the end of check-code.sh, where pytest is run.

omad avatar May 02 '19 03:05 omad

OK sure will do! I thought it was unnecessary for scripts tests to run on Travis!

ashoka1234 avatar May 02 '19 04:05 ashoka1234

NP. I think that if we've gone to the effort of writing tests, we might as well do a bit more and have them run all the time in case anything breaks. :)

omad avatar May 02 '19 04:05 omad