digitalearthau
Indexing of ASTER LP DAAC Datasets to datacube
Reason for this PR
ASTER data consists of visible and near infrared (VNIR) frequencies with three bands at 15-meter resolution, short-wave infrared (SWIR) frequencies with six bands at 30-meter resolution, and thermal infrared (TIR) wavelengths with five bands at 90-meter resolution. ASTER SWIR data acquired since April 2008 are not usable, showing saturation of values and severe striping. CRS definitions are not available directly from the datasets but can be computed from the information the datasets do provide. This PR attempts to compute CRS definitions and index ASTER datasets.
Proposed Solution
- Three products are derived from the datasets, corresponding to the VNIR, SWIR, and TIR sensors, so that each derived product has a single resolution, simplifying geo-transform computations. The `CRS` definition is computed per original dataset, while the `GeoTransform` is computed per sensor, i.e. per datacube product.
- Each indexed dataset points to a corresponding `VRT` file, which defines the bands of the product by referring to the original `.hdf` file.
- The workflow is similar to the `modis` indexing scripts, with additional commands for VRT file generation. Thus there are separate commands for deriving and adding a product definition, generating a VRT file, and generating and indexing a dataset, all corresponding to a particular `.hdf` file and a particular product. A separate bash script is provided to index multiple files.
- The VRT references to `.hdf` files use absolute path names, but indexed datasets use relative path names to the VRT files. This is due to a `GDAL`/`rasterio` library restriction.
- The bands that correspond to each product are defined by a global constant `PRODUCTS`.
- The additional metadata to include or exclude are defined by means of common prefixes in a global constant `EXTRA_METADATA_PREFIXES`.
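To illustrate the per-sensor geo-transform computation described above, here is a minimal sketch. The half-pixel bounds adjustment follows the description later in this thread; the corner coordinates, the shift direction, and the `SENSOR_RESOLUTION` mapping are illustrative assumptions, not values from the actual code.

```python
# Hedged sketch: one GDAL-style GeoTransform per sensor (i.e. per product),
# since the half-pixel bounds adjustment depends on the sensor's resolution.

def geo_transform(ul_x, ul_y, pixel_size):
    """GeoTransform tuple for a north-up image.

    The upper-left corner is shifted outward by half a pixel, so the
    adjustment (and hence the transform) is resolution dependent.
    """
    return (ul_x - pixel_size / 2.0,  # x origin
            pixel_size,               # pixel width
            0.0,                      # row rotation
            ul_y + pixel_size / 2.0,  # y origin
            0.0,                      # column rotation
            -pixel_size)              # pixel height (negative: north-up)

# Illustrative resolutions from the PR description (metres per pixel):
SENSOR_RESOLUTION = {'vnir': 15.0, 'swir': 30.0, 'tir': 90.0}

transforms = {sensor: geo_transform(300000.0, 7200000.0, res)
              for sensor, res in SENSOR_RESOLUTION.items()}
```

Each product then carries a single transform matching its native resolution, which is what makes the three-product split convenient.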
Tests
- A datacube is created with the name `aster_lpdaac` (use `psql -h agdcdev-db.nci.org.au -d aster_lpdaac`) for this work, and a number of datasets are indexed corresponding to three products: `aster_l1t_vnir`, `aster_l1t_swir`, and `aster_l1t_tir`.
- Testing scripts are written (see the `scripts_tests` module); a basic form of the VRT schema was derived, and tests are provided for VRT file content validation (in memory).
Codecov Report
Merging #213 into develop will decrease coverage by 0.07%. The diff coverage is 33.33%.
@@ Coverage Diff @@
## develop #213 +/- ##
===========================================
- Coverage 66.8% 66.72% -0.08%
===========================================
Files 42 42
Lines 3190 3189 -1
===========================================
- Hits 2131 2128 -3
- Misses 1059 1061 +2
Impacted Files | Coverage Δ |
---|---|
digitalearthau/testing/plugin.py | 80% <33.33%> (-7.1%) :arrow_down: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7425e06...f4a5dc7. Read the comment docs.
@ashoka1234, @omad I have a few questions/comments:
- Do we really want 3 separate products? What is the reason for grouping bands by resolution into separate products? I'm pretty sure that a single product with bands of different resolutions would work just fine.
- Why a custom indexing step, instead of just writing a yaml file next to the vrt file and using `datacube dataset add`?
- Probably doesn't matter for a once-off tool like this, but still: I noticed you use a file path as an API argument, and as a result the same netcdf file is opened/closed many times. I think it would be more efficient, and also clearer from the data-flow perspective, if those methods operated on some data loaded from the file once: in this case the tags and whatever other file metadata you require to generate the VRT and YAML docs. This would also make writing tests easier, as it is easier to mock a data structure than a whole netcdf file.
Thanks @Kirill888 for the review comments. Answers to the questions respectively:
- The decision to have three separate products was driven by the limitations I saw with datacube and VRT files regarding the specification of the geo transform. The reference given for the computation of the CRS adjusts the bounds used for the geo-transform values by half of the pixel resolution, and this resolution is sensor dependent. `VRTRasterBand` elements don't seem to have `GeoTransform` elements, so I had to specify the geo transform at the `VRTDataset` level, and I did not see a way to specify multiple `VRTDataset`s per VRT file. I also didn't see datacube specify geo transforms per `measurement`. If the geo transform is approximate enough without the resolution-dependent adjustment, or if VRT/datacube could specify a geo transform per band, then we could ideally use a single product.
- Hmm, for this one I followed the style of the `modis` indexing script, and I agree with that style because it is simple, but do you see any additional benefits of having `yaml`s?
- Yes, you are right on this one, and I had this concern as well. I had earlier extracted the data passed to those functions, ensuring it was read once as much as possible, but I felt the interface was messy and resorted to passing file paths. I mitigated the multiple file reads of band names by means of `functools.lru_cache`. Some multiple file opens remain (I think multiple opens per `subdataset` are unavoidable). There is an additional file open to extract metadata, but since this is done only once I thought that was OK. This is probably just my opinion, but I felt passing internal GDAL representations to functions was too low-level, so I tried to avoid it. I will have another look and see whether we could simplify this.
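The `functools.lru_cache` mitigation mentioned above can be sketched like this. `band_names` is a hypothetical stand-in for the real metadata extraction, and `OPEN_COUNT` is instrumentation added only to show the cache working; neither name is from the PR.

```python
# Hedged sketch: cache per-file metadata keyed on the file path, so
# repeated calls with the same path do not re-open the file.
from functools import lru_cache

OPEN_COUNT = {'n': 0}  # counts how many times the "file" is opened

@lru_cache(maxsize=None)
def band_names(path):
    OPEN_COUNT['n'] += 1  # a real implementation would open `path` here
    return ('band1', 'band2', 'band3')

band_names('/data/a.hdf')
band_names('/data/a.hdf')  # served from the cache, no second open
assert OPEN_COUNT['n'] == 1
```

The cache key is the path string, so this only helps when the same path is passed repeatedly, which matches the interface concern raised above.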
@ashoka1234
- VRT generation is a separate problem from product definition; a product can reference different VRT files for different bands. You are right about the `dataset.transform` property being confusing and suggesting that all bands should cover the same area at the same resolution. The reality is that we already have products with bands that have slightly different footprints and different resolutions. The dataset footprint is supposed to be a union of the individual band footprints, not "THE footprint for every band in the Dataset".
- Having files written out next to the VRT would simplify the indexing step. Users will want to index this product into their own custom databases. Having "the more standard" way of just calling `datacube dataset add`, instead of learning how to run this custom tool, would be preferable.
- I hadn't realised we had products with different-resolution bands. Since we do, this is definitely what we should use for these ASTER products.
- We had originally specified not to write out YAMLs. However, in this case, since we have to write out and manage VRTs, we should also write out YAML files.
Scanning about 100 ASTER dataset directories, the statistics are as follows:

- Total datasets: 2505
- `vnir` bands present: 1434
- `swir` bands present: 682
- `tir` bands present: 2505

It seems `tir` is present in all datasets checked. If this behaviour generalises to the whole collection, then `tir` could potentially provide the minimal set of measurements to be included in the product definition. But I personally feel the product definition should provide maximal measurement info.
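A scan like the one behind these statistics could be sketched as below. `count_sensor_presence` and the sample band lists are illustrative assumptions; the real script would obtain band names from a GDAL subdataset query per `.hdf` file.

```python
# Hedged sketch: count in how many datasets each sensor's bands appear,
# given an iterable of per-dataset band-name lists.
from collections import Counter

SENSORS = ('vnir', 'swir', 'tir')

def count_sensor_presence(datasets):
    counts = Counter()
    for bands in datasets:
        for sensor in SENSORS:
            if any(name.startswith(sensor) for name in bands):
                counts[sensor] += 1
    return counts

sample = [['vnir1', 'tir1'], ['tir1'], ['vnir1', 'swir1', 'tir1']]
presence = count_sensor_presence(sample)  # tir in all 3, vnir in 2, swir in 1
```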
@ashoka1234 that's another datacube limitation. I think the most sane way is to have exactly the same set of bands across both product and dataset definitions.

- Having extra bands in a dataset (those not defined in the product) is completely useless, as they won't be loadable by `dc.load`.
- Listing all possible bands in a product definition and only including the available bands in a given dataset kinda makes sense, but is not supported by datacube. First of all, the indexing step will reject such a dataset. Secondly, there is probably a lot of code in datacube that assumes any band name listed in a product definition will also appear in every dataset from that product.

I see 2 possible workarounds here:

- (this is my preferred choice) We generate a VRT that replaces missing bands with whatever construct is available in the VRT toolkit; then the bands section of the Dataset yaml is straightforward, and there is a consistent mapping from band name to band index within the VRT.
- The yaml includes all bands but has a "broken path" for those bands that are missing. That way you can index and load data so long as you use the `ignore_missing` option.

The advantage of option (1) is that it is simple to verify the correctness of the yaml, as you don't need to keep track of which netcdf variable ended up with which band index in the VRT. Option (2) is basically a workaround for datacube not having a way to indicate that a particular band is missing from a particular dataset.
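The stable band-to-index mapping that option (1) buys can be sketched as follows. `PRODUCT_BANDS`, `band_layout`, and the band names are hypothetical; the point is only that indices are assigned from the full product band list, so they stay fixed whether or not a band is present.

```python
# Hedged sketch: every product band gets a fixed (1-based) VRT band
# index; bands absent from a given dataset are flagged so the VRT
# generator can emit a placeholder band in their slot.
PRODUCT_BANDS = ['vnir_band1', 'vnir_band2', 'vnir_band3']

def band_layout(available):
    """Map each product band to its VRT band index, flagging the ones
    that must be filled with a placeholder in the generated VRT."""
    return {band: {'index': i, 'placeholder': band not in available}
            for i, band in enumerate(PRODUCT_BANDS, start=1)}

layout = band_layout({'vnir_band1', 'vnir_band3'})
# indices are stable regardless of which bands are present:
assert layout['vnir_band3']['index'] == 3
assert layout['vnir_band2']['placeholder'] is True
```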
@Kirill888 As per the discussion we had, we will retain three products corresponding to the three different sensors, since these sensors can be switched on and off and, depending on that, the corresponding bands are present in the dataset. I will also make sure the band mappings are consistent.

As for the consistency of band mappings, the code currently relies on a consistent ordering of the subdatasets returned by the corresponding `GDAL` query. All the dataset files I checked seemed to have the same consistent order, but maybe we should not rely on this `GDAL` behaviour.

Yes, for consistency I sort the bands defined in the global constant `PRODUCTS` and assign an integer value as a string in that sorted order.
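The index assignment described in the last sentence can be sketched like this. The band names and the shape of the `PRODUCTS` constant are illustrative assumptions, not the real values from the script.

```python
# Hedged sketch: sort the bands defined for a product and number them
# in that sorted order, as strings, instead of relying on GDAL's
# subdataset ordering.
PRODUCTS = {  # illustrative subset, not the real constant
    'aster_l1t_vnir': ['ImageData3N', 'ImageData1', 'ImageData2'],
}

def band_indices(product):
    return {band: str(i)
            for i, band in enumerate(sorted(PRODUCTS[product]), start=1)}

band_indices('aster_l1t_vnir')
# {'ImageData1': '1', 'ImageData2': '2', 'ImageData3N': '3'}
```

Because the assignment depends only on the sorted product band list, it is deterministic across files even if GDAL ever returned subdatasets in a different order.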
Looking pretty good. I don't think the tests are being run, though. I can see that `test_index_modis_lpdaac` is run, but nothing for aster.

Try moving the modis test script into `scripts_tests/` and adding that path at the end of `check-code.sh`, where pytest is run.
OK, sure, will do! I thought it was unnecessary for the scripts tests to run on Travis!
NP. I think that if we've gone to the effort of writing tests, we might as well do a bit more and have them run all the time in case anything breaks. :)