datacube-core
Dataset.measurements is a list of dicts not a dict
Hi there! While trying to index some data I got an error where datacube calls keys()
on a list. It looks like the issue is that Dataset.metadata_doc['measurements']
stores a list of dicts in my case, which gets surfaced through Dataset.measurements,
causing an error when check_dataset_consistent
tries to check the measurements.
The fix seems fairly straightforward, just changing the code in Dataset.measurements
to:
@property
def measurements(self) -> Dict[str, Any]:
    # It's an optional field in documents.
    # Dictionary of key -> measurement descriptor
    if not hasattr(self.metadata, 'measurements'):
        return {}
    return self.metadata.measurements[0]
Worked for me. My example only has one measurement, so there might be extra work needed to handle multiple measurements. Or maybe my product document is invalid and this syntax error could be caught earlier on. I'm pretty keen to open a PR if a fix is appropriate!
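To make the failure concrete, here is a small self-contained sketch (this is not the datacube-core code itself, just the shape of the check with the band names from my yaml):

```python
# check_dataset_consistent expects dataset.measurements to be a mapping
# of band name -> measurement descriptor, so it calls .keys() on it.
product_measurements = {"prock"}

# measurements written as a LIST of dicts, as in my dataset yaml:
measurements_as_list = [{"prock": {"grid": "default", "path": "prock6.tif"}}]
try:
    product_measurements.issubset(measurements_as_list.keys())
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'keys'

# The same check passes once measurements is a dict keyed by band name:
measurements_as_dict = {"prock": {"grid": "default", "path": "prock6.tif"}}
assert product_measurements.issubset(measurements_as_dict.keys())

# Note: the measurements[0] fix above only grabs the first list entry,
# so with two bands some measurements would be silently dropped.
```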
Expected behaviour
Dataset.measurements stores a dictionary and datacube dataset add ... works.
Actual behaviour
Traceback (most recent call last):
File "/Users/kieranricardo/anaconda3/envs/odc/bin/datacube", line 10, in <module>
sys.exit(cli())
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/ui/click.py", line 197, in new_func
return f(parsed_config, *args, **kwargs)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/ui/click.py", line 229, in with_index
return f(index, *args, **kwargs)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 178, in index_cmd
run_it(pp)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 173, in run_it
dry_run=dry_run)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 184, in index_datasets
for dataset in dss:
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 49, in dataset_stream
dataset, err = ds_resolve(ds, uri)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/index/hl.py", line 277, in __call__
is_consistent, reason = check_dataset_consistent(dataset)
File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/index/hl.py", line 103, in check_dataset_consistent
if not product_measurements.issubset(dataset.measurements.keys()):
AttributeError: 'list' object has no attribute 'keys'
Steps to reproduce the behaviour
Using my product and dataset files locally I can reproduce this error by:
- datacube product add imagery_product.yml
- datacube dataset add imagery_documents.yml
Environment information
- Which datacube --version are you using? Open Data Cube core, version 1.8.0
- What datacube deployment/environment are you running against? Local datacube deployment using the setup as in https://datacube-core.readthedocs.io/en/latest/ops/db_setup.html
Extra info:
- Python 3.6.10
- MacOS
Edit: Updated with more accurate info
@kieranricardo can you please provide a sample of your dataset and product yamls?
What is most likely happening is that your dataset yaml has measurements defined as a list, while datacube expects a dictionary mapping band name to a dict
describing band files. It is a bit confusing, since the product definition needs to have measurements defined as a list 🤦.
Unfortunately dataset add
doesn't perform enough yaml document validation to report this error at index time, so you get the error at run time instead.
https://datacube-core.readthedocs.io/en/latest/ops/dataset_documents.html https://datacube-core.readthedocs.io/en/latest/ops/product.html
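To illustrate the two shapes side by side (a sketch only; the band name "red" is made up, not from either of your documents):

```yaml
# PRODUCT definition: measurements is a LIST of descriptors
measurements:
  - name: red
    dtype: uint8
    nodata: 0
    units: '1'
---
# DATASET document: measurements is a MAPPING of band name -> file info
measurements:
  red:
    grid: default
    path: red.tif
```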
@Kirill888 thanks for the speedy reply! Ah yep I'm using lists in both my product and dataset yamls 🤦 I was following the first example here: https://datacube-core.readthedocs.io/en/latest/ops/indexing.html
Here's my product yml for reference:
name: prock
description: Outer Darwin Harbour Marine Survey 2015 p-rock (probability of rock) grid
metadata_type: eo3
license: Creative
metadata:
  format:
    name: GeoTIFF
measurements:
  - name: prock
    dtype: uint8
    nodata: NaN
    units: 'probability'
And my dataset yaml:
# UUID of the dataset
id: f884df9b-4458-47fd-a9d2-1a52a2db8a1a
$schema: 'https://schemas.opendatacube.org/dataset'
# Product name
product:
  name: prock
format:
  name: GeoTIFF
crs: "epsg:32752"
grids:
  default:
    shape: [5216, 8827]
    transform: [6.64741850989687, 0.0, 661456.1121158252, 0.0, -6.647669545339101, 8659755.72833947, 0.0, 0.0, 1.0]
measurements:
  - prock:
      grid: "default"
      path: "prock6.tif"
  - dummy:
      grid: "default"
      path: "prock6.tif"
# Timestamp is the only compulsory field here
properties:
  # ODC specific "extensions"
  odc:processing_datetime: 2020-02-02T08:10:00.000Z
# Lineage only references UUIDs of direct source datasets
# Mapping name:str -> [UUID]
lineage: {}  # set to empty object if no lineage is defined
@Kirill888 is there a nice way to delete product definitions? So far I've resorted to just manually deleting rows from postgres.
@kieranricardo changing data in place is a bit of a sore point in datacube, not really well supported. The DB layer basically assumes append only operations or "approximately append only".
You CAN modify products and dataset documents in place, but you need to supply extra command line flags to allow "unsafe changes". There is no delete functionality for anything, there is dataset archive, but it's not what you want in this case.
Some of those limitations come from the lineage-tracking functionality: deleting a dataset that is referenced by a derived dataset should not be allowed, but we should allow deletion of datasets that are not referenced by anyone. That's currently not implemented though.
Thanks, the "unsafe changes" flag is what I was looking for! Although it would be nice to be able to safely update a product/dataset in place if it isn't being referenced. Would you be open to PRs implementing these?
I can't say I fully comprehend where the boundary between safe and unsafe changes lies for product definitions and for dataset documents. I also suspect that the boundary depends on context that cannot be captured by the database itself. For example, if you are still in the "bootstrapping stage" and haven't started using the database, any change that maintains database consistency rules should be OK. If, however, this is a large installation with a long history of use, then the situation is very different.
I believe the current definition of "unsafe", as captured by the implementation (can't really point to any documents on that), applies more to the second case and errs on the side of caution. So you probably should not worry about "unsafe" changes too much.
Having said that, PRs are welcome. In particular tooling for "undo" operations, that are so handy in early development stages, but are missing. Things like "delete dataset", "delete datasets that match certain criteria", "delete product and all its datasets". Those are relatively straightforward from SQL side of things, but there might be complications due to db abstraction layer in the datacube-core.
Probably the easiest is to start at the "SQL layer", i.e. assume that the db structure is fixed (it kinda is) and go from there. For this kind of work I recommend doing it here: https://github.com/opendatacube/odc-tools/tree/master/libs/index/odc/index rather than in the datacube-core
repo itself.
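As a rough sketch of what the SQL side could look like (table and column names are assumed from the default agdc schema; this is untested illustration, not shipped tooling, so verify against your actual deployment first):

```sql
-- Find datasets that no derived dataset references
-- (assumed tables: agdc.dataset, agdc.dataset_source):
SELECT d.id
FROM agdc.dataset d
LEFT JOIN agdc.dataset_source s ON s.source_dataset_ref = d.id
WHERE s.source_dataset_ref IS NULL;

-- Delete a single unreferenced dataset and its dependent rows
-- (assumed table agdc.dataset_location holds file locations):
DELETE FROM agdc.dataset_location WHERE dataset_ref = :dataset_id;
DELETE FROM agdc.dataset_source   WHERE dataset_ref = :dataset_id;
DELETE FROM agdc.dataset          WHERE id = :dataset_id;
```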
@kieranricardo by the way, you should be using just datetime: ...
and not odc:processing_datetime: ...
to specify the timestamp. The latter is for "dataset generation time", but what you really need to supply to datacube is "what time were the pixels captured at", and that goes into the datetime
key, or if it's a time range, into dtr:start_datetime:
and dtr:end_datetime:
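For example (the timestamp values here are made up, not taken from the dataset above), the properties block would look something like:

```yaml
properties:
  # When the pixels were captured -- this is what datacube searches on
  datetime: 2015-06-01T00:00:00.000Z
  # ...or, for acquisitions spanning a time range:
  # dtr:start_datetime: 2015-06-01T00:00:00.000Z
  # dtr:end_datetime: 2015-06-30T23:59:59.000Z
  # When this dataset was generated (optional, a separate concept)
  odc:processing_datetime: 2020-02-02T08:10:00.000Z
```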
thanks for your help @Kirill888. I'll make an issue (if there isn't one already) and PR for some delete tooling in https://github.com/opendatacube/odc-tools/tree/master/libs/index/odc/index.
One other question: I'm having trouble specifying a lineage in eo3 format. There are examples in the docs for eo lineage but I couldn't find any for eo3. I'm trying to add the dataset document I shared to the lineage of another dataset like so:
lineage: {"parent": ["f884df9b-4458-47fd-a9d2-1a52a2db8a1a"]}
But I get:
ERROR Inconsistent lineage dataset f884df9b-4458-47fd-a9d2-1a52a2db8a1a
> $schema: missing!='https://schemas.opendatacube.org/dataset', crs: missing!='epsg:32752', extent: missing!={'lat': {'end': -12.116431349076615, 'begin': -12.433299809465927}, 'lon': {'end': 131.02506209614575, 'begin': 130.48367254613058}}, format: missing!={'name': 'GeoTIFF'}, grid_spatial: missing!={'projection': {'geo_ref_points': {'ll': {'x': 661456.1121158252, 'y': 8625081.483990982}, 'lr': {'x': 720132.8753026848, 'y': 8625081.483990982}, 'ul': {'x': 661456.1121158252, 'y': 8659755.72833947}, 'ur': {'x': 720132.8753026848, 'y': 8659755.72833947}}, 'spatial_reference': 'epsg:32752'}}, grids: missing!={'default': {'shape': [5216, 8827], 'transform': [6.64741850989687, 0.0, 661456.1121158252, 0.0, -6.647669545339101, 8659755.72833947, 0.0, 0.0, 1.0]}}, measurements: missing!={'prock': {'grid': 'default', 'path': 'prock6.tif'}}, product: missing!={'name': 'prock'}, properties: missing!={'odc:processing_datetime': '2020-02-02T08:10:00.000Z'}
Do you know what's going on here?
@kieranricardo you're doing the right thing with respect to lineage. The code should be smarter when dealing with EO3 though. You need to use datacube dataset add --no-verify-lineage ...
when indexing EO3; essentially, since EO3 doesn't include the lineage document in the derived document yaml, there is nothing to verify.
We should update docs, or better auto skip verification step when source dataset is EO3 in code.
EO3 is kind of a bolt-on intermediate step, and a very recent addition, so...
better auto skip verification step when source dataset is EO3 in code
this would be nice! I'll make an issue for this
filed https://github.com/opendatacube/datacube-core/issues/956
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We think this has been resolved.