datacube-core
Querying data around the anti-meridian
Expected behaviour

`dc.find_datasets()` should return a non-empty set when querying data that belongs to a scene spanning the anti-meridian. Similarly, `dc.load()` should return raster data.
Actual behaviour

`dc.find_datasets()` returns `[]` (see steps below).
Steps to reproduce the behaviour

- Add the product definition:

  ```
  datacube product add https://raw.githubusercontent.com/opendatacube/datacube-dataset-config/master/products/ls_usgs_level1_scene.yaml
  ```

- Index a LS8 scene spanning the anti-meridian:

  ```
  wget https://raw.githubusercontent.com/opendatacube/cube-in-a-box/master/scripts/ls_public_bucket.py
  python3 ls_public_bucket.py landsat-pds \
      --prefix c1/L8/074/071/LC08_L1TP_074071_20190622_20190704_01_T1 \
      --suffix MTL.txt \
      --start_date 2019-01-01 \
      --end_date 2019-12-31
  ```

  From the MTL:

  ```
  CORNER_UL_LAT_PRODUCT = -14.85428
  CORNER_UL_LON_PRODUCT = 178.09959
  CORNER_UR_LAT_PRODUCT = -14.83404
  CORNER_UR_LON_PRODUCT = -179.75772
  CORNER_LL_LAT_PRODUCT = -16.97207
  CORNER_LL_LON_PRODUCT = 178.11117
  CORNER_LR_LAT_PRODUCT = -16.94878
  CORNER_LR_LON_PRODUCT = -179.72364
  ```
- Query the data (note that the AoI doesn't cross the anti-meridian):

  ```python
  import datacube

  dc = datacube.Datacube(app='issue-example')

  query = {}
  query['product'] = 'ls8_usgs_level1_scene'  # for this example we will load some data from Landsat 8
  query['group_by'] = 'solar_day'             # datasets can often overlap; combine all data from the same day
  query['time'] = '2019'
  query['x'] = (179.0, 180.0)
  query['y'] = (-17.0, -16.0)

  dc.find_datasets(**query)  # use the query we defined above
  ```
- The above query will only work if one extends the longitude range to the whole strip:

  ```python
  query['x'] = (-180.0, 180.0)
  ```
Environment information

- Which `datacube --version` are you using?

  Open Data Cube core, version 1.7 (`datacube 1.7 py_1 conda-forge`)

- What datacube deployment/environment are you running against?

  On-premise ODC Hub at the Satellite Applications Catapult.
It should be noted that when querying with a local coordinate system that has no discontinuity, both `dc.find_datasets()` and `dc.load()` might work fine, depending on the AoI. E.g. for the AoI described above, the EPSG:3460 equivalent is:

```python
query['x'] = (2026601, 2133765)
query['y'] = (3999978, 4110240)
query['crs'] = 'EPSG:3460'
```
Querying using the above extents works correctly. However, there seems to be a performance trade-off associated with querying in a CRS other than EPSG:4326.
Another way the root issue manifests itself when using EPSG:4326 lon/lat values in `dc.find_datasets()` and `dc.load()` queries is as follows: if the AoI doesn't intersect the bounding box of datasets that span the anti-meridian, but those datasets lie within the latitude extents of the AoI, then they are still returned by both queries.
For example, if I call `dc.find_datasets()` for Vanuatu in our ODC Hub using the following extents:

```python
query['x'] = (166.0, 170.5)
query['y'] = (-20.15, -12.5)
```

I get, among the correct results, datasets that are outside of the longitude extents of the query AND cross the anti-meridian. E.g.:

```
Dataset <id=5e546456-7a14-5f2b-a25e-e3a78d1a3d5e type=s2_epsg3460_geomedian_annual location=s3://public-eo-data/common_sensing/fiji/sentinel_2_geomedian/2019/s2_geomedian_2019-01-01_2019-12-31_epsg3460_179.99_-15.51_-179.49_-14.99_datacube-metadata.yaml>,
Dataset <id=11ccd068-c6de-562b-bda2-913e0fc4b59c type=s2_epsg3460_geomedian_annual location=s3://public-eo-data/common_sensing/fiji/sentinel_2_geomedian/2019/s2_geomedian_2019-01-01_2019-12-31_epsg3460_179.99_-16.01_-179.49_-15.49_datacube-metadata.yaml>,
Dataset <id=55f6a647-3b29-520f-8ef0-1c61b1228662 type=s2_epsg3460_geomedian_annual location=s3://public-eo-data/common_sensing/fiji/sentinel_2_geomedian/2019/s2_geomedian_2019-01-01_2019-12-31_epsg3460_179.49_-16.01_-179.99_-15.49_datacube-metadata.yaml>,
```
In short, as soon as the ODC indexes datasets that span the anti-meridian, even queries "far away" from the anti-meridian are affected, as long as they request latitude extents in which those datasets have data, regardless of how the query's longitude extents compare to theirs.
Seems to be a case of longitude value reordering, whereby e.g. the segment from lon/lat 179.0/0.0 to -179.0/0.0 gets rewritten as -179.0/0.0 to 179.0/0.0. The former just crosses the anti-meridian while the latter goes almost all the way around the globe. A point such as 179.5/0.0 lies on the former but not on the latter; conversely, 170.0/0.0 does not lie on the former but does lie on the latter. So any check of whether a point is within the original segment returns the opposite of what one expects once the segment's from/to coordinates are swapped. Extend this concept to 2D geometry, where you check for the intersection of a dataset bounding box and an AoI, and queries return the opposite of what's expected.
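To illustrate with a naive hypothetical range check (not ODC code), assuming longitudes normalised to [-180, 180]:

```python
# Naive containment test that assumes begin <= end, i.e. no anti-meridian handling.
def lon_in_range(lon, begin, end):
    return begin <= lon <= end

# Real segment: from 179.0 eastwards across the anti-meridian to -179.0 (2 degrees wide).
# Reordered, it becomes the range (-179.0, 179.0): 358 degrees wide.
print(lon_in_range(179.5, -179.0, 179.0))  # False, yet 179.5 IS inside the real segment
print(lon_in_range(170.0, -179.0, 179.0))  # True, yet 170.0 is NOT inside the real segment
```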
Alright, I can confirm, but I have no fix to offer, although I do have an explanation of what is going on:
Firstly, the indexing script is not doing a good job of computing lat/lon bounds:

```json
"extent": {
  "coord": {
    "ll": {"lat": 178.1111678295364,  "lon": -16.97206849103723},
    "lr": {"lat": -179.7236392320994, "lon": -16.948779916217106},
    "ul": {"lat": 178.09958848792087, "lon": -14.854281278240947},
    "ur": {"lat": -179.7577169684917, "lon": -14.834037287434578}
  }
}
```
This then translates to a valid span of the dataset along `lat` as `-179.7577169684917, 178.1111678295364` (`ds.metadata.lat`). So essentially it covers everything that this dataset doesn't cover and excludes exactly the range this dataset does cover.
The native footprint of this dataset is this:

```python
from datacube.utils.geometry import box

poly = box(618300, -1876800,
           849000, -1642500,
           'EPSG:32660')
display(poly.crs, poly.geom)
```
Need to figure out how to produce a proper extent from that, given the anti-meridian issue, then index the corrected version and then see if things are still broken afterwards. I won't be surprised if they are, since not a lot of people are operating in the Pacific region currently.
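One possible approach, sketched below. This is not the ODC implementation: it assumes `pyproj` is available, only samples the four corner points (edges can bulge slightly past them), and assumes the footprint is narrower than a hemisphere:

```python
from pyproj import Transformer

# Corners of the native footprint in EPSG:32660 (see appendix below).
corners = [(618300, -1876800), (849000, -1876800),
           (618300, -1642500), (849000, -1642500)]

to_geo = Transformer.from_crs('EPSG:32660', 'EPSG:4326', always_xy=True)
lons, lats = zip(*(to_geo.transform(x, y) for x, y in corners))

# A raw lon span wider than 180 degrees suggests an anti-meridian crossing:
# shift negative longitudes into [180, 360) and recompute the range.
if max(lons) - min(lons) > 180:
    lons = tuple(lon + 360 if lon < 0 else lon for lon in lons)

print('lat range:', (min(lats), max(lats)))
print('lon range:', (min(lons), max(lons)))  # ~(178.10, 180.28) rather than (-179.76, 178.11)
```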
Appendix
The native footprint is as follows:

```json
"grid_spatial": {
  "projection": {
    "geo_ref_points": {
      "ll": {"x": 618300.0, "y": -1876800.0},
      "lr": {"x": 849000.0, "y": -1876800.0},
      "ul": {"x": 618300.0, "y": -1642500.0},
      "ur": {"x": 849000.0, "y": -1642500.0}
    },
    "spatial_reference": "EPSG:32660"
  }
}
```
The axis swapping is because of GDAL 3, in the prepare script, at least on my side. Which GDAL version did you use?
After swapping `lat<->lon` and adding 360 to negative longitude values, I have been able to query the dataset. Steps:

- Find the dataset without spatial constraints
- Dump `ds.metadata_doc` to a file in JSON format (see the sketch after this list)
- Edit and fix the `extent` subtree
- Update the version in the database: `datacube dataset update ds-fixed.json --allow-any=extent.coord`
- Then repeat the query
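A minimal sketch of the dump step, assuming the product contains just the one dataset indexed above; the `ds-fixed.json` file name matches the update command in the list:

```python
import json

import datacube

dc = datacube.Datacube(app='fix-extent')

# Find the dataset without spatial constraints (assumes exactly one match).
[ds] = dc.find_datasets(product='ls8_usgs_level1_scene')

# Dump the metadata document so the extent subtree can be fixed by hand.
with open('ds-fixed.json', 'w') as f:
    json.dump(ds.metadata_doc, f, indent=2, default=str)
```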
We are using GDAL 2.4.1
> After swapping `lat<->lon` and adding 360 to negative longitude values, I have been able to query the dataset.
What coordinate system did you use for querying, EPSG:32660 or EPSG:4326? Do you see a performance difference (total load time) when using one or the other?

Also, adding 360 degrees to negative longitude values moves the discontinuity to the Prime meridian, so it should be done selectively.

Also, do you reckon this issue also explains the following behaviour? https://medium.com/@luigidifraia/querying-satellite-imagery-at-the-anti-meridian-using-the-open-data-cube-661ba1cd553a
> Also, adding 360 degrees to negative longitude values moves the discontinuity to the Prime meridian, so it should be done selectively.
I meant to say that I did it manually in an editor, just to check what a "correct lon range" should be and whether querying would work in that case. This is not a proposed solution for fixing the lat/lon bbox computation in the prepare script.
> Firstly, the indexing script is not doing a good job of computing lat/lon bounds:
>
> ```json
> "extent": {
>   "coord": {
>     "ll": {"lat": 178.1111678295364,  "lon": -16.97206849103723},
>     "lr": {"lat": -179.7236392320994, "lon": -16.948779916217106},
>     "ul": {"lat": 178.09958848792087, "lon": -14.854281278240947},
>     "ur": {"lat": -179.7577169684917, "lon": -14.834037287434578}
>   }
> }
> ```
What do you mean by "not doing a good job"? The above corners seem to match the MTL contents? Here again for reference:

```
CORNER_UL_LAT_PRODUCT = -14.85428
CORNER_UL_LON_PRODUCT = 178.09959
CORNER_UR_LAT_PRODUCT = -14.83404
CORNER_UR_LON_PRODUCT = -179.75772
CORNER_LL_LAT_PRODUCT = -16.97207
CORNER_LL_LON_PRODUCT = 178.11117
CORNER_LR_LAT_PRODUCT = -16.94878
CORNER_LR_LON_PRODUCT = -179.72364
```
> This then translates to a valid span of the dataset along `lat` as `-179.7577169684917, 178.1111678295364`
That's what appears to be wrong, i.e. the translation.
@luigidifraia this is my main problem with the EO format: it confuses everyone into thinking that extent is supposed to contain the four corners of the dataset in Lon/Lat, and that's NOT CORRECT. The extent section is supposed to capture the bounding box of the dataset in Lon/Lat, but for some reason it insists on using 4 points to represent that, leading to confusion. Also, if you read carefully you'll see that the lat and lon coordinates are swapped (this is a GDAL 3 problem).
Now if you treat the above as a range, you get the range for `lon` (after fixing the axis swapping) being from -179.758 to 178.111, but it should be from 178.111 to 360 - 179.758 = 180.242. But fixing the prepare script is not enough: the querying code will make the same mistake when converting a query that spans the `lon=180` line.
@Kirill888 Yes, the GDAL 3 lon/lat swapping problem has been noted before. Also noted re extents.

Do you reckon the one discussed here is also the root issue causing the unexpected behaviour discussed in my Medium post (moving query extents can cause datasets not to be returned by the dc)? I appreciate you might not have a similar environment to reproduce, so it's more of a request for your opinion rather than for troubleshooting.
With the current state of the prepare script and querying code I'm surprised you didn't see more brokenness than you report.
We already established that metadata is not recorded correctly in the database: the range for the `lon` axis is recorded as the inverse of the real range for datasets that span the `lon=180` line. This means that any query that falls within the dataset's valid range will not find that dataset; however, such a dataset will still be found if the query goes outside of its valid range. Remember that the database only filters datasets based on Lon/Lat bounds (as recorded in the database); there is no GIS query. Further filtering is then applied by Python code to remove datasets that do not overlap with the original query, which can be in any projection and can use arbitrary geometry. This filtering happens in the CRS of the query, so the assumption is that for every dataset found by the database its footprint can be re-projected into the query space without issues. The query polygon and dataset polygon are then tested for overlap. Datasets that do not overlap with the query polygon are dropped; user/loading code never sees them.
So to summarise:

1. Database records lon/lat ranges per dataset (:x: broken)
2. User query is transformed into a lon/lat range (:x: broken for projected queries that span `lon=180`)
3. Dataset native footprint is transformed to query space (:x: broken for datasets that span `lon=180` when the query is in `epsg:4326`; you end up with self-crossing geometries)
4. Compute overlap between query and dataset polygons (:x: broken for lon/lat queries when the dataset spans `lon=180`, because of (3))
So whether you query in UTM or in lat/lon you'll hit a problem, even if you fix the prepare script and record per-dataset ranges correctly.
This is where (2) is broken: https://github.com/opendatacube/datacube-core/blob/183257f1be39c87f6d0157cae29f2f855f07a8cf/datacube/api/query.py#L114-L123

The code above needs to be moved out into a testable function and should handle the `lon=180` crossing case in a robust fashion that does not add too much computational cost for the "common case".
Example of incorrect computation of the `lon` range:

```python
from datacube.utils.geometry import box
from datacube.api.query import Query

poly = box(618300, -1876800,
           849000, -1642500,
           'EPSG:32660')
q = Query(geopolygon=poly)
print(f'Lat: {q.search_terms["lat"]}')
print(f'Lon: {q.search_terms["lon"]}')
```

Output:

```
Lat: Range(begin=-16.97206849103723, end=-14.834037287434578)
Lon: Range(begin=-179.7577169684917, end=179.95735806974955)
```
Steps (3) and (4) are broken here:

https://github.com/opendatacube/datacube-core/blob/183257f1be39c87f6d0157cae29f2f855f07a8cf/datacube/api/core.py#L676-L682

Specifically, `dataset.extent.to_crs(query_crs)` produces a wrong result when `query_crs` is geographic and `dataset.extent` crosses `lon=180`.
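A quick way to observe the symptom, reusing the native footprint from the appendix above (a sketch; the exact geometry API may differ between datacube versions):

```python
from datacube.utils.geometry import CRS, box

# Native footprint of the scene in EPSG:32660 (from the appendix above).
poly = box(618300, -1876800, 849000, -1642500, 'EPSG:32660')

# Reprojecting to a geographic CRS makes the x (lon) coordinates jump between
# ~178 and ~-179 along the eastern edge, yielding a ring that wraps the wrong
# way around the globe.
geo = poly.to_crs(CRS('EPSG:4326'))
print(geo.boundingbox)  # the lon span covers nearly the whole globe
```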
For future development: would it make sense to let users decide whether to use EPSG:4326 (the default) or a different coordinate system when initialising the datacube system, effectively avoiding (2) and (4) (and additionally (3) if the query uses EPSG:4326)? That would not require customising the code for edge cases, and thus would not impact performance for the common case (nowhere near `lon=180`). Although I guess you would still have issues with coordinate systems in which latitude values increase going south.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
There was some progress on this; see the referenced commits. Lon/lat bounds are now computed properly, but I believe steps (3) and (4) are still broken.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.