
Querying data around the anti-meridian

luigidifraia opened this issue 4 years ago · 18 comments

Expected behaviour

dc.find_datasets() should return a non-empty list when querying for data belonging to a scene that spans the anti-meridian. Similarly, dc.load() should return raster data.

Actual behaviour

dc.find_datasets() returns [] (see steps below).

Steps to reproduce the behaviour

  • Add product definition:

    datacube product add https://raw.githubusercontent.com/opendatacube/datacube-dataset-config/master/products/ls_usgs_level1_scene.yaml
    
  • Index a LS8 scene spanning the anti-meridian:

    wget https://raw.githubusercontent.com/opendatacube/cube-in-a-box/master/scripts/ls_public_bucket.py
    
    python3 ls_public_bucket.py landsat-pds \
    --prefix c1/L8/074/071/LC08_L1TP_074071_20190622_20190704_01_T1 \
    --suffix MTL.txt \
    --start_date 2019-01-01 \
    --end_date 2019-12-31
    

    From the MTL:

    CORNER_UL_LAT_PRODUCT = -14.85428
    CORNER_UL_LON_PRODUCT = 178.09959
    CORNER_UR_LAT_PRODUCT = -14.83404
    CORNER_UR_LON_PRODUCT = -179.75772
    CORNER_LL_LAT_PRODUCT = -16.97207
    CORNER_LL_LON_PRODUCT = 178.11117
    CORNER_LR_LAT_PRODUCT = -16.94878
    CORNER_LR_LON_PRODUCT = -179.72364
    
  • Query the data (note that the AoI doesn't span the anti-meridian):

    import datacube
    
    dc = datacube.Datacube(app='issue-example')
    
    query = {}
    query['product'] = 'ls8_usgs_level1_scene'  # for this example we will load some data from Landsat 8
    query['group_by'] = 'solar_day'  # datasets can often overlap, we will combine all data that occurs on the same day
    
    query['time'] = '2019'
    
    query['x'] = (179.0, 180.0)
    query['y'] = (-17.0, -16.0)
    
    dc.find_datasets(**query)  # use the query we defined above
    
  • The above query will only work if one extends the longitude range to the whole strip:

    query['x'] = (-180.0, 180.0)

Environment information

  • Which datacube --version are you using?

    Open Data Cube core, version 1.7
    datacube                  1.7                        py_1    conda-forge
    
  • What datacube deployment/environment are you running against?

    On-premise ODC Hub at the Satellite Applications Catapult.

luigidifraia avatar Mar 06 '20 11:03 luigidifraia

It should be noted that when querying in a local coordinate system without a discontinuity, both dc.find_datasets() and dc.load() might work fine, depending on the AoI.

E.g. for the AoI described above, the EPSG:3460 equivalent is:

query['x'] = (2026601, 2133765)
query['y'] = (3999978, 4110240)
query['crs'] = 'EPSG:3460'

Querying using the above extents works correctly. However, there seems to be a performance trade-off associated with querying in a CRS other than EPSG:4326.
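
For reference, the projected extents can be derived from the lon/lat corners with pyproj (a minimal sketch, not from the original workflow; it assumes pyproj is installed and, for simplicity, transforms only the AoI's two corner points):

from pyproj import Transformer

# EPSG:3460 is Fiji 1986 / Fiji Map Grid; always_xy keeps (lon, lat) ordering
to_fiji = Transformer.from_crs('EPSG:4326', 'EPSG:3460', always_xy=True)
x0, y0 = to_fiji.transform(179.0, -17.0)  # lower-left corner of the AoI
x1, y1 = to_fiji.transform(180.0, -16.0)  # upper-right corner of the AoI

query = {'x': (x0, x1), 'y': (y0, y1), 'crs': 'EPSG:3460'}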

luigidifraia avatar Mar 09 '20 16:03 luigidifraia

When using 'EPSG:4326' lon/lat values in dc.find_datasets() and dc.load() queries, another way the root issue manifests itself is as follows: if the AoI doesn't intersect the bounding box of datasets that span the anti-meridian, but those datasets fall within the latitude extents of the AoI, then they are still returned by both queries. For example, if I call dc.find_datasets() for Vanuatu in our ODC Hub using the following extents:

query['x'] = (166.0, 170.5)
query['y'] = (-20.15, -12.5)

I get, among the correct results, datasets that are outside of the longitude extents of the query AND cross the anti-meridian. E.g.:

Dataset <id=5e546456-7a14-5f2b-a25e-e3a78d1a3d5e type=s2_epsg3460_geomedian_annual location=s3://public-eo-data/common_sensing/fiji/sentinel_2_geomedian/2019/s2_geomedian_2019-01-01_2019-12-31_epsg3460_179.99_-15.51_-179.49_-14.99_datacube-metadata.yaml>,
Dataset <id=11ccd068-c6de-562b-bda2-913e0fc4b59c type=s2_epsg3460_geomedian_annual location=s3://public-eo-data/common_sensing/fiji/sentinel_2_geomedian/2019/s2_geomedian_2019-01-01_2019-12-31_epsg3460_179.99_-16.01_-179.49_-15.49_datacube-metadata.yaml>,
Dataset <id=55f6a647-3b29-520f-8ef0-1c61b1228662 type=s2_epsg3460_geomedian_annual location=s3://public-eo-data/common_sensing/fiji/sentinel_2_geomedian/2019/s2_geomedian_2019-01-01_2019-12-31_epsg3460_179.49_-16.01_-179.99_-15.49_datacube-metadata.yaml>,

In short, as soon as the ODC indexes datasets that span the anti-meridian, even queries "far away" from the anti-meridian are affected whenever they request latitude extents in which those datasets have data, regardless of how the longitude extents of the datasets compare to the longitude extents of the query.

This seems to be a case of longitude value reordering, whereby e.g. the segment from lon/lat 179.0/0.0 to -179.0/0.0 gets rewritten as -179.0/0.0 to 179.0/0.0. The former just crosses the anti-meridian, while the latter goes almost all the way around the globe. A point such as 179.5/0.0 belongs to the former but not to the latter; a point such as 170.0/0.0 belongs to the latter but not to the former. So any query that checks whether a point lies within the original segment returns the opposite of the expected result once the segment's from/to coordinates are swapped. Extend this concept to 2D geometry, where one checks for the intersection of a dataset bounding box with an AoI, and queries return the opposite of what's expected.
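
A plain-Python illustration of the flipped membership test (not ODC code):

def in_span(lon, begin, end):
    # Naive test: treat (begin, end) as a simple closed interval
    return begin <= lon <= end

# Real span: from 179.0 eastwards across the anti-meridian to -179.0.
# Reordered span: from -179.0 to 179.0, i.e. (almost) the rest of the globe.
print(in_span(179.5, -179.0, 179.0))  # False, yet 179.5 lies inside the real span
print(in_span(170.0, -179.0, 179.0))  # True, yet 170.0 lies outside the real span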

luigidifraia avatar Mar 17 '20 19:03 luigidifraia

Alright, I can confirm this, but I have no fix to offer. I do have an explanation of what is going on, though:

Firstly, the indexing script is not doing a good job of computing the lat/lon bounds:

"extent": {  "coord": {
      "ll": {"lat": 178.1111678295364, "lon": -16.97206849103723},
      "lr": {"lat": -179.7236392320994,"lon": -16.948779916217106},
      "ul": {"lat": 178.09958848792087,"lon": -14.854281278240947},
      "ur": {"lat": -179.7577169684917,"lon": -14.834037287434578}}

This then translates into the dataset's recorded valid span along lat (ds.metadata.lat) being

-179.7577169684917, 178.1111678295364

So essentially it covers everything that this dataset doesn't cover and excludes exactly the range this dataset does cover.

The native footprint of this dataset is:

from datacube.utils.geometry import box

# The scene's footprint in its native projection (UTM zone 60N)
poly = box(618300, -1876800,
           849000, -1642500,
           'EPSG:32660')
display(poly.crs, poly.geom)  # display() as provided by IPython/Jupyter

I need to figure out how to produce a proper extent from that, given the anti-meridian issue, then index the corrected version and see if things are still broken afterwards. I won't be surprised if they are, since not a lot of people are operating in the Pacific region currently. A sketch of one possible approach follows.
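
A minimal sketch (using pyproj rather than ODC code, and only the four corner points; a robust version would densify the footprint boundary before transforming):

from pyproj import Transformer

# Transform the native-CRS corners to lon/lat; always_xy keeps (lon, lat) order
to_lonlat = Transformer.from_crs('EPSG:32660', 'EPSG:4326', always_xy=True)
xs = (618300, 849000, 849000, 618300)
ys = (-1876800, -1876800, -1642500, -1642500)
lons, lats = to_lonlat.transform(xs, ys)

# A raw span above 180 degrees means the footprint straddles lon=180, so
# shift negative longitudes into the 0..360 range before taking min/max.
if max(lons) - min(lons) > 180:
    lons = tuple(lon + 360 if lon < 0 else lon for lon in lons)
print((min(lons), max(lons)), (min(lats), max(lats)))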

Appendix

The native footprint is as follows:

"grid_spatial": {
    "projection": {
      "geo_ref_points": {
        "ll": {
          "x": 618300.0,
          "y": -1876800.0
        },
        "lr": {
          "x": 849000.0,
          "y": -1876800.0
        },
        "ul": {
          "x": 618300.0,
          "y": -1642500.0
        },
        "ur": {
          "x": 849000.0,
          "y": -1642500.0
        }
      },
      "spatial_reference": "EPSG:32660"
    }
}

Kirill888 avatar Apr 17 '20 07:04 Kirill888

The axis swapping is because of GDAL 3, in the prepare script, at least on my side. Which GDAL version did you use?

Kirill888 avatar Apr 17 '20 07:04 Kirill888

After swapping lat<->lon and adding 360 to negative longitude values, I have been able to query the dataset.

  1. Find the dataset without spatial constraints
  2. Dump ds.metadata_doc to a file in JSON format
  3. Edit and fix the extent subtree (see the sketch below)
  4. Update the database record: datacube dataset update ds-fixed.json --allow-any=extent.coord
  5. Repeat the query
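
For illustration, the fixed extent subtree in ds-fixed.json would look roughly as follows (hypothetical content, derived from the MTL corners above by swapping lat<->lon back and adding 360 to the negative longitudes):

"extent": {
  "coord": {
    "ll": {"lon": 178.11117, "lat": -16.97207},
    "lr": {"lon": 180.27636, "lat": -16.94878},
    "ul": {"lon": 178.09959, "lat": -14.85428},
    "ur": {"lon": 180.24228, "lat": -14.83404}
  }
}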

Kirill888 avatar Apr 17 '20 07:04 Kirill888

We are using GDAL 2.4.1

After swapping lat<->lon and adding 360 to negative longitude values, I have been able to query the dataset.

What coordinate system did you use for querying, EPSG:32660 or EPSG:4326? Do you see a performance difference (total load time) when using one or the other?

Also, adding 360 degrees to negative longitude values moves the discontinuity to the Prime Meridian, so it should be done selectively.

luigidifraia avatar Apr 17 '20 08:04 luigidifraia

Do you reckon this issue also explains the following behavior? https://medium.com/@luigidifraia/querying-satellite-imagery-at-the-anti-meridian-using-the-open-data-cube-661ba1cd553a

luigidifraia avatar Apr 17 '20 08:04 luigidifraia

Also, adding 360 degrees to negative longitude values moves the discontinuity to the Prime Meridian, so it should be done selectively.

I meant to say that I did it manually in an editor, just to check what a "correct lon range" should look like and whether querying would work in that case. This is not a proposed solution for fixing the lat/lon bbox computation in the prepare script.

Kirill888 avatar Apr 20 '20 00:04 Kirill888

Firstly, the indexing script is not doing a good job of computing the lat/lon bounds:

"extent": {  "coord": {
      "ll": {"lat": 178.1111678295364, "lon": -16.97206849103723},
      "lr": {"lat": -179.7236392320994,"lon": -16.948779916217106},
      "ul": {"lat": 178.09958848792087,"lon": -14.854281278240947},
      "ur": {"lat": -179.7577169684917,"lon": -14.834037287434578}}

What do you mean by "not doing a good job"? The above corners seem to match the MTL contents. Here they are again for reference:

CORNER_UL_LAT_PRODUCT = -14.85428
CORNER_UL_LON_PRODUCT = 178.09959
CORNER_UR_LAT_PRODUCT = -14.83404
CORNER_UR_LON_PRODUCT = -179.75772
CORNER_LL_LAT_PRODUCT = -16.97207
CORNER_LL_LON_PRODUCT = 178.11117
CORNER_LR_LAT_PRODUCT = -16.94878
CORNER_LR_LON_PRODUCT = -179.72364

This then translates into the dataset's recorded valid span along lat (ds.metadata.lat) being

-179.7577169684917, 178.1111678295364

That's what appears to be wrong, i.e. the translation.

luigidifraia avatar Apr 20 '20 13:04 luigidifraia

@luigidifraia this is my main problem with the EO metadata format: it confuses everyone into thinking that extent is supposed to contain the four corners of the dataset in lon/lat, and that's NOT CORRECT. The extent section is supposed to capture the bounding box of the dataset in lon/lat, but for some reason it insists on using 4 points to represent that, leading to confusion. Also, if you read carefully, you'll see that the lat and lon coordinates are swapped (this is a GDAL 3 problem).

Now, if you treat the above as a range, you get the range for lon (after fixing the axis swapping) running from -179.758 to 178.111, but it should run from 178.111 to 360 - 179.758 = 180.242. And fixing the prepare script is not enough: the querying code will make the same mistake when converting a query that spans the lon=180 line.

Kirill888 avatar Apr 20 '20 23:04 Kirill888

@Kirill888 Yes, the GDAL3 lon/lat swapping problem has been noted before.

Also noted re extents.

Do you reckon the one discussed here is also the root issue causing the unexpected behavior discussed in my medium post (moving query extents can cause datasets not to be returned by the dc)? I appreciate you might not have a similar environment to reproduce it in, so this is more a request for your opinion than for troubleshooting.

luigidifraia avatar Apr 21 '20 07:04 luigidifraia

With the current state of the prepare script and the querying code, I'm surprised you didn't see more breakage than you report.

We have already established that metadata is not recorded correctly in the database: for datasets that span the lon=180 line, the range recorded for the lon axis is the inverse of the real range. This means that any query that falls within such a dataset's valid range will not find that dataset, while a query that falls outside the dataset's valid range will still find it.

Remember that the database only filters datasets based on the lon/lat bounds as recorded in the database; there is no GIS query. Further filtering is then applied by Python code to remove datasets that do not overlap with the original query, which can be in any projection and can use arbitrary geometry. This filtering happens in the CRS of the query, so the assumption is that the footprint of every dataset found by the database can be reprojected into the query space without issues. The query polygon and each dataset polygon are then tested for overlap; datasets that do not overlap the query polygon are dropped, and user/loading code never sees them.

So to summarise:

  1. The database records lon/lat ranges per dataset (broken)
  2. The user query is transformed into a lon/lat range (broken for projected queries that span lon=180)
  3. The dataset's native footprint is transformed into query space (broken for datasets that span lon=180 when the query is in EPSG:4326; you end up with self-crossing geometries)
  4. The overlap between the query polygon and the dataset polygon is computed (broken for lon/lat queries when the dataset spans lon=180, because of (3))

So whether you query in UTM or in lat/lon, you'll hit a problem even if you fix the prepare script and record per-dataset ranges correctly.
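
A plain-Python illustration of the database-side range test (not ODC internals), using the inverted lon range recorded for the dataset above:

recorded_lon = (-179.7577169684917, 178.1111678295364)  # inverted range, as recorded

def db_lon_overlaps(query_lon):
    # The database treats both ranges as plain [begin, end] intervals
    lo, hi = recorded_lon
    qlo, qhi = query_lon
    return qlo <= hi and qhi >= lo

print(db_lon_overlaps((179.0, 180.0)))  # False: inside the dataset's real extent, so not found
print(db_lon_overlaps((166.0, 170.5)))  # True: the Vanuatu query, far from the dataset, wrongly finds it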

Kirill888 avatar Apr 21 '20 23:04 Kirill888

This is where (2) is broken: https://github.com/opendatacube/datacube-core/blob/183257f1be39c87f6d0157cae29f2f855f07a8cf/datacube/api/query.py#L114-L123

The code above needs to be moved out into a testable function that handles the lon=180 crossing case in a robust fashion, without adding too much computational cost to the "common case".

Example of incorrect computation of lon range:

from datacube.utils.geometry import box
from datacube.api.query import Query

poly = box(618300, -1876800,
           849000, -1642500, 
           'EPSG:32660')
q = Query(geopolygon=poly)
print(f'Lat: {q.search_terms["lat"]}')
print(f'Lon: {q.search_terms["lon"]}')

>>>

Lat: Range(begin=-16.97206849103723, end=-14.834037287434578)
Lon: Range(begin=-179.7577169684917, end=179.95735806974955)
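
A minimal sketch of what such a testable function might do (hypothetical helper, not ODC API): among the raw longitudes and their +360-shifted counterparts, pick the representation with the narrower span:

def lon_range(lons):
    # Choose whichever representation of the longitude span (raw, or with
    # negative values shifted by +360) is narrower
    raw = (min(lons), max(lons))
    shifted = [lon + 360 if lon < 0 else lon for lon in lons]
    wrapped = (min(shifted), max(shifted))
    return wrapped if wrapped[1] - wrapped[0] < raw[1] - raw[0] else raw

print(lon_range([178.09959, 178.11117, -179.75772, -179.72364]))
# -> (178.09959, 180.27636) rather than (-179.75772, 178.11117)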

Kirill888 avatar Apr 22 '20 02:04 Kirill888

Steps (3) and (4) are broken here:

https://github.com/opendatacube/datacube-core/blob/183257f1be39c87f6d0157cae29f2f855f07a8cf/datacube/api/core.py#L676-L682

Specifically, dataset.extent.to_crs(query_crs) produces the wrong result when query_crs is geographic and dataset.extent crosses lon=180.
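
A short reproduction sketch (assuming the 1.7/1.8-era datacube.utils.geometry API):

from datacube.utils.geometry import box

poly = box(618300, -1876800, 849000, -1642500, 'EPSG:32660')
print(poly.to_crs('EPSG:4326').boundingbox)
# The reported lon range spans nearly the whole globe instead of the ~2
# degrees around lon=180, because the reprojected polygon self-crosses.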

Kirill888 avatar Apr 22 '20 02:04 Kirill888

For future development: would it make sense to let users decide whether to use EPSG:4326 (the default) or a different coordinate system when initializing the datacube system, effectively avoiding (2) and (4) (and additionally (3) if the query uses EPSG:4326)? That would avoid having to customize the code for edge cases, and thus avoid impacting performance for the common case (nowhere near lon=180). Although I guess you would still have issues with coordinate systems for which latitude values increase going south.

luigidifraia avatar Apr 22 '20 07:04 luigidifraia

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 20 '20 07:08 stale[bot]

There was some progress on this; see the referenced commits. Lon/lat bounds are now computed properly, but I believe steps (3) and (4) are still broken.

Kirill888 avatar Aug 20 '20 07:08 Kirill888

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 18 '20 08:12 stale[bot]