rioxarray icon indicating copy to clipboard operation
rioxarray copied to clipboard

Aligning rioxarray GCPs with CF Conventions

Open emmanuelmathot opened this issue 4 months ago • 10 comments

The Issue

rioxarray currently implements Ground Control Points (GCPs) using a custom GeoJSON approach, while CF conventions section 8.3 defines a standardized method for "coordinate subsampling" (tie points). This creates a potential alignment opportunity for better standards compliance and ecosystem interoperability.

Key Differences

rioxarray's approach: Stores discrete coordinate mappings as GeoJSON within CF-compliant grid mapping coordinates. Simple, lightweight, and pragmatic.

CF section 8.3: Defines interpolated coordinate fields using tie point variables with standardized algorithms. More complex but enables data compression and mathematical interpolation between points.

Why the Divergence?

rioxarray GCPs management was developed right before CF section 8.3 existed - it was only added in CF-1.9 after extensive community discussion. The GeoJSON solution was a reasonable choice given the standards gap at the time.

Alignment Benefits

  • Standards compliance: Full CF convention conformance
  • Ecosystem interoperability: Better integration with climate/geoscience tools
  • Future-proofing: Alignment with the emerging GeoZarr specification
  • Data compression: CF tie points can significantly reduce coordinate storage size

Implementation Challenges

  • Complexity: CF section 8.3 requires interpolation algorithms and extensive metadata
  • Performance: Coordinate interpolation adds computational overhead
  • Backwards compatibility: Must maintain existing GCP API during transition
  • Limited adoption: Few tools currently support CF coordinate subsampling

Questions

  • Is this alignment technically feasible?
  • Are there technical blockers not yet identified?
  • Could we prototype CF tie points support alongside existing methods?

Key Sources

cc @snowman2 @mraspaud

emmanuelmathot avatar Sep 03 '25 06:09 emmanuelmathot

Thanks for bringing this up and for your very helpful issue. This sounds like the path forward for GCPs. Contributions welcome!

snowman2 avatar Sep 03 '25 19:09 snowman2

@Kirill888

snowman2 avatar Sep 03 '25 19:09 snowman2

Thanks for a ping @snowman2

@emmanuelmathot this looks like a rather complex system of defining reversible mapping between image and world planes. This would require a Python module of it's own to implement. There are currently too many unknowns to consider any implementation seriously

  • Is there reference implementation?
  • Is there sample data?
    • for all possible sub-sampling methods
    • with and without missing data
    • covering various edge cases
  • Is there production data?
    • Big enough to take advantage of blocks and compression
  • Does netcdf library currently support this?
    • Does it expose low level metadata, or just points/coord arrays after "unpacking"?
    • Does it support creation from "raw metadata"?
    • How does this interact with "windowed" reads (crop)?
  • Will GDAL support it directly, or will it need "unpacked tie points"?

I don't think that there is significant advantage to keeping one to one mapping between any storage standard and in memory representation of the data. This is especially true for rioxarray that supports multiple input data formats. The way GCP data is stored in rioxarray (and in odc-geo also) is directly compatible with rasterio/GDAL in-memory representation of such data (it's basically unpacked GeoTIFF ModelTiepointTag), so it can be trivially passed on to GDAL to perform projection change operations.

If I were asked to implement support for CF 8.3 in odc-geo I would do it as a once off (on first access) conversion from CF 8.3 to internal representation, which is aligned to GDAL data model. Maybe I would also capture raw metadata in some way to allow for creation of outputs that only change pixel values without any loss/noise in coordinate representation of the transformed data.

Kirill888 avatar Sep 04 '25 01:09 Kirill888

I agree with @Kirill888 that the internal representation should be simple and stupid, especially if it matches the rasterio/gdal representation.

Now, the cf description is more complex because it covers cases that, AFAIK, gdal can't handle (i can describe one if needed). So whether we want to support this representation in rioxarray comes down to the objective of this package: provide an interface between xarray and rasterio or provide a full fledged geographic layer on top of xarray? Ping @djhoese that has initiated the geoxarray project to cover the latter case.

mraspaud avatar Sep 04 '25 06:09 mraspaud

Thanks for the feedback, @Kirill888 and @mraspaud. I agree with your pragmatic approach - converting CF 8.3 to GDAL-aligned representation on first access is indeed the right starting point.

You raise a valid point about location. After researching, I see three options:

  1. rioxarray - for a pretty simple ties points interpolation to rasterio GCPs to be used in GDAL.
  2. cf-xarray - Already handles CF conventions and I will probably open an issue there.
  3. xarray - The rasterix module is something pretty similar. I wonder how this aligns with the usage of tie points. I understand it is mainly about virtual coordinates from affine transform

Our specific need is for Sentinel-1 GRD EOPF Zarr datasets where GCPs would need interpolation across overview levels.

The key is finding a solution that works for both immediate needs (Sentinel-1 EOPF) and future extensibility based on CF without creating maintenance burden.

emmanuelmathot avatar Sep 04 '25 07:09 emmanuelmathot

One big challenge (likely a blocker) in reconciling GCPs and CF tie points is that, to my understanding, the latter is used to define the corners of rectilinear interpolation subareas (i.e., the distribution of the CF tie points is regular, with possible gaps) whereas a GCP can be any (row, col) pixel of a raster (i.e., the distribution of GCPs is arbitrary).

Or am I missing something?

benbovy avatar Sep 04 '25 07:09 benbovy

Or am I missing something?

You are correct. Although our use case with Sentinel-1 is covered by CF conventions, the provided GCPs actually form a regular grid over the data, allowing for coordinate interpolation.

INterpolation diagram from CF: Image

My goal is to establish a unique set of tie points that can be used for any data decimation, eliminating the need to create a set of decimated GCPs for each zoom level.

I recognize the challenge posed by discrete points distribution in subareas, but the interpolation mechanism proposed in CF seems to address that issue. Perhaps this is my naive interpretation of the conventions. I am still exploring solutions in this area. 😅

emmanuelmathot avatar Sep 04 '25 07:09 emmanuelmathot

One big challenge (likely a blocker) in reconciling GCPs and CF tie points is that, to my understanding, the latter is used to define the corners of rectilinear interpolation subareas (i.e., the distribution of the CF tie points is regular, with possible gaps) whereas a GCP can be any (row, col) pixel of a raster (i.e., the distribution of GCPs is arbitrary).

Judging by the GDAL docs (Line and Pixel are double precision numbers, not pixel index), tie points don't have to be "pixel centers", so one should be able to translate BoundPoints to GCPs. And if not, there are always coordinate arrays. So, one can always massage this into "workable" representation.

You raise a valid point about location. After researching, I see three options:

I'd say that the best place for this would be cf-xarray for capturing metadata parsing into something directly usable like list of tie points or location arrays. This can then be used by rioxarray to expose GCPs in a format already understood by existing projection change tooling (rioxarray, odc-geo, rasterio, GDAL).

Kirill888 avatar Sep 04 '25 08:09 Kirill888

Judging by the GDAL docs (Line and Pixel are double precision numbers, not pixel index), tie points don't have to be "pixel centers", so one should be able to translate BoundPoints to GCPs. And if not, there are always coordinate arrays. So, one can always massage this into "workable" representation.

Agreed. I guess I was viewing this issue the other way around (i.e., represent GCPs in a CF-compliant way).

The rasterix module is something pretty similar. I wonder how this aligns with the usage of tie points. I understand it is mainly about virtual coordinates from affine transform

rasterix could be a good place to experiment with tie points I think (cc @dcherian @scottyhq). cf-xarray could also be a good place for some convenient conversion API between CF metadata and something else, although the latter would probably be something generic. Both packages may be complementary, similarly to xvec that depends on cf-xarray for encoding/decoding vector geometries to/from CF.

rasterix’s RasterIndex is currently wrapping an affine transform (the index provides lazy spatial coordinates but also allows tracking & handling the affine parameters via the Xarray API). I think that it is totally within rasterix’s scope to extend RasterIndex so it supports GCPs as well. We could also imagine the index accepting CF tie point coordinates as input + some API that generates & returns GCPs on demand, if that makes sense.

benbovy avatar Sep 05 '25 07:09 benbovy

The goal of this change would be to change how rioxarray stores/loads GCPs so it follows a common convention compatible with the community.

The API and behavior will likely remain the same. The only difference would be how the data is stored on the Dataset.

It could potentially lead to support in GDAL and I would suggest opening an issue there to start the discussion.

I agree that interpolation should happen in another package such as rasterix or xarray and not in rioxarray.

snowman2 avatar Sep 05 '25 14:09 snowman2