science-on-schema.org

Representing geospatial data, a set of approaches

Open fils opened this issue 4 years ago • 42 comments

Based on some work going on in the IGSN community, we have been looking at what recommendation to make to this community. With help from @datadavev @ashepherd @abhritchie @dblodgett-usgs @jesserobertson I've worked up the following for discussion.

The goal of this simple data graph is to present 4 options for representing spatial data available to us.

  1. subjectOf link to .geojson (the "linked data" pattern)
  2. the geosparql:hasGeometry.. (the OGC spatial in graph pattern)
  3. JSON literal... (the embed JSON in JSON-LD as a literal pattern see geoblob in the context and the body) This approach requires JSON-LD 1.1
  4. schema.org

This is being put forth for discussion and comments so that we can better refine the example. The basic POV is that there are many ways to represent spatial data, and any given community may have use cases that drive its choice.

For example, the schema.org approach is likely the only item that will deliver spatial data to Google, while the other approaches allow for the representation of CRS parameters or deliver spatial data more aligned with OGC patterns. GeoSPARQL, for example, is likely the best pattern for spatial data in spatially aware triple stores.

This issue is more to provide information on the options and not provide a recommendation.

Gist link: https://gist.github.com/fils/5899894e5d5783f8da0f92043a97badd?short_path=86de4fd

Load to playground: https://tinyurl.com/y9zajhov

{
    "@context": {
        "@version": 1.1,
        "geoblob": {
            "@id": "http://example.com/vocab/json",
            "@type": "@json"
        },
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "description": "http://igsn.org/core/v1/description",
        "geosparql": "http://www.opengis.net/ont/geosparql#",
        "schema": "https://schema.org/"
    },
    "@id": "https://samples.earth/id/do/bqs2dn2u6s73o70jdup0",
    "@type": "http://igsn.org/core/v1/Sample",
    "description": "A fake ID for testing",
    "schema:subjectOf": [
        {
            "schema:url": "https://samples.earth/id/do/bqs2dn2u6s73o70jdup0.geojson",
            "@type": "schema:DigitalDocument",
            "schema:format": [
                "application/vnd.geo+json"
            ],
            "schema:conformsTo": "https://igsn.org/schema/spatial.schema.json"
        }
    ],
    "geosparql:hasGeometry": {
        "@id": "_:N98e75cacc29f40deb555eb583cb162dc",
        "@type": "http://www.opengis.net/ont/sf#Point",
        "geosparql:asWKT": {
            "@type": "http://www.opengis.net/ont/geosparql#wktLiteral",
            "@value": "POINT(-76 -18)"
        },
        "geosparql:crs": {
            "@id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
        }
    },
    "geoblob": {
        "type": "GeometryCollection",
        "geometries": [{
            "type": "Point",
            "coordinates": [-76, -18]
        }]
    },
    "schema:spatialCoverage": {
        "@type": "schema:Place",
        "schema:geo": {
          "@type": "schema:GeoCoordinates",
          "schema:latitude": -18,
          "schema:longitude": -76
        }
      }
}

fils avatar Jun 01 '20 20:06 fils

Just checking but all four methods support all of the various types of spatial data? For example, bounding boxes, sets of points, etc.?

rduerr avatar Jun 01 '20 20:06 rduerr

@rduerr

So the "subjectOf" approach is a link to external GeoJSON, so the full set of spatial geometries GeoJSON can express would be there.

The GeoSPARQL approach is really just WKT as a geo:wktLiteral, so again all the geometry of WKT is there.

The schema.org approach is https://schema.org/GeoShape (with all the wonderful features of that.. sigh).

The JSON literal approach should work, but my only concern there is that I've just never tested it with a highly complex GeoJSON as the literal package. However, it "should" work. If we can break it, that is likely something to report to JSON-LD 1.1 itself as an issue. So my guess (hope) is that this has all been covered.
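As a rough illustration of the GeoSPARQL case, here is a minimal Python sketch that reads a simple 2D WKT point such as the `POINT(-76 -18)` literal in the example and turns it into a schema.org-style block. The function names are hypothetical, the regular expression only covers plain decimal points (not the full WKT grammar), and the longitude-first axis order is an assumption that holds for CRS84 as used in the example.

```python
import re

# Matches a simple 2D WKT point such as "POINT(-76 -18)".
# Other geometry types and scientific notation are deliberately out of scope.
_WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(wkt):
    """Return (x, y) from a WKT POINT string; for CRS84 that is (lon, lat)."""
    match = _WKT_POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def wkt_point_to_schema_geo(wkt):
    """Build a schema:GeoCoordinates block, assuming CRS84 axis order (lon, lat)."""
    lon, lat = parse_wkt_point(wkt)
    return {
        "@type": "schema:GeoCoordinates",
        "schema:latitude": lat,
        "schema:longitude": lon,
    }
```

Anything beyond a single point would be better handled by an existing WKT library than by extending this pattern.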

fils avatar Jun 01 '20 20:06 fils

For those curious about GeoJSON-LD and why it's not here I'd reference you to https://github.com/opengeospatial/SELFIE/issues/52

fils avatar Jun 01 '20 20:06 fils

The only thing you'd have to be careful about with the direct embedding approaches (both WKT and the plain ol' GeoJSON blob) would be performance.

Wouldn't be great UX if you've embedded JSON data as a header in a landing page and a complex geometry requires your browser to load 1 Gb of GeoJSON before the page renders...

jesserobertson avatar Jun 02 '20 04:06 jesserobertson

Yes, if you embed serialized geometry in the triple store it can blow out storage and performance.

In Loc-I we separated out all the geometry data into a separate store, and only serialize it on demand. See http://loci.cat/geometry-data-service.html

We have an instance deployed here https://gds.loci.cat/ e.g. https://gds.loci.cat/geometry/asgs16_sosr/203

There are multiple representations available using conneg & args

Also, simplified geometries:

And centroids:

The URI is the same for all of these, just variations in the args.

Swagger here: https://gds.loci.cat/api/doc/

dr-shorthair avatar Jun 02 '20 05:06 dr-shorthair

@jyucsiro did the implementation. @jyucsiro @benjaminleighton and @dr-shorthair did the design.

dr-shorthair avatar Jun 02 '20 05:06 dr-shorthair

@dr-shorthair @jyucsiro @benjaminleighton nice API!

I guess it's slightly orthogonal to the web architecture that we're proposing in IGSN though, since you don't want to have to understand an API to crawl pages right?

jesserobertson avatar Jun 02 '20 11:06 jesserobertson

One point we raised in the sprint meeting tonight is that these aren't mutually exclusive ways of publishing data - if publishers wanted to support both complex JSON and a simplified geometry for schema.org/Google purposes (e.g. a bounding box or centroid) they could include both serializations in the document at the same time.
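To make the "both at once" idea concrete, here is a minimal Python sketch that derives a simplified schema.org geometry (a point, or a bounding box) from a GeoJSON geometry, so the simplified form can sit alongside the full one. The function names are hypothetical. The `south west north east` ordering of `schema:box` is an assumption based on the latitude-first convention commonly used with schema.org, and the sketch ignores geometries that cross the antimeridian.

```python
def iter_positions(geometry):
    """Yield every [lon, lat] position in a GeoJSON geometry object."""
    if geometry["type"] == "GeometryCollection":
        for member in geometry["geometries"]:
            yield from iter_positions(member)
        return
    # Walk the nested coordinate arrays until we reach individual positions.
    stack = [geometry["coordinates"]]
    while stack:
        item = stack.pop()
        if item and isinstance(item[0], (int, float)):
            yield item
        else:
            stack.extend(item)


def bounding_box(geometry):
    """Return (south, west, north, east) for a GeoJSON geometry."""
    positions = list(iter_positions(geometry))
    lons = [p[0] for p in positions]
    lats = [p[1] for p in positions]
    return min(lats), min(lons), max(lats), max(lons)


def preview_place(geometry):
    """Build a schema:Place holding a point or a bounding box for the geometry."""
    south, west, north, east = bounding_box(geometry)
    if south == north and west == east:
        geo = {
            "@type": "schema:GeoCoordinates",
            "schema:latitude": south,
            "schema:longitude": west,
        }
    else:
        # Assumed ordering: "south west north east" (latitude first).
        geo = {
            "@type": "schema:GeoShape",
            "schema:box": f"{south} {west} {north} {east}",
        }
    return {"@type": "schema:Place", "schema:geo": geo}
```

The returned block could be used as the value of `schema:spatialCoverage`, while the full geometry stays in whichever richer encoding the publisher prefers.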

jesserobertson avatar Jun 02 '20 11:06 jesserobertson

> I guess it's slightly orthogonal to the web architecture that we're proposing

Is it? Instead of inlining the geometry, you can have a URI reference. Then your crawler can follow the link. The basic link to a geometry gets you a web page which is littered with more links which can be crawled.

You can set an HTTP Accept header to get the geometry in whichever representation you want without needing to understand the API - try it!

https://gds.loci.cat/geometry/asgs16_sosr/203 Accept: text/plain or text/turtle or application/json or text/html
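For anyone who wants to try this from a script rather than a browser, here is a minimal Python sketch using only the standard library. The helper names are hypothetical; the URI and media types are the ones quoted above. Whether the service is still reachable and which media types it honours are not something the sketch can guarantee.

```python
from urllib.request import Request, urlopen

GEOMETRY_URI = "https://gds.loci.cat/geometry/asgs16_sosr/203"


def build_geometry_request(uri, media_type):
    """Prepare a GET request that asks for one specific representation."""
    return Request(uri, headers={"Accept": media_type})


def fetch_geometry(uri, media_type, timeout=30):
    """Fetch the geometry and return (content type sent by the server, body bytes)."""
    request = build_geometry_request(uri, media_type)
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()
```

Calling `fetch_geometry(GEOMETRY_URI, "text/turtle")` and then again with `"application/json"` would show the same URI answering in different representations, assuming the server negotiates as described.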

dr-shorthair avatar Jun 02 '20 21:06 dr-shorthair

@jyucsiro could we add schema:GeoShape to the options in GDS?

dr-shorthair avatar Jun 02 '20 21:06 dr-shorthair

and schema:Place (maybe linked into the 'centroid' group, since it is just a lat-long).

dr-shorthair avatar Jun 02 '20 21:06 dr-shorthair

> Is it? Instead of inlining the geometry, you can have a URI reference.

Agreed it's not from the PoV of the crawler. It might be from the PoV of the publisher if this is another service that needs to be run. It's a good option for publishing if they've already got a spatial store somewhere (likely if spatial data is important to them).

jesserobertson avatar Jun 03 '20 20:06 jesserobertson

We explored this in ELFIE. Here's where we landed. https://docs.opengeospatial.org/per/18-097.html#_preview_geometry

Search that doc for "geojson" to see other places that might have interesting content re: this issue.

At the end of the day, I think it's critical to think about "what is the use case?" AND "who is the client?"

The answer to the first will dictate whether you need an "analysis-grade" geometry and all that entails or if you can get away with a "preview-grade" geometry.

The second will answer what encoding you can get away with and what kind of network architecture your client will tolerate.

As unsatisfying as it is, a point and/or convex hull encoded in a schema:geo block is probably about all we should be considering in the world of search indexing.

In other use cases -- linked data graphs, for example -- I think the calculus is much more nuanced but we probably want to lean toward what the GeoSPARQL folks decided and some of the network architecture logic (in line or not) that @dr-shorthair applies in spades there.

dblodgett-usgs avatar Jun 03 '20 21:06 dblodgett-usgs

Building on/recapitulating what @dblodgett-usgs says and summarizing a side conversation that has been running with him, @fils, @jesserobertson, and @ashepherd...

One thing that is becoming clear in the discussions the ELFIEs have provoked is that there is a useful set of data shapes that can be defined and implemented based on media type, with each media type being better aligned to particular clients and their use cases.

JSON-LD (using the schema.org vocabulary) as structured data in HTML is perfectly pitched at the indexing use case and, picking an example entirely at random, Google's expectations as a client. Using schema.org geometries in this context isn't unsatisfying at all - we are speaking the language of the target audience.

Meanwhile, JSON-LD allows us to be more expressive for data engineers and scientists, expanding our vocabulary to use domain ontologies, including GeoSPARQL and its more robust spatial data types (well understood data types that are much easier for me to use in other systems, like PostGIS).

Straight away we can access the content we need using content negotiation. (I know there are nuances here, but it is a good start.)

GeoJSON nicely straddles these worlds allowing us to provide rich spatial representations to things like web applications, but at a cost - most notably the restriction to WGS84 as a CRS. This isn't a problem. GeoJSON is successful because it does a few things and does them well. Where it doesn't meet our needs we have alternatives (JSON-LD+GeoSPARQL). I labour this point because there is a tendency to try and merge representations rather than switch between them. GeoJSON-LD is a good example and is the subject of a whole 'nother thread. To me, however, it is a solution in search of a problem.

Ultimately, we think there's value in binding a default spatial data type to a media type (HTML+JSON-LD: schema.org; JSON-LD: GeoSPARQL; GeoJSON: GeoJSON). This being the open world, we can link across these (as @dr-shorthair shows) and use other vocabularies as appropriate, but a core set of shapes for each media type is surely more developer friendly (it certainly makes this data engineer happy).
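A minimal Python sketch of that binding, for illustration only: a lookup from media type to default spatial encoding, driven by a simplified reading of the Accept header. The concrete media type strings (`text/html`, `application/ld+json`, `application/geo+json`) and the function names are assumptions, not something agreed in this thread, and wildcards such as `*/*` are deliberately ignored.

```python
# Assumed media type strings for the three bindings named above.
DEFAULT_SPATIAL_ENCODING = {
    "text/html": "schema.org",           # landing page with embedded JSON-LD
    "application/ld+json": "GeoSPARQL",  # standalone JSON-LD graph
    "application/geo+json": "GeoJSON",   # plain GeoJSON document
}


def parse_accept(accept_header):
    """Return the media types in an Accept header, highest q-value first."""
    ranges = []
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        ranges.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    ordered = sorted(ranges, key=lambda pair: -pair[1])
    return [media_type for media_type, q in ordered if q > 0]


def pick_encoding(accept_header):
    """Return (media_type, spatial encoding) for the best supported type, else None."""
    for media_type in parse_accept(accept_header):
        if media_type in DEFAULT_SPATIAL_ENCODING:
            return media_type, DEFAULT_SPATIAL_ENCODING[media_type]
    return None
```

A server built this way would answer a request for `application/geo+json` with GeoJSON and a request for `application/ld+json` with GeoSPARQL geometries, without the client needing to know anything beyond the media type.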

abhritchie avatar Jun 03 '20 23:06 abhritchie

Whether to link to or inline geometries is a slightly different problem for which there can be no hard and fast rules. @fils's desire to 'provide information on the options and not provide a recommendation' is wise here.

Sure, it is probably always unwise to inline a non-schema.org geometry in an HTML+JSON-LD landing page, but elsewhere the choice is hugely impacted by the ontology and use case - sometimes minimizing the number of requests a client has to make is better for API performance than minimizing the size of a response.
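One way a publisher might encode that trade-off is a simple size check, sketched here in Python. The threshold value and the function name are hypothetical and chosen purely for illustration; `geoblob` and the `schema:subjectOf` link are the terms from the example at the top of this issue.

```python
import json

# Hypothetical cut-off; a real value would depend on the clients being served.
MAX_INLINE_BYTES = 10_000


def spatial_statements(geometry, geojson_url, max_inline_bytes=MAX_INLINE_BYTES):
    """Inline a small GeoJSON geometry as a JSON literal, otherwise link to it."""
    serialized = json.dumps(geometry, separators=(",", ":")).encode("utf-8")
    if len(serialized) <= max_inline_bytes:
        # Small enough: one response, no extra request for the client.
        return {"geoblob": geometry}
    # Too large: keep the document light and let the client follow the link.
    return {
        "schema:subjectOf": [
            {"@type": "schema:DigitalDocument", "schema:url": geojson_url}
        ]
    }
```

The returned dictionary would be merged into the JSON-LD document for the resource; a single point stays inline, while a detailed polygon becomes a link.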

abhritchie avatar Jun 03 '20 23:06 abhritchie

It's a shame that we had two broadly adopted geometry serializations, WKT and GeoJSON, which already had a lot of support in software and libraries, and then Schema.org had to butt in with their own. Their community process in this area is strangely impervious to prior art (how did this happen @danbri?). But that's life I guess.

Just make sure the type is clear and we can rely on libraries to take care of it I guess.

dr-shorthair avatar Jun 03 '20 23:06 dr-shorthair

FWIW - adding both GeoJSON and Schema.org serializations, alongside WKT and GML, is already on the agenda for the revision of GeoSPARQL

dr-shorthair avatar Jun 03 '20 23:06 dr-shorthair

(We'll likely test it out in http://linked.data.gov.au/def/geox first)

dr-shorthair avatar Jun 04 '20 00:06 dr-shorthair

@abhritchie out of curiosity is there anything against agreeing on an extra 'crs' member in your GeoJSON?

It's not against the spec (see https://tools.ietf.org/html/rfc7946#section-6.1) but I guess this would be non-normative and your json wouldn't work in a webmap straight away.

Might be a better approach than munging everything into EPSG:4326 though...

jesserobertson avatar Jun 04 '20 04:06 jesserobertson

Hahahaha about 5 years ago I tried to get the GeoJSON guys to soften just a little and accept an optional CRS pointer. This really was a point on which they absolutely would not budge. I tried to sell it to them on the basis that without it they exclude some important markets, but nothing doing. They really see the non-CRS niche as big enough. Maybe they're right. They can get quite rude about people who want more, and shoo them off to GML.

dr-shorthair avatar Jun 04 '20 04:06 dr-shorthair

Oh they budged ... into a more restrictive position than in the 2008 specification. The current spec is quite explicit about the use of WGS84: https://tools.ietf.org/html/rfc7946#section-4.

We could take advantage of the wriggle room they give

> However, where all involved parties have a prior arrangement, alternative coordinate reference systems can be used without risk of data being misinterpreted.

but we (I assume) want widespread, not niche, use. The risk of misinterpretation is high.

abhritchie avatar Jun 04 '20 04:06 abhritchie

Yeah - it was while they were moving GeoJSON into IETF that I was talking to them. My interventions possibly caused the robust clarification to be added. Recommended to don an emotional suit-of-armour ahead of every interaction.

dr-shorthair avatar Jun 04 '20 04:06 dr-shorthair

Here's the initial thread http://lists.geojson.org/pipermail/geojson-geojson.org/2013-May/000740.html

dr-shorthair avatar Jun 04 '20 05:06 dr-shorthair

From a purely technical perspective it was a poor decision, but there's merit in that its simplicity does make for a more straightforward implementation path. There are fewer choices to make and things to understand. This is a big factor in its success.

Being in a charitable mood I assume schema.org's (flawed) decision to bake their own serialization is motivated by a similar desire for internal consistency and simplicity.

(Only a cynic would assume ego plays a significant role in the standardization process.)

abhritchie avatar Jun 04 '20 05:06 abhritchie

@dr-shorthair that initial thread, yikes

> The only people who are going to complain about this change are geodesists and other coordinate nerds, but they have the GML book to take shelter with.

jesserobertson avatar Jun 04 '20 05:06 jesserobertson

There's an endearing clarity that comes from ignorance.

Still, there's something the science data community can learn from here. We deal with more complex data and need the freedom to say new or different things, and we need to support multiple communities. But ... we should strive for some simplicity/elegance/consistency wherever possible to help with adoption and uptake: first by embracing GeoJSON and schema.org as is (which is what I understand scienceonschema.org is doing), then by focusing on a complementary effort to fill the gaps.

I'm labouring the point because @dr-shorthair's comment about the revision of GeoSPARQL (now with 100% more WKT, GML, GeoJSON, schema.org) made me sad. It feels a bit like being all things to all people with the effect that, like Vogon spaceships, the spec is not so much constructed as congealed.

Saying 'just make sure the type is clear and we can rely on libraries to take care of it I guess' is easy to type, but as someone writing scripts to parse these and take care of data, it's dispiriting. Especially because in the past I've been criticized for advocating approaches that involve making a lot of nuanced decisions during implementation.

I'll stop now before @fils yells at me for hijacking his issue.

abhritchie avatar Jun 04 '20 05:06 abhritchie

I'm not sure it's as tragic as that @abhritchie.

One thing that GeoSPARQL got right is a clear boundary between semantics and coordinates/shapes. The latter are generally processed by different tools than reasoners, so having a clear transition from the semantic-graph to the geometry-blob is good. And it also means that you can substitute different microformats in the geometry-blob without disturbing the basic integrity of the semantics. There is no suggestion that boundary will be breached in any revision of GeoSPARQL, so I think we are safe.

Having an external geometry-data-service, so the geometry is via a URI-reference, with negotiation about the format of the payload, makes this separation even more clear.

What I worry more about is GeoJSON embedded in JSON-LD - this really does blur the semantic/geometry boundary.

dr-shorthair avatar Jun 04 '20 05:06 dr-shorthair

BTW - the GeoJSON dudes are definitely not ignorant - some very skillful people there. Including the creator of Shapely. They were really trying to find the 90-10 sweet spot.

dr-shorthair avatar Jun 04 '20 05:06 dr-shorthair

Just to be clear @dr-shorthair, the intent was to call that comment ignorant, not the community. Again, we've a lot to learn from them and a lot to gain by simply using it 'as is'.

abhritchie avatar Jun 04 '20 06:06 abhritchie

my 2 cents-- Schema.org is (as far as I can tell) oriented towards dataset/resource level indexing. Yes, their decision to ignore prior art (per @dr-shorthair, above) is unfortunate, but of course not without precedent. It seems to me that from the point of view of dataset indexing, bounding boxes and centroids have been in use for awhile now (under various standards and serialization schemes), and although not perfect, seem to have performed as a good 80/20 solution for indexing/discovery. The current Science on Schema.org recommendations (see https://github.com/ESIPFed/science-on-schema.org/pull/104) attempt to provide a convention for consistency/interoperability at that level.

I think we can treat encoding specific feature locations, e.g. sampling features of various types (boreholes, sample locations, image footprints), or feature extents (geologic polygons, vegetation classes, buildings ...) in data as a separate problem that can be solved by a variety of the solutions mentioned in the discussion above. I'd argue that for data distribution, what we need are conventions to define and identify profiles (data types) that specify particular geospatial location conventions for data conforming to the profile, and content-negotiated services that advertise the profiles available and how to get those representations.

smrgeoinfo avatar Jun 05 '20 17:06 smrgeoinfo