
SPARQL provider

Open ksonda opened this issue 3 years ago • 5 comments

This is a WIP / proof of concept implementing #173. The use case is a data publisher who provides a feature service for a geospatial dataset and also wishes to provide linked data about each feature from an external knowledge graph with a SPARQL endpoint, which may or may not itself contain geometry. I believe this is distinct from the use case addressed by @ldesousa's #615, in which the SPARQL endpoint functions as a geospatial data source in its own right, although it's possible these efforts can be combined. For now this only supports GET-type requests, and it is not meant to modify a triple store.

In this draft, a SPARQL provider plugin is written that imports all other providers. The user can then configure the provider by specifying:

  1. The base data provider and data source as normal
  2. The URL for a SPARQL endpoint
  3. How to construct the URIs for the subjects in the SPARQL query, with options for either
     a) a URI field in the base provider that holds full HTTP URIs, or
     b) a combination of a prefix and a field in the base provider to be appended to it
  4. The URIs of the predicates whose objects the SPARQL query should return, with labels that become feature properties in the GeoJSON output and predicate labels in the JSON-LD output.
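The two options in item 3 could be sketched roughly as follows. This is only an illustration, not the code in the branch: the function name `build_subject_uri` and its parameters `uri_field`, `prefix`, and `suffix_field` are hypothetical and do not correspond to actual configuration keys.

```python
def build_subject_uri(properties, uri_field=None, prefix=None, suffix_field=None):
    """Return the SPARQL subject URI for one feature (hypothetical helper).

    `properties` is the feature's attribute dictionary from the base provider.
    """
    if uri_field is not None:
        # Option (a): the base provider already stores a full HTTP URI.
        return str(properties[uri_field])
    if prefix is not None and suffix_field is not None:
        # Option (b): append the value of a field to a fixed prefix.
        return f"{prefix}{properties[suffix_field]}"
    raise ValueError("configure either uri_field, or prefix and suffix_field")


# Example inputs below are made up for illustration.
print(build_subject_uri({"uri": "http://dbpedia.org/resource/Berlin"}, uri_field="uri"))
print(build_subject_uri({"NAME": "Ohio"},
                        prefix="http://dbpedia.org/resource/",
                        suffix_field="NAME"))
```

A real implementation would probably also need to decide how to handle field values containing spaces or other characters that are not valid in a URI; the sketch leaves them untouched.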

An example configuration and some data are in the branch's docker/examples directory. A demo can also be run as follows:

        docker run -d -p 5000:80 webbben/pygeoapi-sparql-provider

For example, the configuration below tells pygeoapi to serve the features in a local "Places" CSV file, and also to template in the triples for the "populationTotal", "country", and "leaderName" predicates for the corresponding URIs (in the uri field of the CSV) from the DBpedia SPARQL endpoint.

        providers:
          - type: feature
            name: SPARQL
            data: /ext_data/places.csv
            id_field: index
            geometry:
                x_field: lon
                y_field: lat
            sparql_provider: CSV
            sparql_endpoint: https://dbpedia.org/sparql
            sparql_subject: uri
            sparql_predicates:
                population: dbo:populationTotal
                country: <http://dbpedia.org/ontology/country>
                leader: dbpedia2:leaderName

The resulting GeoJSON: [image]

In an alternative specification, the subject URIs are not stored directly in the base provider but are constructed by concatenating the main DBpedia resource prefix with a specified field. In addition, a predicate label can be mapped to several possible predicates, which constructs SPARQL queries with OR semantics:

        providers:
            - type: feature
              name: SPARQL
              data: /ext_data/states.gpkg
              id_field: GEOID
              table: states
            #   uri_field: uri
              sparql_provider: SQLiteGPKG
              sparql_endpoint: https://dbpedia.org/sparql
              sparql_subject: ' :NAME'
              sparql_predicates:
                senator: dbp:senators
                motto: dbo:motto|dbp:motto

The resulting GeoJSON: [image]

This code can probably be cleaned up quite a bit. Also of interest for further exploration: to what extent complex SPARQL queries can be configured, and whether (and how) second- or third-order relationships can be requested.

Possibly of interest to @dblodgett-usgs, @alpha-beta-soup, @pvgenuchten, @jvanulde, @ldesousa

ksonda, May 19 '21 21:05

This is a considerably different use case from the one I proposed in #615. In this formulation only some of the property values are transformed into RDF objects with URIs. In #615 all property names and values are provided as RDF URIs (predicates and objects). #615 is agnostic to geometry type, whereas this proposal seems not to be.

The GeoSPARQL provider proposed in #615 cannot be combined with the hybrid provider proposed here. That would result in duplicate and/or invalid URIs during transformation, i.e. there would be two overlapping RDF sources.

On a more general note, the current architecture assumes a 1-to-1 match between a provider and a geospatial data source type. This proposal departs from that philosophy, so I believe it calls for some pondering.

ldesousa, May 21 '21 13:05

I'll take most of this as an argument for keeping this separate from #615, but I just want to be thorough.

"In this formulation only some of the property values are transformed into RDF objects with URIs." When no predicates are specified, we could make the default behaviour to retrieve all triples that have the URI as subject, and to name the properties after the predicate URIs returned by the SPARQL query.
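A rough sketch of that default, assuming the endpoint returns the standard SPARQL 1.1 JSON results format; the function names and the sample response are made up for illustration and are not part of the draft:

```python
def all_triples_query(subject_uri):
    """Query every predicate/object pair that has the given URI as subject."""
    return f"SELECT ?p ?o WHERE {{ <{subject_uri}> ?p ?o . }}"


def bindings_to_properties(results):
    """Group object values under their predicate URI.

    `results` is a parsed SPARQL JSON results document. Values are collected
    in lists because a predicate can occur more than once per subject.
    """
    properties = {}
    for binding in results["results"]["bindings"]:
        predicate = binding["p"]["value"]
        properties.setdefault(predicate, []).append(binding["o"]["value"])
    return properties


# Made-up response, shaped like SPARQL JSON results.
sample = {
    "head": {"vars": ["p", "o"]},
    "results": {"bindings": [
        {"p": {"type": "uri", "value": "http://example.org/motto"},
         "o": {"type": "literal", "value": "First motto"}},
        {"p": {"type": "uri", "value": "http://example.org/motto"},
         "o": {"type": "literal", "value": "Second motto"}},
    ]},
}
print(all_triples_query("http://example.org/thing"))
print(bindings_to_properties(sample))
```

Using full predicate URIs as property names keeps the output unambiguous, at the cost of long keys in the GeoJSON.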

"#615 is agnostic to geometry type whereas this proposal seems not to be." I'm not sure about this, at least in intent. The intent of this provider is to allow external information to be templated into the OAF collection items, for an OAF collection based on ANY other provider with any geometry type. If the current draft does not support this, then I want to change it. That said...

"The GeoSPARQL provider proposed in #615 can not be combined with the hybrid provider proposed here. This would result in duplicate and/or invalid URIs during transformation. I.e. there would be two RDF sources overlapping."

This is indeed what would happen if the #615 GeoSPARQL provider were specified as the "base provider" of a #690 SPARQL provider. I don't think it would happen if, instead, #690 were refactored so that it could be configured to take the SPARQL endpoint itself as the "base provider", taking GeoSPARQL geometry as in #615, in which case no secondary SPARQL endpoint would be configured. This may all be too complicated, though, in which case I'm fine keeping #615 and #690 separate.

"On a more general note, the current architecture assumes a 1-to-1 match between a provider and a geo-spatial data source type. This proposal splits from that philosophy, thus I believe it begs for some pondering."

If this is truly the case, then we may have to maintain something like #690 as a custom provider for pygeoapi to meet our use case. I don't think this 1:1 assumption is well communicated, though, when the configuration block for providers under resources: is named providers and carries the configuration comment "# list of 1..n required connections information". Is there an example of a configuration out there with a multi-provider collection? I'm happy to meet our use case with separate providers for the same collection. I guess you could just provide an id_field for each provider, and also specify which provider is the "primary" one for @id purposes.

ksonda, May 21 '21 17:05

@ksonda any update on the status of this one?

tomkralidis, Sep 19 '21 23:09

This will go through another round of development with some research activities beginning Oct 1. So...more coming soon.

ksonda, Sep 20 '21 12:09

I renamed this branch to sparql and only just realized that doing so severed the link here. Oops.

webb-ben, Oct 04 '21 19:10

Closing. Feel free to re-open/issue another PR on this work as appropriate.

tomkralidis, Jan 29 '23 01:01