
GNIP 92: non-spatial structured data as (pre)viewable FAIR datasets

Open gannebamm opened this issue 2 years ago • 7 comments

GNIP 92 - non-spatial structured data as (pre)viewable FAIR datasets

Overview

We need to store structured non-spatial datasets, besides geodata, as GeoNode resources. The non-spatial datasets shall provide a simple viewer as a preview and should be usable as part of dashboards. The datasets should be findable and accessible, and provided in an interoperable way, thereby complying with the FAIR principles.

Proposed By

Florian Hoedt, Thünen-Institute Centre for Information Management

We intend to fund this development as part of an upcoming tender process. This GNIP shall start a discussion about how the developed feature could be upstreamed to the main project.

Assigned to Release

This proposal is for GeoNode 4.0

State

  • [x] Under Discussion
  • [ ] In Progress
  • [ ] Completed
  • [ ] Rejected
  • [ ] Deferred

Motivation

Status Quo: Non-spatial but structured datasets like CSV or Excel files can be uploaded as documents. As document objects, these datasets inherit the resource base metadata models but cannot be viewed in a meaningful way. As a research institute, our scientists often use PostgreSQL databases and tables to store and structure their research data. Currently, those datasets cannot be published in any way in GeoNode. We therefore need to store/register structured non-spatial datasets besides geodata as GeoNode datasets (in the meaning of a v4.0 dataset).

Objective: Implement a new category of ResourceBase for structured non-spatial datasets. Instead of using the GeoServer importer to ingest, e.g., shapefiles into the PostGIS-enabled backend, you should be able to define a connection string to the table to use [?]. The non-spatial datasets shall provide a simple viewer as a preview and should be usable as part of dashboards.

Proposal

How to realize the above-mentioned feature is still to be discussed.

As part of an internal discussion, we thought about using PostgREST as an accessible and interoperable tabular data provider. One major aspect is to synchronise authorization mechanisms with the new service. Currently, Django and GeoServer synchronise their roles via GeoFence. Something similar should be implemented for the tabular service provider. There seem to be options to use JWT as part of the django-rest-framework to grant such authorization, as explained here: https://gitter.im/begriffs/postgrest?at=61f06b40742c3d4b21b63843
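To illustrate the idea (a sketch only, not an agreed design: the role name, secret, and helper function are placeholders), Django could mint an HS256 token that PostgREST verifies with its configured `jwt-secret` and maps, via the `role` claim, to a database role:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Unpadded base64url encoding, as required by the JWT spec."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_postgrest_jwt(db_role: str, secret: str, ttl_seconds: int = 3600) -> str:
    """Sign an HS256 JWT whose `role` claim PostgREST maps to a Postgres role."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(
        json.dumps({"role": db_role, "exp": int(time.time()) + ttl_seconds}).encode()
    )
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(signature)}"


# `geonode_viewer` and the secret are illustrative placeholders
token = make_postgrest_jwt("geonode_viewer", "a-shared-secret-of-at-least-32-chars")
```

The request would then carry the token as `Authorization: Bearer <token>`, and PostgREST would execute the query under the named database role.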

Apart from using PostgREST as a tabular data provider, we also considered the new OGC APIs, which may provide enough functionality for this GNIP; for example, the Environmental Data Retrieval API (EDR, https://ogcapi.ogc.org/edr/).

Backwards Compatibility

It is not intended to backport this GNIP to 3.x

Future evolution

Possible future evolutions are yet to be defined.

Feedback

See discussion below...

Voting

Project Steering Committee:

  • Alessio Fabiani: 👍
  • Francesco Bartoli:
  • Giovanni Allegri:
  • Toni Schoenbuchner: 👍
  • Florian Hoedt: 👍


gannebamm avatar Jan 31 '22 14:01 gannebamm

@gannebamm my +1 here, the proposal is actually very good. Of course, we need to carefully choose how to convert the structured data into a "standard" format that GeoNode can use later on. It would be nice if you could prepare/provide a few samples of possible datasets or give an idea of the complexity of the structures. You are speaking about Excel documents, but those might be very complex. We might need to envisage some hooks/harvesters able to parse and store specific formats. We had a similar use case for the Afghanistan Risk Data portal. In that case we had to create some brand-new data structures and parsers able to ingest very specific Excel files for each hazard type.

afabiani avatar Feb 02 '22 13:02 afabiani

Added the GNIP to the wiki page

afabiani avatar Feb 02 '22 13:02 afabiani

Thanks @afabiani for adding it to the wiki page!

It would be nice if you could prepare/provide a few samples of possible datasets or give an idea of the complexity of the structures. You are speaking about Excel documents, but those might be very complex.

here you go: soil example dataset

You can switch the metadata language to English and use the BZE_LW English version. The site.xlsx is the spatial dataset we currently upload as a point layer. The other two xlsx files (LABORATORY_DATA, HORIZON_DATA) are examples of non-spatial datasets. I know that, in the end, everything is spatial somehow, since the lab and horizon datasets explicitly or implicitly reference a sample site. Nonetheless, we would like to publish those as non-spatial datasets and enable custom applications to fetch them in an accessible and interoperable way through an API. An example of this kind of custom application can be seen at soilgrids: if you click a coordinate, you will retrieve loads of additional data.

However, most of our data is already stored in PostgreSQL databases. I know other research institutes also have working databases that could maybe just be integrated. If we used an ORM like SQLAlchemy, we could even open this up to a more diverse set of SQL-like data providers, as explained here. But maybe that is out of scope, and we should stay true and close to our current stack, which uses PostgreSQL.

I will ask my colleagues to provide some more examples.

gannebamm avatar Feb 02 '22 17:02 gannebamm

My +1. thanks Florian

t-book avatar Feb 18 '22 12:02 t-book

@gannebamm this proposal is the natural continuation of the conceptual change we made from "layers" to "datasets" in GeoNode. One of the reasons for renaming these entities was precisely to make room for non-spatial datasets, which are not well represented as "layers".

Before starting the discussion about their presentation (web client, API, standards, whatever), I wonder where we imagine storing these datasets. The first option that comes to my mind is the geonode_data DB, which is the one employed by GeoServer. At the moment GeoNode has no direct connection to that DB, but I have been thinking about this for a while. If we made GeoNode "aware" of the geonode_data DB and models we could:

  • build more advanced and custom analysis and visualization tools on vector spatial datasets
  • expose vector spatial datasets also to the functionality that will be built for non-spatial datasets
  • have a single data store for both non-spatial and vector datasets, even though the latter are directly managed by GeoServer

I know that this goes against the general advice to keep service models separate, and in theory we should only rely on OGC standard interfaces to query spatial datasets, but in the case of GeoNode and GeoServer:

  • they're services composing a single product, so we have total control over the two and their models
  • keeping GeoNode itself tied to the standard interfaces with GeoServer limits the functionality that could be built (or makes it far more complex and less performant)

giohappy avatar Feb 18 '22 13:02 giohappy

As I had an offline discussion with @gannebamm on the topic, I am sneaking in on the discussion.

However, most of our data is already stored in PostgreSQL databases. I know other research institutes also have working databases that could maybe just get integrated.

I think @gannebamm has a slightly different workflow in mind (correct me if I am wrong). The data would not be imported into a central data store, but managed as a reference to an existing database. I guess this is the most flexible and scalable approach, as otherwise you would need to make sure to preserve the structure in geonode_data without conflicts across datasets.

On the other hand, this would bring in the requirement of some sort of default structure anyway, if you do not want to implement special visualizers for each dataset. Maybe it could also be a mappable structure, filled out by the user during import.

Maybe also both scenarios (1. existing DB; 2. import into geonode_data) could be covered. The same questions would still require answers.

matthesrieke avatar Feb 21 '22 17:02 matthesrieke

Maybe the django-dynamic-models technology (https://github.com/rvinzent/django-dynamic-models), evaluated and used (?) for the SOS integration (contrib module), can help with this feature request. See: https://github.com/GeoNode/geonode-contribs/issues/172

gannebamm avatar Apr 28 '22 11:04 gannebamm

Dear @giohappy, together with @gannebamm and @mwallschlaeger we have started to iterate on the requirements and the concept behind a non-spatial dataset feature for GeoNode. We have started a small prototype by setting up a Django app / contrib module. At the moment, uploading data is achieved by providing a CSV file with a sidecar JSON Tabular Data Resource that describes the schema and the types of the fields.
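For illustration (the file and field names are made up), such a sidecar descriptor following the Frictionless Tabular Data Resource specification could look like this:

```json
{
  "name": "laboratory_data",
  "profile": "tabular-data-resource",
  "path": "laboratory_data.csv",
  "schema": {
    "fields": [
      {"name": "sample_id", "type": "integer"},
      {"name": "analyte", "type": "string"},
      {"name": "concentration", "type": "number"}
    ],
    "primaryKey": "sample_id"
  }
}
```

The importer can read the `schema.fields` entries to derive column names and types instead of guessing them from the CSV content.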

@gannebamm pointed us to the new geonode-importer module, and we were wondering if this would be a good fit for ingesting the data. It looks to be designed in a way that allows the addition of custom/new handlers. Do you think it would fit our purpose?

matthesrieke avatar Nov 25 '22 13:11 matthesrieke

Dear @matthesrieke sorry for the late reply.

First of all, a GNIP for the new importer is on its way; we want to make it a community module ASAP. At the moment it's hosted under GeoSolutions' own repo.

As you note, the new importer lets you implement specific handlers, and it can assume complete control of the lifecycle of a resource. For example, the handler is in charge of doing any housekeeping when a resource backed by specific data and tables is deleted. @mattiagiupponi can tell you much more about it since he's the module's author.

So, the primary use case here is to map a GeoNode resource to an external DB. If we generalize this, I'd say that this case isn't strictly related to non-spatial datasets. In our vision, a non-spatial dataset could still be served by GeoServer; that way we can benefit from all the services and WFS-based client tools that we already have, which can work for non-spatial data too. So, follow me, we have two "dimensions" here:

  • implement support for non-spatial datasets on top of the existing tools (here we don't care where the data is located)
  • implement support for alternative DBs to geonode_data (this can work both for spatial and non-spatial datasets)

IMHO we should agree on the first point first, which is the subject of this GNIP. I'm a bit concerned about creating new data models and services. We can improve the current ones (always with backward compatibility in mind!), but I'd try our best to avoid adding complexity.

giohappy avatar Dec 07 '22 16:12 giohappy

@giohappy

I am not sure if I understand the two dimensions stated.

on top of the existing tools (here we don't care where the data is located)

We care where the data is located and would like to ingest it into the PostgreSQL backend for later use.

In our vision, a non-spatial dataset could still be served by Geoserver, that way we can benefit from all the services and WFS-based client tools that we already have.

I think that, because this is not the intended way to use WFS, tools like QGIS are likely to fail to understand non-spatial data served via WFS. Did anyone test this successfully?

@afabiani @matthesrieke @t-book @francbartoli ->

Maybe we should schedule a talk to discuss this? It is getting quite complex, and I think it would help to dig deep into the pros and cons of the possible approaches and define our needs. Maybe @mattiagiupponi can provide a short intro into the importer and the non-spatial data serving capabilities of GeoServer, and @matthesrieke can describe the prototype he developed to test the approach. In the end, less complexity is always welcome.

If other community developers are interested in coming by, I can host a public meeting. Scheduling this will be rough, though. What do you think?

gannebamm avatar Dec 12 '22 16:12 gannebamm

@gannebamm

We care where the data is located and would like to ingest it into the PostgreSQL backend for later use.

I'm not saying that this isn't relevant. My point is to distinguish the two requirements:

  1. Using an external DB, which isn't strictly related to spatial/non-spatial tables. Notice that you can already publish resources from an external DB (we do it frequently), although there isn't a tool for end users to configure it. At the moment you have to configure a new GeoServer store and then publish the layer on GeoNode, e.g. with the updatelayers command. However, it should be quite trivial to do it with the new importer.
  2. Publishing of non-spatial tables. From my experience, QGIS plays nicely with non-spatial WFS featuretypes. I did a quick test to confirm this. The images you see below are the JSON output from WFS and the same table loaded in QGIS. I'm pushing forward this solution because it comes for free (almost).

We're happy to discuss this in a call.

WFS non-spatial table JSON output:

WFS non-spatial table loaded in QGIS:

giohappy avatar Dec 13 '22 10:12 giohappy

@giohappy @afabiani @matthesrieke @t-book @mattiagiupponi (and everyone else interested!) I would like to schedule a web session to talk about the further development of this GNIP with you. I can provide a WebEx room.

There are some open slots next week for me. Please fill out this poll: https://terminplaner4.dfn.de/FOKIDXEtIVBq8sQB

gannebamm avatar Jan 06 '23 13:01 gannebamm

@gannebamm It looks like today is the winner? Is the meeting happening?

t-book avatar Jan 11 '23 09:01 t-book

@t-book @afabiani @giohappy thanks for the quick replies.

I asked the 52N crew if one person is enough on their side. They should answer soon. I will provide the video conference room info by mail.

gannebamm avatar Jan 11 '23 09:01 gannebamm

Hi @gannebamm, I'm sorry that yesterday I was not able to join the meeting, but @giohappy gave me an update. I just created this small doc with the default handler structure which the importer expects to be available: https://github.com/geosolutions-it/geonode-importer/tree/master/importer/handlers Feel free to open new issues on the importer if something is not clear.

mattiagiupponi avatar Jan 12 '23 11:01 mattiagiupponi

Hi @mattiagiupponi thanks for adding the documentation, we will take a closer look. Would you be available for a short meeting to discuss possible technical approaches? Maybe this Thursday between 10-12am? You could also reply to me by mail ([email protected]), so we take this offline from this thread.

matthesrieke avatar Jan 16 '23 11:01 matthesrieke

Hi @mattiagiupponi thanks for adding the documentation, we will take a closer look. Would you be available for a short meeting to discuss possible technical approaches? Maybe this Thursday between 10-12am? You could also reply to me by mail ([email protected]), so we take this offline from this thread.

Hi @matthesrieke, 10 am for about 1 hour is fine for me. I'll keep this here (for now) so we can see if someone else is interested in joining the meeting. If that works for you, I'll send you an invitation for the meeting via email.

mattiagiupponi avatar Jan 17 '23 13:01 mattiagiupponi

thanks @mattiagiupponi ! Yes, 10am tomorrow is fine for me. I will be joined by @autermann and @ridoo

matthesrieke avatar Jan 18 '23 09:01 matthesrieke

@mattiagiupponi I would like to attend, too.

gannebamm avatar Jan 18 '23 11:01 gannebamm

@matthesrieke @gannebamm we're planning to complete the transition to the new importer very soon, and make it the default importer in 4.1.x.

As you know, the new importer lacks a CSV handler. We were waiting to implement a solution to replace the upload steps that we have now, where the lat/lon column can be selected at upload time. We cannot afford to implement a new UI for custom column selection, so our proposal would be the following:

  • preconfigure the CSV handler with OGR X_POSSIBLE_NAMES="x,lon*" and Y_POSSIBLE_NAMES="y,lat*" options
  • accept a companion "*.csvt" file, as supported by the OGR CSV driver

This solution would provide an alternative that's not too expensive or complex to implement, and it gives us the opportunity to remove the current upload system (at the moment it's still required only for CSV files).
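For reference, the csvt convention recognized by the OGR CSV driver is a one-line sidecar file with one quoted type per column. For a hypothetical points.csv with columns id,lat,lon,name, a points.csvt could look like:

```
"Integer","Real","Real","String"
```

The coordinate columns themselves would then be recognized via the preconfigured X_POSSIBLE_NAMES/Y_POSSIBLE_NAMES patterns (here matching lon and lat).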

I'm not against the solution based on Tabular Data Resource and VSI. I think all these options could coexist, letting the handler pick up the best depending on the provided files, with X_POSSIBLE_NAMES and Y_POSSIBLE_NAMES preconfigurations as a fallback.

What's your opinion?

giohappy avatar Feb 13 '23 13:02 giohappy

@matthesrieke please take a look at @giohappy's comment. I do not see that as an issue: we will have two importer handlers, one for geospatial CSV and one dedicated to non-spatial CSV with TDR/VSI. What do you think?

gannebamm avatar Feb 24 '23 18:02 gannebamm

@gannebamm @giohappy We also see no problem. Both solutions can co-exist and serve different use cases (one for simple CSV uploads and one for whole data packages). I like the "csvt solution" as well -- quite pragmatic. How would you communicate the configured name patterns for geometry columns to the user?

ridoo avatar Mar 01 '23 15:03 ridoo

@ridoo @gannebamm unfortunately our experiments with the CSV driver options and the csvt file didn't give the expected results. Apparently, the OGR Python API does not take them into account, and this is problematic since we leverage the API to extract schema information and prepare the dynamic models.

For the moment we have implemented the basic solution, where only a fixed set of column names are recognized. There's a PR ready for an internal review, but if you want to take a look and suggest improvements you're welcome! https://github.com/geosolutions-it/geonode-importer/pull/157

giohappy avatar Mar 01 '23 15:03 giohappy

@giohappy @mattiagiupponi That is a pity to read. I only played around with it on the CLI, so I cannot tell much more on this.

On our datapackage.json approach: we are now able to import CSV data as described in the tabular data descriptor. However, there are still issues to solve:

  • Pass a "fake" SLD file (see first comment below)
  • Add a tabular subtype (see second comment below)
  • Create a simple UI preview (instead of showing an empty map on the detail page)

Comment 1: I see that a layer style is mandatory during upload. As I understand it, I can get errors either from the rest_framework (which requires an SLD) or from geonode.geoserver.helpers.py#get_sld_for(), which tries to get a default style for the layer's name. For now, I am passing a minimal style file, which becomes available under "Styles" in GeoServer but does not appear in the GWC. I can see GeoServer throwing an exception:

02 Mar 15:52:59 DEBUG [geoserver.monitor] - Testing /gwc/rest/layers/geonode:laboratory_data.xml for monitor filtering
geoserver4thuenen_atlas  | 02 Mar 15:52:59 DEBUG [geoserver.monitor] - /geoserver/gwc/rest/layers/geonode:laboratory_data.xml was filtered from monitoring
geoserver4thuenen_atlas  | 02 Mar 15:52:59 ERROR [geoserver.rest] - Unknown layer: geonode:laboratory_data
geoserver4thuenen_atlas  | org.geowebcache.rest.exception.RestException 404 NOT_FOUND: Unknown layer: geonode:laboratory_data
geoserver4thuenen_atlas  |      at org.geowebcache.rest.controller.GWCController.findTileLayer(GWCController.java:45)
geoserver4thuenen_atlas  |      at org.geowebcache.rest.controller.TileLayerController.layerGet(TileLayerController.java:70)

For now I can ignore this, but for the future it would be nice to have a less hackish way of introducing tabular data.


Update: The error happens when GeoNode tries to invalidate the GWC. GeoServer does not know the resource, logs a 404, but actually returns a 500 (you can see it in the browser logs). This makes GeoNode throw an exception ("too many 500 error responses").


Do you think this is the right way to pass a fake SLD file along with the upload of non-spatial/tabular data?
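For reference, the minimal placeholder SLD I pass looks roughly like this (a sketch with an empty rule and no symbolizer; the layer name is illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<StyledLayerDescriptor version="1.0.0"
    xmlns="http://www.opengis.net/sld"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NamedLayer>
    <Name>laboratory_data</Name>
    <UserStyle>
      <Title>Placeholder style for non-spatial data</Title>
      <FeatureTypeStyle>
        <!-- empty rule: there is nothing to render for a non-spatial table -->
        <Rule/>
      </FeatureTypeStyle>
    </UserStyle>
  </NamedLayer>
</StyledLayerDescriptor>
```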

Comment 2: During upload, the non-spatial/tabular data becomes of type VECTOR. Calling http://localhost/geoserver/rest/layers/laboratory_data.xml gives me

<layer>
  <name>laboratory_data</name>
  <type>VECTOR</type>
  <resource class="featureType">
    <name>geonode:laboratory_data</name>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="http://localhost/geoserver/rest/workspaces/geonode/datastores/geonode_data/featuretypes/laboratory_data.xml" type="application/xml"/>
  </resource>
  <attribution>
    <logoWidth>0</logoWidth>
    <logoHeight>0</logoHeight>
  </attribution>
  <dateCreated>2023-03-02 15:49:59.20 UTC</dateCreated>
</layer>

It seems that geonode.geoserver.helpers.py#sync_instance_with_geoserver() maps subtype=dataStore to vector and simply overrides the subtype=tabular of my instance. What do you think would be the best location to make adjustments to add a tabular type?

The PR looks OK at a first glance (could not spend too much time on it, though).

ridoo avatar Mar 02 '23 16:03 ridoo

Hi @ridoo, by default the SLD style is never mandatory during the import phase, neither for the geonode-importer nor for the legacy upload system. GeoNode always tries to create a default one by design, so you get that error.

A possible approach is to use the custom_resource_manager provided by the importer.

This manager is meant to override the default one to exclude common communication with GeoServer during the create/copy/update phases of the resource. I guess in your case you also have to override the "create" method so that GeoNode does not try to create the SLD style, by adding something like this:

def create(self, uuid, **kwargs) -> ResourceBase:
    return ResourceBase.objects.get(uuid=uuid)

NOTE: the layer in GeoServer (as always) should be imported and published by the previous step importer.publish_resource. With the other handlers, we let this be done by the default manager, since it will sync the GeoNode resource with the one in GeoServer.

Then override the handler create_geonode_resource function to use the custom resource manager instead of the default one, with something like:

def create_geonode_resource(
    self, layer_name: str, alternate: str, execution_id: str, resource_type: Dataset = Dataset, files=None
):
   .......
    saved_dataset = custom_resource_manager.create(
        None,
        resource_type=resource_type,
        defaults=dict(
            name=alternate,
            workspace=workspace,
            subtype="raster",
            alternate=f"{workspace}:{alternate}",
            dirty_state=True,
            title=layer_name,
            owner=_exec.user,
            files=list(set(list(_exec.input_params.get("files", {}).values()) or list(files))),
        ),
    )

   .......
    return saved_dataset

Related to the second comment: as we discussed, for now GeoNode is not ready to handle non-spatial resources, and it will require some work to enable it. @giohappy can surely give you more hints on it.

mattiagiupponi avatar Mar 03 '23 09:03 mattiagiupponi

@mattiagiupponi thanks for the hint, I will bypass the importer's resource_manager and use my own.

Yes, we have talked about the limitations regarding non-spatial/tabular data in GeoNode. However, I was unsure if you had further thoughts about possible pitfalls and/or ideas to overcome them :).

ridoo avatar Mar 06 '23 08:03 ridoo

@giohappy @mattiagiupponi Since the GeoNode 4.1 release was postponed but is likely imminent, shall we try to get this new feature into the upcoming 4.1 release?

gannebamm avatar May 02 '23 13:05 gannebamm

Since the GeoNode 4.1 release was postponed but is likely imminent, shall we try to get this new feature into the upcoming 4.1 release?

@gannebamm I'm a bit lost. I don't see a PR connected to this issue, and I'm not sure if a solution has been implemented for the presentation of non-spatial datasets.

giohappy avatar May 08 '23 12:05 giohappy

@ridoo Giovanni is correct. Didn't we create a PR somewhere for this feature?

gannebamm avatar May 09 '23 11:05 gannebamm

@gannebamm we did create PR https://github.com/GeoNode/geonode/pull/10842 which was needed to keep all unpacked files from an uploaded zip-file. However, the actual work to support non-spatial (tabular) data is a bit distributed:

ridoo avatar May 09 '23 12:05 ridoo