
OSM Planet - Examples and geospatial features current status

Open LorenzBuehmann opened this issue 2 years ago • 25 comments

It's more a general question, but is the geospatial support still deployed on the public QLever instance? And did you change the OSM data model somehow? I'm asking because I tried some queries from the examples, e.g.

Query

PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX osm: <https://www.openstreetmap.org/>
PREFIX osmrel: <https://www.openstreetmap.org/relation/>
SELECT ?name WHERE {
  ?osm_id osm:envelope ?envelope .
  ?osm_id osmkey:building "university" .
  ?osm_id osmkey:name ?name .
  FILTER contained(?envelope, "LINESTRING(5 47, 16 56)")
}

which fails with

Error processing query

ParseException, cause: Unexpected input: contained(?envelope, "LINESTRING(5 47, 16 56)") }
Your query was:

...

Query:

PREFIX osm: <https://www.openstreetmap.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
SELECT ?region ?name ?iso31662 ?shape WHERE {
  ?region rdf:type osm:relation .
  ?region osmkey:ISO3166-2 ?iso31662 .
  ?region osmkey:name ?name .
  ?region geo:hasGeometry ?shape .
  FILTER REGEX(?iso31662, "^"DE-")
}

fails with

Error processing query

ParseException, cause: Unexpected input: DE-") }

Query

PREFIX ogc: <http://www.opengis.net/rdf#>
PREFIX osmrel: <https://www.openstreetmap.org/relation/>
PREFIX osmkey: <https://www.openstreetmap.org/wiki/Key:>
SELECT ?castle ?name ?class ?ruins WHERE {
  osmrel:51701 ogc:contains ?castle .
  { { ?castle osmkey:historic "castle" } UNION
  { ?castle osmkey:historic "tower" . ?castle osmkey:castle_type "defensive" } } UNION
  { ?castle osmkey:historic "archaeological_site" . ?castle osmkey:site_type "fortification" }
  ?castle osmkey:name ?name .
  ?castle osmkey:ruins ?ruins .
  OPTIONAL { ?castle osmkey:historic ?class }
  OPTIONAL { ?castle osmkey:archaeological_site ?class }
}

doesn't seem to run according to the UI; I suspect an error that isn't even shown in the UI.

Query

PREFIX osm: <https://www.openstreetmap.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX osmt: <https://www.openstreetmap.org/wiki/Key:>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX osmway: <https://www.openstreetmap.org/way/>
SELECT DISTINCT ?node_1 ?node_2 ?dist ?highway ?maxspeed WHERE {
  ?way osmt:highway ?highway .
  ?way rdf:type osm:way .
  ?way osmway:node ?m .
  ?m osmway:node ?node_1 .
  ?m osmway:next_node ?node_2 .
  ?m osmway:next_node_distance ?dist .
  OPTIONAL { ?way osmt:maxspeed ?maxspeed }
}

leads to an empty result. Debugging the query, it looks like there is no triple matching ?way osmway:node ?m ., i.e. no way has such a node assigned.

Misc

The example "All 16 states of Germany" shows a query that doesn't seem to be restricted to Germany.

LorenzBuehmann avatar May 30 '22 09:05 LorenzBuehmann

Thank you for these questions and queries, Lorenz, I will reply to them one comment at a time.

Concerning your first query ("contains envelope"): This was realized using #413, which is based on an old version of QLever, where values used to be implemented very inefficiently as strings. In the meantime, QLever's handling of values has been completely refactored and is now much more efficient, see #648 and #650 (and some preparatory PRs). In a nutshell, values that fit into an 8-byte integer (that is, most values) are now represented directly in their ID instead of as strings like before.
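To illustrate the idea of storing a value directly in its ID, here is a minimal sketch. The bit layout (4 tag bits plus 60 payload bits) and all names are assumptions made for this example, not QLever's actual encoding.

```python
# Hypothetical layout: 4 tag bits + 60 payload bits in one 64-bit ID.
TAG_BITS = 4
PAYLOAD_BITS = 64 - TAG_BITS
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1
TAG_VOCAB_INDEX = 0  # payload would be an index into a string vocabulary
TAG_INT = 1          # payload is the integer value itself


def encode_int(value: int) -> int:
    """Pack a signed integer into the ID (two's complement payload)."""
    limit = 1 << (PAYLOAD_BITS - 1)
    if not -limit <= value < limit:
        raise OverflowError("value does not fit into the payload")
    return (TAG_INT << PAYLOAD_BITS) | (value & PAYLOAD_MASK)


def decode_int(id_: int) -> int:
    """Recover the integer from the ID without any string lookup."""
    if id_ >> PAYLOAD_BITS != TAG_INT:
        raise ValueError("ID does not hold an inline integer")
    payload = id_ & PAYLOAD_MASK
    if payload >= 1 << (PAYLOAD_BITS - 1):  # sign bit set: negative value
        payload -= 1 << PAYLOAD_BITS
    return payload
```

With such a scheme, comparing or adding two numeric values only needs bit operations on the IDs, whereas a string-based representation needs a lookup and a parse for every value.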

Also note that since #638, QLever supports geof:latitude and geof:longitude for WKT points (example on Wikidata: all objects in a stripe around the 48°N latitude), as well as geof:distance (example on Wikidata: all objects in a 100km ring around Freiburg). These functions are not yet super fast because they parse the WKT strings at query time, but that will be relatively easy to change now with QLever's new value handling.
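As a rough sketch of what parsing WKT points at query time involves, the following code extracts longitude and latitude from a WKT point and computes a great-circle distance. The function names, the haversine formula, and the Earth radius are assumptions for illustration; the thread does not say how QLever computes these values.

```python
import math
import re

# WKT points are written as "POINT(x y)", i.e. longitude first, then latitude.
WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)
EARTH_RADIUS_KM = 6371.0  # mean radius, assumed for this sketch


def parse_wkt_point(wkt: str) -> tuple[float, float]:
    """Return (longitude, latitude) of a WKT point literal."""
    match = WKT_POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def longitude(wkt: str) -> float:
    return parse_wkt_point(wkt)[0]


def latitude(wkt: str) -> float:
    return parse_wkt_point(wkt)[1]


def distance_km(wkt_a: str, wkt_b: str) -> float:
    """Haversine distance between two WKT points in kilometers."""
    lon1, lat1 = map(math.radians, parse_wkt_point(wkt_a))
    lon2, lat2 = map(math.radians, parse_wkt_point(wkt_b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))
```

Running the regular expression for every row of a large result is the per-row cost that precomputed coordinates would avoid.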

@joka921 How much work do you consider it to adapt #413 to the current master? I would expect that much of the magic that we needed for #413 is not needed anymore because the values are now efficient out of the box already.

@LorenzBuehmann Is this a feature which you need urgently or were you just curious?

hannahbast avatar Jun 01 '22 22:06 hannahbast

Concerning your second query ("the 16 states of Germany, selected via their ISO3166-2 code"): In your query, the quotation mark in the REGEX was not escaped. If you escape it, the query works: https://qlever.cs.uni-freiburg.de/osm-planet/C76hnG

We are aware that according to the SPARQL standard, the REGEX should actually be "^DE-" and not "^\"DE-". This is very easy to fix (but needs some care to not miss any cases in the code), we just didn't do it yet.
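The difference between the two patterns can be sketched as follows. The assumption here, inferred only from the workaround, is that the pattern was matched against the literal including its surrounding quotation marks; the sample value is made up.

```python
import re

lexical_value = "DE-BW"                     # what the SPARQL standard matches against
quoted_literal = '"' + lexical_value + '"'  # assumed internal form, quotes included

# Standard behaviour: the pattern sees only the lexical value.
standard = re.search(r"^DE-", lexical_value) is not None

# Workaround from the thread: the pattern has to match the leading quote as well.
workaround = re.search(r'^"DE-', quoted_literal) is not None

# The standard pattern fails when it is applied to the quoted form.
mismatch = re.search(r"^DE-", quoted_literal) is not None
```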

hannahbast avatar Jun 01 '22 22:06 hannahbast

Concerning your third query ("all castles in Switzerland"): As explained in the osm2rdf paper, the ogc:contains predicate (which would be enormous if materialized) is realized via two separate and manageable predicates osm2rdf:contains_area and osm2rdf:contains_nonarea, which are joined at runtime according to the needs of the query. That is, in the terminology you used in https://github.com/ad-freiburg/qlever/discussions/592#discussioncomment-2302667, ogc:contains is implemented as a magic predicate.
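A simplified sketch of how two small relations can stand in for one huge containment relation. It assumes that the area predicate links an area to the areas directly inside it and that the non-area predicate links an area to the non-area objects directly inside it; the function and the toy data are illustrative only and do not reproduce the actual query rewriting.

```python
from collections import deque


def contains(region, contains_area, contains_nonarea):
    """Everything inside `region`: follow area links transitively,
    then collect the non-area objects of every area reached."""
    seen = {region}
    queue = deque([region])
    result = set()
    while queue:
        area = queue.popleft()
        result.update(contains_nonarea.get(area, ()))
        for child in contains_area.get(area, ()):
            if child not in seen:
                seen.add(child)
                result.add(child)
                queue.append(child)
    return result


# Toy data with made-up identifiers.
contains_area = {"country": ["canton"], "canton": ["town"]}
contains_nonarea = {"town": ["castle_node"], "canton": ["tower_node"]}
inside_country = contains("country", contains_area, contains_nonarea)
```

Only the direct links are stored; the full containment relation is produced at query time for the region that was asked for.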

The query rewriting currently happens in the QLever UI and is imperfect because it uses regexes instead of a proper SPARQL parser. It didn't work for your query because of the many nested braces. I fixed that bug and your query now works: https://qlever.cs.uni-freiburg.de/osm-planet/CULQMR . I have encountered that bug many times before, but never had the nerve to fix it. Now I did thanks to your request :-)
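Why nested braces defeat a regex-based rewriter can be shown with a small sketch: a non-greedy pattern stops at the first closing brace, while a depth counter finds the matching one. This is an illustration only, not the QLever UI code, and the counter still ignores braces inside string literals and comments, which is why a proper parser is the robust solution.

```python
import re


def matching_brace(text: str, open_pos: int) -> int:
    """Index of the '}' that closes the '{' at open_pos."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")


query = "WHERE { { { ?c a 1 } UNION { ?c a 2 } } UNION { ?c a 3 } ?c b ?n }"
start = query.index("{")

# Naive regex: ends at the first '}', cutting the group pattern short.
naive = re.search(r"\{.*?\}", query).group(0)

# Depth counting: returns the complete, balanced group pattern.
body = query[start:matching_brace(query, start) + 1]
```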

hannahbast avatar Jun 01 '22 23:06 hannahbast

Concerning your fourth query ("all highway road segments"): `osm2rdf` has various options to control which of these triples are produced: https://github.com/ad-freiburg/osm2rdf/blob/master/src/config/Config.cpp#L86-L110 . In the instances that currently run on https://qlever.cs.uni-freiburg.de, these triples have not been added. Do you need them or were you just curious?

hannahbast avatar Jun 01 '22 23:06 hannahbast

Concerning your fifth query ("all 16 states of Germany" on OSM Planet): I have changed the name of this example query :-)

hannahbast avatar Jun 01 '22 23:06 hannahbast

Good morning/afternoon ? (according to your 2am responses ...) @hannahbast

First of all, many thanks for the helpful comments and bug fixes already.

My questions have been mostly out of curiosity as I'm playing around with OSM data and GeoSPARQL stuff in my current work. That's why I read the paper and also then tried the examples registered in the QLever UI. So no need to hurry up or change your schedule, I'll take what I get when you have the time.

So, none of the queries reported here were created by me. Maybe it's worth omitting those from the UI then, or at least the fourth query, where the data isn't loaded and the query therefore always returns an empty result. It might be confusing for users otherwise.

Nice to hear that you're still in the process of improving and speeding up things, e.g. the literal value handling part. And I totally agree that WKT parsing should already happen at loading and indexing time. At least that's how most spatial indexes work; currently you can't make use of one unless you load the WKT literals into some spatial-index-like structure, e.g. ST Tree, R tree and similar. In the long term I think you could try to support a larger part of geospatial features, especially beyond containment checks - but be careful, it's a huge topic (personally I'd start with just the basic topological and non-topological simple functions).

Minor comment: please keep in mind that neither geof:latitude nor geof:longitude are part of the GeoSPARQL standard, from my point of view they simply "forgot" this in version 1.0 and even for the upcoming 1.1 they decided to omit those functions as there will be maxX and maxY then, both of which trivially cover lat/long of a point.

LorenzBuehmann avatar Jun 03 '22 06:06 LorenzBuehmann

@LorenzBuehmann Thanks + yes, I will clean up the example queries, they should indeed all work.

Where is the Swiss castle query from, that's not from us, or is it?

We included geof:latitude, geof:longitude, and geof:distance because the Wikidata Query Service uses them, as do many example queries there. So it would be a pity not to have them for our Wikidata instance.

And yes, GeoSPARQL is a bottomless pit. Our strategy will be to cover the basic stuff and to be fast at those things for which other engines are slow or which they don't manage at all. For example, PostgreSQL+PostGIS is very slow with contains queries. The only fast engine for that is Overpass, but that cannot be combined with anything else.

Another example: we have started working on the efficient visualization of large result sets. It always bothered me when Map UIs break down when you have more than a few thousand results to show. Here are all 17M trees in OpenStreetMap: https://qlever.cs.uni-freiburg.de/osm-planet/giNBGB . If you click on Map View++, you will get a nice view on all zoom levels and it reacts pretty fast given the large result set.

hannahbast avatar Jun 03 '22 08:06 hannahbast

@hannahbast

Where is the Swiss castle query from, that's not from us, or is it?

Sure it is: "All castles in Switzerland" Now this query is working, not sure if you fixed something in the meantime? Did you?

For example, PostgreSQL+PostGIS is very slow with contains queries.

Is it? In my very limited world this is supposed to be a fast geospatial database. Do you have some example and numbers in mind? Do you refer to point-in-polygon queries or even something beyond? Maybe you're right and for some cases it's just "good enough", but at least it covers a lot of features. And sometimes it needs investigation and tuning, as this nice example post shows: https://www.crunchydata.com/blog/performance-and-spatial-joins (Spoiler: GPU-assisted spatial join has been faster in comparison, but well - it needs some GPU(s))

Yep, know what you mean - UI becomes very sluggish. What is your approach? Looks like you merge points into clusters and make use of some heatmap feature of Leaflet?

LorenzBuehmann avatar Jun 08 '22 06:06 LorenzBuehmann

@LorenzBuehmann

Thanks for reminding me about the "All castles in Switzerland" query :-) It came from one of our users and since I found it interesting and challenging, I added it. So in this sense, it is both not from us (but from that user) and from us (because I added it as an example query, which I forgot in the meantime).

Thanks for the blog post. It speaks about a spatial join between a set of 9M parking violations and a set of 150 neighborhoods of Philadelphia. That is no big challenge: the neighborhoods are few and have a rather simple shape, so a simple R-tree will do the job: For each shape, get the parking violations (which are just points) in the bounding box of the shape, and for each of them compute whether they are really contained. That is a matter of seconds even without any parallelization. And since this algorithm is trivially parallelizable, you get a speedup of k if you use k cores and even more with special hardware like a GPU.
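The filter-and-refine approach described above can be sketched like this. A linear scan over the points stands in for the R-tree lookup, and the exact test uses ray casting; all names and the toy data are made up for this example.

```python
def bounding_box(polygon):
    """Axis-parallel bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Exact test via ray casting: count edge crossings to the right of the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, polygons):
    """For each polygon: filter by bounding box, then refine with the exact test."""
    result = {}
    for name, polygon in polygons.items():
        box = bounding_box(polygon)
        candidates = [p for p in points if in_box(p, box)]  # R-tree query in practice
        result[name] = [p for p in candidates if in_polygon(p, polygon)]
    return result


neighborhoods = {"triangle": [(0, 0), (4, 0), (0, 4)]}
violations = [(1, 1), (3, 3), (5, 5)]
joined = spatial_join(violations, neighborhoods)
```

The point (3, 3) passes the bounding-box filter but fails the exact test; the cost of such exact tests grows with the number of polygon edges.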

For OSM Planet, you have 1.5 billion geometric objects, some of them very complex, like the border of a country. For Germany alone, you have 118M objects and a lot of complex borders. For example, consider the relatively simple query "all post boxes in Baden-Württemberg", which QLever can handle easily: https://qlever.cs.uni-freiburg.de/osm-germany/zss8C5 . The result is 11,270 objects. PostgreSQL+PostGIS will do at least this many comparisons to the complex border of Baden-Württemberg (and some more comparisons of objects close to the border). We have tried it and it takes forever. Parallelization or GPUs will not solve this problem, it's simply too much computation.

And that is a simple query. Much harder ones are "all buildings in Baden-Württemberg": https://qlever.cs.uni-freiburg.de/osm-germany/sGlYuZ (very many comparisons against a complex shape) or "all level-6 administrative regions in Baden-Württemberg": https://qlever.cs.uni-freiburg.de/osm-germany/PPx8OQ (many expensive comparisons between complex shapes).

hannahbast avatar Jun 08 '22 12:06 hannahbast

Thank you for these questions and queries, Lorenz, I will reply to them one comment at a time.

Concerning your first query ("contains envelope"): This was realized using #413, which is based on an old version of QLever, where values used to be implemented very inefficiently as strings. In the meantime, QLever's handling of values has been completely refactored and is now much more efficient, see #648 and #650 (and some preparatory PRs). In a nutshell, values that fit into an 8-byte integer (that is, most values) are now represented directly in their ID instead of as strings like before.

Also note that since #638, QLever supports geof:latitude and geof:longitude for WKT points (example on Wikidata: all objects in a stripe around the 48°N latitude), as well as geof:distance (example on Wikidata: all objects in a 100km ring around Freiburg). These functions are not yet super fast because they parse the WKT strings at query time, but that will be relatively easy to change now with QLever's new value handling.

@joka921 How much do you consider it to adapt #413 to the current master? I would expect that much of the magic that we needed for #413 is not needed anymore because the values are now efficient out of the box already.

@LorenzBuehmann Is this a feature which you need urgently or were you just curious?

Hi there,

I am facing the same problem. So if the query ("contains envelope") relied on an implementation that has since been refactored, how can I run it on a recent QLever? Or do I need to go back to a certain older version of QLever, and if so, which one?

Thank you for help.

siwenyang avatar Dec 13 '22 03:12 siwenyang

@siwenyang Can you give an example of a query you wonder how to ask? As I wrote, there is now ogc:contains, ogc:intersects, geof:latitude, geof:longitude, and geof:distance, which can do all the previous stuff and much more.

hannahbast avatar Dec 16 '22 20:12 hannahbast

@hannahbast

Hi there, here is another issue I opened and has query code in: https://github.com/ad-freiburg/qlever/issues/844

Thank you for help!

siwenyang avatar Dec 17 '22 03:12 siwenyang

@hannahbast as far as I understand, it is not about using the materialized ogc:contains triples but about providing some polygon as envelope and checking for containment - that PR has never made it into the current QLever code nor the public instance as far as I can see. Or can you give some example queries to check for containment as well as intersecting geometries? I tried to use it in a filter, but it failed - assuming a WKT literal as bounding box like "POLYGON((12.285156250000004 51.9187631095413,13.647460937500004 51.9187631095413,13.647460937500004 51.0982474605854,12.285156250000004 51.0982474605854,12.285156250000004 51.9187631095413))"^^geo:wktLiteral I could not get the filter working.

Thanks in advance

LorenzBuehmann avatar Dec 18 '22 08:12 LorenzBuehmann

@LorenzBuehmann I think we should distinguish between two types of "contained in arbitrary given geometry" queries, and I have a question about both:

  1. Queries that ask for objects contained in an arbitrary given axis-parallel rectangle. These were described in our SIGSPATIAL'21 paper, as an approximation for exact "contained in region X" queries because at the time we couldn't do those efficiently yet. In the meantime, we can, and I wonder what the use case for "contained in arbitrary given axis-parallel rectangle" is then. We will implement this again eventually (it's not hard, just work), but haven't given it a high priority so far. Maybe we would if we understood the use cases better.

  2. Queries that ask for objects contained in an arbitrary given geometry specified by the user. These were not discussed in our SIGSPATIAL'21 paper and it would be a very different problem. Also here I wonder: what is the use case for such queries?

hannahbast avatar Dec 18 '22 20:12 hannahbast

Hi @hannahbast .

Well, for me any client asking only for an externally defined part of the whole world would be the most natural use case. Rectangle-based lookup might sound weird at first glance, but even Leaflet has the concept of tiles and might just want to render on demand. Another use case, which basically also covers non-rectangular polygons, would be whenever you make use of external datasets via the SPARQL SERVICE clause to get geometries, e.g. boundaries not present in the current dataset. In that case, you clearly have to compute e.g. containment on demand and can't make use of materialized geospatial relations. For example, I could use polygons from Wikidata that are not loaded in my QLever OSM dataset and vice versa.

As far as I know, most tools I've been working with use an R-Tree as index structure, and then the envelope of whatever geometry I use for querying is used to get the candidates from the R-Tree and in a secondary step only on those retrieved objects intersection, containment etc. is computed to get the correct result.

LorenzBuehmann avatar Dec 19 '22 07:12 LorenzBuehmann

@siwenyang Can you give an example of a query you wonder how to ask? As I wrote, there is now ogc:contains, ogc:intersects, geof:latitude, geof:longitude, and geof:distance, which can do all the previous stuff and much more.

@hannahbast Hi, could you also kindly provide a workable example about ogc:intersects? Thank you so much!

siwenyang avatar Dec 21 '22 07:12 siwenyang

I tried to do an updated recap of this long thread:

| Name | Implemented? | Efficiency | geoSPARQL | Domain/Range |
|---|---|---|---|---|
| ?x ogc:sfContains ?y | Yes | Good | no | osm2rdf features |
| ?x ogc:sfIntersects ?y | Yes | Good | no | osm2rdf features |
| FILTER + geof:distance (?x, ?y, ?units) | No | | 8.9.15 | ogc:geomLiteral |
| FILTER + geof:distance (?x, ?y) | Only points | Bad | no | geometry |
| ?x geo:sfIntersects ?y | No | | 7.2 | geo:SpatialObject |
| ?x geo:sfWithin ?y | No | | 7.2 | geo:SpatialObject |
| ?x geo:sfContains ?y | No | | 7.2 | geo:SpatialObject |
| ?x geo:sfOverlaps ?y | No | | 7.2 | geo:SpatialObject |
| FILTER + geof:latitude (?x) | Yes | Average | no | geometry (only points) |
| FILTER + geof:longitude (?x) | Yes | Average | no | geometry (only points) |

Now, I'm currently trying to check whether an osmrel:*/osmway:*/osmnode:* element (not necessarily a point) is contained in a bounding box, but so far I'm able to do so only with points. ogc:sf* works only between OSM elements (not with arbitrary geometries) and geof:distance/geof:latitude/geof:longitude work only with points. If I understand correctly, this answer from the comment above referred exactly to this use case:

  1. Queries that ask for objects contained in an arbitrary given axis-parallel rectangle. These were described in our SIGSPATIAL'21 paper, as an approximation for exact "contained in region X" queries because at the time we couldn't do those efficiently yet. In the meantime, we can, and I wonder what the use case for "contained in arbitrary given axis-parallel rectangle" is then. We will implement this again eventually (it's not hard, just work), but haven't given it a high priority so far. Maybe we would if we understood the use cases better.

Looking at the paper, if I understand correctly you are referring to osm2rdf:contains_nonarea, but it's meant to link contained elements with pre-defined rectangles, not any arbitrary rectangle, and in fact my experiments using this predicate didn't work: https://qlever.cs.uni-freiburg.de/osm-planet/54NRUT .

Anyway, in general, has any predicate/function been implemented to this time that would allow bbox spatial queries?

My specific use case would be to fetch only the elements that have to be rendered in the portion of a map that is shown on screen, and using the latitudes and longitudes at the borders of the screen is the most natural lookup method. More generally, filtering objects by a bbox is the most typical way to query OSM data, for example through other query engines like Overpass or through map tiles.
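The viewport use case can be sketched as an envelope test: compute each element's envelope from its coordinates and compare it with the screen's bounding box. This is a client-side illustration with hypothetical names and data, not an existing QLever function.

```python
def envelope(coords):
    """Envelope (min_lon, min_lat, max_lon, max_lat) of (lon, lat) pairs."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return min(lons), min(lats), max(lons), max(lats)


def within(inner, outer):
    """True if envelope `inner` lies completely inside envelope `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def intersects(a, b):
    """True if the two envelopes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def visible_elements(elements, viewport, fully_contained=False):
    """IDs of elements whose envelope intersects (or lies inside) the viewport."""
    test = within if fully_contained else intersects
    return [eid for eid, coords in elements.items() if test(envelope(coords), viewport)]


# Made-up elements: a node, a short way, and a way crossing the viewport border.
elements = {
    "node_1": [(12.5, 51.5)],
    "way_1": [(12.4, 51.2), (12.6, 51.3)],
    "way_2": [(13.5, 51.4), (13.9, 51.6)],
}
viewport = (12.285, 51.098, 13.647, 51.919)
partly = visible_elements(elements, viewport)
fully = visible_elements(elements, viewport, fully_contained=True)
```

An envelope test is only a filter for non-rectangular geometries, but for rendering map tiles an intersecting envelope is usually a sufficient criterion.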

Danysan1 avatar Dec 22 '23 00:12 Danysan1

I think we need to keep track of Sophox vs QLever OSM mapping. The current Sophox mapping (How OSM data is stored).

nyurik avatar Jan 18 '24 03:01 nyurik

@nyurik Can you briefly explain how you get the latest data? We currently fetch planet-latest.osm.pbf from https://planet.openstreetmap.org/pbf/ whenever there is a new version. But unfortunately that file isn't updated very frequently. For example, as of today (18.01.2024), the latest version is from 12.01.2024, so almost a week old.

And I agree, we should keep track of how we map data to RDF. And ideally use the same mapping.

hannahbast avatar Jan 18 '24 03:01 hannahbast

All the code is in https://github.com/Sophox/sophox/tree/main/osm2rdf -- that tool subscribes to minutely updates, and generates either TTL files (from full dump) or SPARQL insert statements (from updates)

nyurik avatar Jan 18 '24 04:01 nyurik

@nyurik Thanks! One quick question: how do you keep track of:

  1. From what point in time on you want updates
  2. Which updates you have already processed

I am assuming that your code also works if it's not constantly running.

BTW, I wasn't aware until now that your converter is also called osm2rdf. When we started our work several years ago, we were aware of Sophox and also looked at the code and ran it. But I don't remember seeing anything named osm2rdf.

hannahbast avatar Jan 18 '24 04:01 hannahbast

i had it named osm2rdf from the start, and it was part of the https://github.com/Sophox/sophox/tree/main/osm2rdf repo - I think i wrote most of it 6 years ago :)

I store the sequence number in the index itself - https://github.com/Sophox/sophox/blob/main/osm2rdf/RdfUpdateHandler.py#L93 - that sequence value is what the osmosis code and its python wrapper use to manage updates. Sadly, there is no well defined algorithm as far as i know to convert a timestamp to a sequence number (it should be possible though). A better path forward IMO would be to work together on the rust version of osm2rdf that already handles most of the rapid TTL/SPARQL generation, and just needs to gain a way to determine which update files (first daily, followed by hourly / minutely) to get
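A minimal sketch of sequence-based bookkeeping: given the last processed sequence number and the latest one available, list the update files that still have to be applied. The nine-digit, three-level path layout is the convention used by OSM replication servers as far as I know; treat it and the function names as assumptions, not as the Sophox code.

```python
def sequence_to_path(sequence: int) -> str:
    """Map a sequence number to a replication file path, e.g. 5961234 -> 005/961/234.osc.gz."""
    digits = f"{sequence:09d}"
    return f"{digits[0:3]}/{digits[3:6]}/{digits[6:9]}.osc.gz"


def pending_updates(last_processed: int, latest_available: int) -> list[str]:
    """Paths of all update files after `last_processed`, in the order to apply them."""
    return [sequence_to_path(s) for s in range(last_processed + 1, latest_available + 1)]
```

Because the stored number identifies the last applied file, the process can stop and resume later without tracking timestamps.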

nyurik avatar Jan 18 '24 04:01 nyurik

@nyurik Working together sounds good. One focus of our tool was getting the full geometry for each OSM object (currently as a WKT literal, but this could also be in any other format, for example, WKB). Which is non-trivial to make efficient because you need to store all the point locations (very many) and then gather the ones you need for each object.
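The gathering step can be sketched as follows: keep a table of node locations and, for each way, look up its node references to assemble the WKT literal. The in-memory dictionary and the names are simplifications for illustration; at planet scale the node table is exactly what makes this expensive.

```python
def way_to_wkt(node_ids, node_locations):
    """Build a WKT LINESTRING for a way from its node references.

    node_locations maps node ID -> (lon, lat).
    """
    coords = [node_locations[node_id] for node_id in node_ids]
    if len(coords) < 2:
        raise ValueError("a way needs at least two nodes")
    body = ", ".join(f"{lon} {lat}" for lon, lat in coords)
    return f"LINESTRING({body})"


# Made-up node table and way.
node_locations = {1: (7.8, 48.0), 2: (7.9, 48.1), 3: (8.0, 48.2)}
wkt = way_to_wkt([1, 2, 3], node_locations)
```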

On https://sophox.org/ only the centroids of each object are available (via the osmm:loc predicate). Have you thought about providing the full geometries as well or is this already part of your new Rust-based osm2rdf?

hannahbast avatar Jan 18 '24 04:01 hannahbast

full geometry is fairly simple to do in the osm2rdf, but not so easy in the index itself. Blazegraph doesn't support it (afaik), that's why i simply drop it, but it should be relatively easy to keep them, at least for ways. So if QLever supports it -- sure, osm2rdf can keep it. Please join OSM-US slack -- we can chat there (same nick, or #sophox channel)

nyurik avatar Jan 18 '24 04:01 nyurik