cosmogony
Ontology starting point
This issue starts the discussion about the schema of our zone "hierarchy".
The aim is to fill in the relevant section of the README.
Here are my unstructured thoughts:
categories
I like libpostal's categories. libpostal is quite a reference in the address-parsing world, so we can hope its categories handle country specificities all around the world, but I don't think they cover all the corner cases (and it's not the only categorization out there; Wof, for example, uses another).
libpostal does not handle non-administrative regions apart from the `suburb` (and maybe the `country_region`). So it would be difficult to represent Marne-la-Vallée or the Parc du Mercantour.
There is also the question of postal codes. I don't know whether we could/should have postal code zones in the hierarchy (should we create a separate issue for this?)
Pyramidal hierarchy or graph-based?
Can a zone have at most one parent, or can it have several?
I feel that having a pyramidal hierarchy might be a failing of Wof. I don't think being able to have several parents will complicate cosmogony that much. I don't think it's useful for purely administrative regions (but maybe there are countries where it's relevant), but for non-administrative regions I think a pyramidal hierarchy will be too restrictive.
Eg. what would we link Marne-la-Vallée to? Île-de-France? But then it would be difficult to link it back to the cities that are part of it. The same applies to non-official suburbs that can span several districts.
links coherence
The Wof hierarchy is nice, but being linked to all parents brings incoherence (like the French empire that contains the country France, while the empire has fewer descendants than the country). I feel that outputting only the first level of relationships forces the dataset to be coherent (even if it makes the dataset harder to use without tools).
more thoughts on the subject:
I feel that a rigid tree hierarchy works well for administrative regions.
A `suburb` can have at most one `city_district` (or none, and be linked directly to a `city`), and the same holds along the chain `city` -> `state_district` -> `state` -> `country_region` -> `country`.
The categories are optional (maybe apart from `country`), and we can imagine places where the children of a country are heterogeneous (cities, states, ...).
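To make the "optional levels" idea concrete, here is a minimal sketch. It assumes the libpostal-style category names quoted above and an ordering from most local to most global; the helper names are hypothetical and not part of cosmogony.

```python
from typing import List

# Category names as quoted in this thread, ordered from most local to most
# global. The ordering itself is an assumption made for this sketch.
LEVELS = ["suburb", "city_district", "city", "state_district",
          "state", "country_region", "country"]
RANK = {name: index for index, name in enumerate(LEVELS)}


def is_valid_parent(child_level: str, parent_level: str) -> bool:
    """A parent must sit strictly higher in the chain; skipping levels is fine."""
    return RANK[parent_level] > RANK[child_level]


def chain_is_valid(levels: List[str]) -> bool:
    """Check a child-to-root chain such as ["suburb", "city", "country"]."""
    return all(is_valid_parent(child, parent)
               for child, parent in zip(levels, levels[1:]))
```

With this rule a `suburb` attached directly to a `city` is accepted, while a `city` attached to a `suburb` is rejected.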
This model does not, however, work for at least 2 examples (feel free to add counterexamples, I'm sure there are more):
- A non-administrative region that groups other administrative regions
  - eg. Marne-la-Vallée, a group of cities in France. It has no administrative meaning, but is well known by locals.
- A non-administrative region that intersects many admins
  - eg. Le Marais, a French tourist zone that spans parts of 2 Paris districts
  - eg. La Défense, near Paris, spans parts of 5 cities.
  - it can also happen with non-official neighborhoods that cross several districts
  - neither the non-administrative zone nor the administrative zone contains the other, they just intersect.
Rough idea on how to handle those
I think it's nice for any zone (administrative or not) to have administrative parents, as it's helpful to know that Marne-la-Vallée is part of Île-de-France, thus part of France.
One idea would be, for any zone to have:
- at most one administrative region as parent
- some soft links to non-administrative regions
As a starting point, I think we can use the same algorithm for both relationships: a zone needs to be entirely included in an administrative region's shape to be part of it, and entirely included in a non-administrative region's shape to be linked to it.
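The two rules above could be sketched as follows. This is only an illustration: shapes are simplified to axis-aligned boxes instead of real polygons, the "smallest containing administrative zone becomes the parent" rule is my assumption, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Simplification for the sketch: a shape is a box (min_x, min_y, max_x, max_y).
Box = Tuple[float, float, float, float]


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies entirely inside `outer` (borders included)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


@dataclass
class Zone:
    name: str
    shape: Box
    administrative: bool
    parent: Optional[str] = None                          # at most one admin parent
    soft_links: List[str] = field(default_factory=list)   # non-admin containers


def link_zones(zones: List[Zone]) -> None:
    """Fill `parent` and `soft_links` using strict inclusion only."""
    for zone in zones:
        containers = [other for other in zones
                      if other is not zone and contains(other.shape, zone.shape)]
        admins = [other for other in containers if other.administrative]
        if admins:
            # Assumption: the smallest containing admin zone is the direct parent.
            zone.parent = min(admins, key=lambda other: area(other.shape)).name
        zone.soft_links = sorted(other.name for other in containers
                                 if not other.administrative)
```

A zone that merely overlaps another gets no link at all, which matches the intent that intersection alone creates no relationship.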
What does that mean?
Marne-la-Vallée
All the cities of Marne-la-Vallée are soft linked to it but are part of Seine-et-Marne.
Le Marais
There is no soft link between Le Marais and either the 3rd or the 4th Paris district, because there is no inclusion; Le Marais is just part of Paris.
implication for the use cases
attaching zones to a point
To attach zones to a point (like we need to do in a geocoder), we'll search for all the leaf zones that contain the point.
As a first implementation, we can even just search for all the zones that contain the point and keep only the leaves (so the lowest-level admin + all unrelated non-administrative zones).
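The "first implementation" described above might look like the sketch below. It assumes box-shaped zones, a hypothetical dictionary layout, and my reading of "leaf": a matching zone that no other matching zone points to, either as its parent or through a soft link.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)
Point = Tuple[float, float]


def covers(box: Box, point: Point) -> bool:
    """True when the point falls inside the box (borders included)."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


# Hypothetical data: a city split in two districts, plus a tourist area that
# straddles the district border and is therefore only attached to the city.
ZONES: Dict[str, dict] = {
    "city": {"shape": (0, 0, 10, 10), "parent": None, "soft_links": []},
    "district_3": {"shape": (0, 0, 5, 10), "parent": "city", "soft_links": []},
    "district_4": {"shape": (5, 0, 10, 10), "parent": "city", "soft_links": []},
    "tourist_area": {"shape": (4, 4, 6, 6), "parent": "city", "soft_links": []},
}


def leaf_zones_at(zones: Dict[str, dict], point: Point) -> List[str]:
    """All zones containing the point, minus those another hit links to."""
    hits = {name for name, zone in zones.items() if covers(zone["shape"], point)}
    linked_to = set()
    for name in hits:
        if zones[name]["parent"] is not None:
            linked_to.add(zones[name]["parent"])
        linked_to.update(zones[name]["soft_links"])
    return sorted(hits - linked_to)
```

For a point inside both `district_3` and `tourist_area`, the function returns both of them and drops `city`, which stays reachable through their `parent` field.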
finding the most meaningful zone for a point
I don't know :wink:
limitations
- we feel that postal code zones (are postal code zones relevant anyway?) might fit in the proposed model, but we're not sure. As we're not sure we need them anyway, I don't feel this is very important
- soft links might not be very useful (at least from the geocoder's point of view)
- a library is needed to extract info (like get all the cities of France, get the hierarchy of Le Marais, get all the cities that are part of Marne-la-Vallée), thus the raw dataset is a bit harder to use
Soft links
I like the idea of having a strong representation that everybody can hang onto, and something less organized for local, specific things.
Post codes
A post code could be a soft zone, right? If we ever get a shape of those (that might be a problem in many places of the world) then you could be precise, otherwise a list should be sufficient.
Local knowledge
As you mentioned, le Marais is better known by the locals than the administrative quarters.
The problem here is that there is no precise border, only a fuzzy one.
An attribute on the zone could help here, even if we don’t know how to use the data (or represent it on OpenStreetMap).
Broad international agreements
It would also be nice to be able to represent the Schengen zone.
External knowledge
The airport of Paris is not in Paris, and tourists think that the Château de Versailles is in Paris.
This would mean that the ontology has a notion of who's asking?
Can it be a tree?
With strict inclusions, we will have a tree :palm_tree:, which is nice.
The obvious
The lowest level must be variable. The commune of Paris is divided into arrondissements, and each arrondissement is divided into quartiers administratifs. For most French communes, the commune itself is the lowest subdivision.
Useless trivia: Google believes there is a :banana: quarter in Paris. It is unknown to the inhabitants and not an official one either https://encrypted.google.com/search?hl=en&q=quartiers%20la%20banane%20paris
The easy
When going from the lowest to the highest level, the system needs to allow holes. For instance, the commune of Nantes belongs to the Métropole of Nantes, but not every commune belongs to a Métropole.
I don’t think it is a problem.
Useless trivia : this island is under direct authority of a ministry, with no intermediate administration https://en.wikipedia.org/wiki/Clipperton_Island
The challenge
Åland belongs to Finland. Finland belongs to the European Union, yet Åland does not belong to the European Union (yay! cheap booze :champagne:)
For the vast majority of situations, this can be ignored. Maybe it could be handled with an explicit exception once the big work is done.
There seem to be surprisingly few situations of that kind https://en.wikipedia.org/wiki/Dependent_territory
The pain
A territory can be under the sovereignty of two countries, like https://en.wikipedia.org/wiki/Pheasant_Island
Ok. I think we can ignore this one.
If I might be so bold, the 🍌 area actually does exist and is known to at least some inhabitants.
This could be a typical example of how different people view the same area differently.
Indeed, I should not take my ignorance as a general rule. I would be curious to know where Google's data comes from.
Regarding the tree structure, not sure it works. It can be a DAG though I think.
Take postal codes for example. In France you will have, potentially, several communes sharing a single postcode, but in the UK, many cities have more than one postcode. If you want to handle this worldwide, you need a separate branch for postcodes, distinct from the admin one, imo.
Hmm, for postal codes, don't you think soft links (so outside the official hierarchy) would be enough?
You're right, tree vs. DAG is a really important question; we need to think about it carefully.
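To help frame the tree vs. DAG question, here is a small sketch that checks both properties on a parent map. The data layout (zone name mapped to its list of direct parents) and the function names are assumptions made for illustration only.

```python
from typing import Dict, List

VISITING, DONE = 1, 2


def has_single_parents(parents: Dict[str, List[str]]) -> bool:
    """Tree-like condition: every zone has at most one direct parent."""
    return all(len(direct) <= 1 for direct in parents.values())


def is_acyclic(parents: Dict[str, List[str]]) -> bool:
    """DAG condition: following parent links never loops back to a zone."""
    state: Dict[str, int] = {}

    def visit(node: str) -> bool:
        if state.get(node) == DONE:
            return True
        if state.get(node) == VISITING:
            return False          # we came back to a zone still being explored
        state[node] = VISITING
        ok = all(visit(parent) for parent in parents.get(node, []))
        state[node] = DONE
        return ok

    return all(visit(node) for node in parents)
```

If a commune is given both an administrative parent and a postcode parent, `has_single_parents` fails while `is_acyclic` still holds, which is the DAG case. Keeping postcodes as soft links instead would leave the administrative part a tree.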
Some thoughts concerning Wikidata.
It is a database closely linked to Wikipedia defining semantic relations between objects.
The licence is CC0, so that won’t be a problem.
With OSM as the geographic leg and Wikidata as the semantic one, Cosmogony should be able to have all the information it needs.
Stable ID
First obvious benefit: the ID will probably be much more stable than OSM elements or even Wikipedia pages.
It handles the history of elements, meaning that an ID will not be recycled for a new object (e.g. two communes that merge).
Paris will always be Q90.
The Wikidata ID should be in the OSM object tags. We should run a batch job to get an order of magnitude of the number of admin objects without a Wikidata ID.
Higher confidence when building the hierarchy
There is already a hierarchy, through property P131, which indicates that an entity belongs to a larger zone.
This could avoid some wrong hierarchies that would be only detected through geographical inclusion (simplified borders, weird enclaves…)
Contribute to a good database
Wikidata has a good chance of remaining active over time. Any handmade fix will therefore stay there for good and will help improve the commons.
This will reduce the need of adhoc databases.
Manipulating the data
The dump is 20 GB in size. This will be a problem for someone working on a small territory.
My guess is that it will be very easy to generate a subset that focuses on the admin regions.
I also had a look at Wikidata as a potential source of information to build the hierarchy. I agree stable IDs would be useful.
P131 is promising, but does not seem so easy to use. Its definition is unclear and I can easily find inconsistencies in the data. See Quimper (Q342):
- it's linked to both the Arrondissement and the Département: this information is redundant since P131 is supposed to be transitive (but that's maybe a minor issue)
- it's also linked to 3 different Cantons, which obviously do not contain the city, and are not even subparts of it (as they include other communes).
A similar issue is visible with Marne-la-Vallée (Q1886380) (we really like that example ^^):
3 departments are listed under the P131 property, although Marne-la-Vallée just overlaps a part of each of them.
Note: here is the SPARQL query I used to find geographical entities with multiple P131 statements.
Anyway, we have two candidate approaches to build the hierarchy (geographical inclusion and wikidata). We may choose one, and create some QA checks or additional tools to test our data against the other one.
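One possible shape for such a QA check, as a rough sketch: it takes the parents found by geographical inclusion and the P131 claims as plain dictionaries (a hypothetical layout, with placeholder zone names rather than real Wikidata IDs) and reports every claim that inclusion does not confirm.

```python
from typing import Dict, List, Optional, Set


def ancestors(zone: str, geo_parent: Dict[str, Optional[str]]) -> List[str]:
    """Walk up the inclusion-based hierarchy, from direct parent to root."""
    found: List[str] = []
    current = geo_parent.get(zone)
    while current is not None and current not in found:
        found.append(current)
        current = geo_parent.get(current)
    return found


def p131_mismatches(geo_parent: Dict[str, Optional[str]],
                    p131: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each zone, list the P131 claims that are not inclusion ancestors."""
    report: Dict[str, List[str]] = {}
    for zone, claimed in p131.items():
        unconfirmed = sorted(set(claimed) - set(ancestors(zone, geo_parent)))
        if unconfirmed:
            report[zone] = unconfirmed
    return report
```

Because ancestors are followed transitively, a redundant claim such as commune -> département is accepted, while a claim pointing at a zone that does not contain the commune is flagged for review.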