Add new dedup feature based on name and location
Background
Many elements share the same name and are close to each other but have a different hierarchy. When we import both WOF and Geonames, some duplicate items appear and the current API does not remove them. Sometimes what matters most is the label and the coordinates, not the other properties.
What I did
For this, I added a new query parameter, dedupe=geo, to remove nearby items. It marks as duplicates all items that have the same name and are less than 1 km apart.
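To make the rule concrete, here is a minimal sketch of "same name and less than 1 km apart". It is not the code from this PR: the function names (`haversineKm`, `dedupeGeo`), the flat `{ name, lat, lon }` result shape, and the sample coordinates are assumptions made for illustration only, and the distance uses the haversine formula on a spherical Earth.

```javascript
// Hypothetical sketch, not the actual Pelias dedupe middleware.
const EARTH_RADIUS_KM = 6371;  // mean Earth radius (spherical approximation)
const MAX_DISTANCE_KM = 1;     // threshold taken from the PR description

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two { lat, lon } points, in kilometres.
function haversineKm(a, b) {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) *
    Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Keep the first occurrence of each name; drop later results with the
// same name that lie within MAX_DISTANCE_KM of an already kept result.
function dedupeGeo(results) {
  const kept = [];
  for (const candidate of results) {
    const isDuplicate = kept.some(
      (k) =>
        k.name === candidate.name &&
        haversineKm(k, candidate) < MAX_DISTANCE_KM
    );
    if (!isDuplicate) {
      kept.push(candidate);
    }
  }
  return kept;
}

// Illustrative data: coordinates are approximate and made up for the example.
const sample = [
  { name: 'Amiens', lat: 49.894, lon: 2.295 },
  { name: 'Amiens', lat: 49.8942, lon: 2.2958 },
  { name: 'Amiens', lat: 49.0, lon: -105.0 },
];
const deduped = dedupeGeo(sample);
```

Because results are compared in order, the highest-ranked item of each cluster is the one that survives. The comparison is quadratic in the number of results, which seems acceptable for the small result pages the API returns.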
I'm really bad at naming (parameters/variables/constants...) :sweat_smile:
Another PR we forgot, sorry!
This could be really useful. Honestly, I don't think it would help much with admin areas between WOF and Geonames, as they're often off by many km.
However it could be a really nice additional guard against incorrectly deduplicating addresses, and it would allow us to expand our dedupe logic to cover more cases (like common abbreviations as your test case mentions). So maybe we rework this to only apply to addresses, and then we can drop the additional parameter?
Yes, this is not only for WOF and Geonames dedup; it's for a specific use case where the name and the position matter most.
For example, say you are the Uniqlo website (I don't know why I chose this one :sweat_smile:). You have a ton of stores in your country, and you ask your customer to type where they are to find the nearest shop. The customer is in Amiens, France. The answer is:
1) Amiens, France
2) arrondissement d'Amiens, France
3) Amiens, France
4) Amiens, SK, Canada
5) Amiens, QL, Australie
There are two Amiens, a localadmin and a locality (whosonfirst:localadmin:1159322001 and whosonfirst:locality:101901021). Both are close, and the user will wonder why there are two Amiens in the same place...
This can also happen with a venue that has the same name as the city, for example...
In fact, this PR removes duplicates caused by a wrong WOF hierarchy. For example, this test was fixed with pelias/whosonfirst#471: https://github.com/pelias/api/blob/31dbbd6fd5f5bf02baa6147969ea057e84b5f544/test/unit/middleware/dedupe.js#L643
The Amiens example is also caused by a wrong WOF data hierarchy. That's why I'm not as sure about this feature as I was :thinking:
Address dedup will be painful and fun. I have a good example here: 20 Avenue de la République, Paris, where Av <=> Avenue.
For the record:
1) 20 Av De La Republique, Paris, France (openaddresses:address:fr/paris:3cd011103561aea8)
2) 20 Avenue de la République, Paris, France (openstreetmap:address:node/1225597358)
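As a rough illustration of why these two labels are hard to match, and of what a name comparison would need before any distance check, here is a small sketch. The `normalizeLabel` helper and the one-entry abbreviation table are hypothetical and not part of Pelias. The sketch only folds case and accents and expands the single abbreviation from this example.

```javascript
// Hypothetical sketch: the abbreviation table is deliberately tiny and
// would need real, per-language data to be useful.
const ABBREVIATIONS = new Map([['av', 'avenue']]);

function normalizeLabel(label) {
  return label
    .normalize('NFD')                 // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toLowerCase()
    .split(/[\s,]+/)                  // tokenize on whitespace and commas
    .filter(Boolean)
    .map((token) => ABBREVIATIONS.get(token) || token)
    .join(' ');
}

const openaddressesLabel = normalizeLabel('20 Av De La Republique, Paris, France');
const openstreetmapLabel = normalizeLabel('20 Avenue de la République, Paris, France');
const sameAddress = openaddressesLabel === openstreetmapLabel;
```

With this normalization both labels reduce to the same string, so the name check would treat them as equal. Combining it with a distance guard would help avoid merging identically named addresses in different places.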
Hi there, I did some checks for this PR and it is still useful.
My example is Place de la Bastille: I have 3 Place de la Bastille entries in my results and they are close to each other.
0) Place de la Bastille, Paris, France
1) Place de la Bastille, Paris, France
2) Place de la Bastille, Paris, France
3) Place de la Contrescarpe, Paris, France
4) Place de la Nation, Paris, France
5) Place de la Nation, Paris, France
6) Place de la Sorbonne, Paris, France
7) Place de la Réunion, Paris, France
8) Place de la République, Paris, France
9) Place de la République, Paris, France
10) Place de la République, Paris, France
11) Place de la Coupole, Charenton-le-Pont, France
12) Place de la Bourse, Paris, France
13) Place de la Fraternité, Montreuil, France
14) Place de la République, Montreuil, France
15) Place de la Garenne, Paris, France
16) Place de la République, Le Kremlin-Bicêtre, France
17) Place de la Concorde, Paris, France
18) Place de la Concorde, Paris, France
19) Place de la Madeleine, Paris, France
With this PR the result is:
Place de la Bastille, Paris, France
Place de la Contrescarpe, Paris, France
Place de la Nation, Paris, France
Place de la Sorbonne, Paris, France
Place de la Réunion, Paris, France
Place de la République, Paris, France
Place de la Coupole, Charenton-le-Pont, France
Place de la Bourse, Paris, France
Place de la Fraternité, Montreuil, France
Place de la République, Montreuil, France
Place de la Garenne, Paris, France
Place de la République, Le Kremlin-Bicêtre, France
Place de la Concorde, Paris, France
Place de la Madeleine, Paris, France
I'm open to feedback (name for the query parameter...)
My point of view on this PR has changed: I think updating the label is often better than merging results by location. I will close the PR for now.