
Deduplication issue with transit stops

Open bboure opened this issue 3 years ago • 5 comments

Hi there,

I encountered an issue with the dedupe strategy. Autocomplete/search for "Manneken Pis" does not return the little peeing guy I am expecting. Instead, only the bus stop is returned.

After some debugging, I found that they are considered duplicates:

  • they have the same layer and parent hierarchy
  • they both have the same name in some languages
  • the statue node does not have an address or postal code
  • the bus stop does have an address and postal code (if one record lacks an address, the addresses are treated as equal)

Then, the bus stop is preferred because it has a zipcode.
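
To make the behavior described above concrete, here is a minimal sketch of the checks as they are listed in this comment. It is not the actual Pelias code: the record shape, the field names, the sample values and the helper names (`sameParents`, `sharesName`, `addressesCompatible`, `isDuplicate`, `pickPreferred`) are all assumptions made for illustration.

```javascript
// Simplified, hypothetical records. Field names and values are placeholders,
// not the real Pelias document schema.
const statue = {
  layer: 'venue',
  parent: { locality: 'Brussels' },
  names: { default: 'Manneken Pis' },
  address: {} // no street, no postal code
};

const busStop = {
  layer: 'venue',
  parent: { locality: 'Brussels' },
  names: { default: 'Manneken Pis' },
  address: { street: 'Example Street', zip: '0000' } // placeholder values
};

// True if every parent field has the same value on both records.
function sameParents(a, b) {
  const keys = new Set([...Object.keys(a.parent), ...Object.keys(b.parent)]);
  return [...keys].every((k) => a.parent[k] === b.parent[k]);
}

// True if the records share a name in at least one language.
function sharesName(a, b) {
  return Object.keys(a.names).some((lang) => a.names[lang] === b.names[lang]);
}

// An address field that is missing on either side is treated as a match.
function addressesCompatible(a, b) {
  return ['street', 'zip'].every((k) =>
    a.address[k] === undefined ||
    b.address[k] === undefined ||
    a.address[k] === b.address[k]
  );
}

function isDuplicate(a, b) {
  return a.layer === b.layer &&
    sameParents(a, b) &&
    sharesName(a, b) &&
    addressesCompatible(a, b);
}

// Of two duplicates, keep the one that has a postal code.
function pickPreferred(a, b) {
  return !a.address.zip && b.address.zip ? b : a;
}

console.log(isDuplicate(statue, busStop));               // true
console.log(pickPreferred(statue, busStop) === busStop); // true
```

Under these assumptions the statue and the bus stop collapse into one result, and the bus stop wins only because it carries a postal code.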

This issue can possibly happen many times, as bus/metro/train stops often have the name of a nearby famous venue.

Suggestion: Should we add a dedupe rule, maybe on category and/or addendum? Venues with different categories should not be considered duplicates. However, this could produce real duplicates, since venues in OSM are often duplicated and do not necessarily have the same tags.

Alternatively, another solution could also involve popularity. In this case, the statue has a higher popularity than the bus stop. But if you are actually looking for the bus stop, then this does not work either.

Any idea?

Thanks

https://pelias.github.io/compare/#/v1/autocomplete?focus.point.lat=50.843183&focus.point.lon=4.371755&text=Manneken+pis&debug=0

bboure avatar Jul 06 '20 15:07 bboure

Hi @bboure, thanks for another well-researched and well-described issue. I have actually seen the exact same behavior, specifically affecting transit stops. Transit is a case where it is important to return the transit stop itself, not merely another record of the same name, even one very nearby, so we should probably fix this.

I agree with you that the best solution is probably to not consider records duplicates if they have different category values.

Deduplicating based on addendum data is an interesting idea. I could see it leading to a lot of noise, but also being useful, especially for custom data brought in through the csv-importer.

If you want to add logic to consider records with different category values ineligible for deduplication, I think we would gladly accept that PR.

@missinglink any thoughts here?

orangejulius avatar Jul 07 '20 16:07 orangejulius

Yeah agreed with what you both said, some thoughts..

  • We should return both places, not just one, in this case
  • I'd prefer if the addendum wasn't used in any business logic, it wasn't designed for that and the code could turn out messy if everyone starts using it to if/else things.
  • The categories feature should really be more widely documented and used, because...
  • The venue layer by itself doesn't adequately describe the diversity of things which are contained in that layer, for this we need a taxonomy (categories)

missinglink avatar Jul 07 '20 17:07 missinglink

Dang, couldn't have said it any better, each of those points is spot on

orangejulius avatar Jul 07 '20 17:07 orangejulius

Thank you both for your feedback. I'll open a PR.

Question: When should we consider both records as duplicates? Should the categories be completely equal (arrays of the same length and same content), or should we consider that if any of the categories match, they are duplicates?

And what if one venue's category list is empty?
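
The two options in the question can be compared side by side with a small sketch. This is only an illustration of the trade-off, not a proposed implementation: `exactMatch` and `anyOverlap` are hypothetical helper names, and the category strings are made-up examples rather than values taken from the Pelias taxonomy.

```javascript
// Option 1: the category lists must contain exactly the same values
// (order and repeated entries are ignored).
function exactMatch(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  return setA.size === setB.size && [...setA].every((c) => setB.has(c));
}

// Option 2: a single shared category is enough.
function anyOverlap(a, b) {
  const setB = new Set(b);
  return a.some((c) => setB.has(c));
}

// Illustrative category lists (hypothetical values).
const statueCategories = ['attraction'];
const busStopCategories = ['transport', 'transport:bus'];
const otherBusStopCategories = ['transport'];

console.log(exactMatch(busStopCategories, otherBusStopCategories)); // false
console.log(anyOverlap(busStopCategories, otherBusStopCategories)); // true
console.log(anyOverlap(statueCategories, busStopCategories));       // false

// The empty case behaves differently under each option.
console.log(exactMatch([], []));                // true
console.log(anyOverlap([], busStopCategories)); // false
```

With the overlap rule as written, a record without categories can never match anything, so it would never be deduplicated unless the empty case is handled explicitly. The exact rule treats two empty lists as equal but an empty list and a non-empty list as different.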

bboure avatar Jul 08 '20 07:07 bboure

@orangejulius @missinglink I had some additional thoughts about something that may be a bit out of scope here, but related.

In the case where two records are considered the same and both come from OSM, should the isPreferred() function prefer ways over relations, and relations over nodes?

I have seen some places duplicated in osm, and generally one of the duplicates is an old node that was replaced by a relation or a way.
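
The ordering suggested here could look roughly like the sketch below. It assumes that the OSM element type can be read from the start of a record's `source_id` (for example `way/123`); the helper names `osmType` and `preferByOsmType` and the rank table are hypothetical and not part of the existing isPreferred() function.

```javascript
// Higher rank wins: ways over relations, relations over nodes.
const OSM_TYPE_RANK = { way: 3, relation: 2, node: 1 };

// Extract the element type from an id such as "way/123" (assumed format).
function osmType(record) {
  return String(record.source_id).split('/')[0];
}

// Return the record with the higher-ranked OSM element type.
// Unknown types rank lowest; on a tie the first record is kept.
function preferByOsmType(a, b) {
  const rankA = OSM_TYPE_RANK[osmType(a)] || 0;
  const rankB = OSM_TYPE_RANK[osmType(b)] || 0;
  return rankB > rankA ? b : a;
}

// Placeholder ids for illustration only.
const oldNode = { source_id: 'node/1' };
const newWay = { source_id: 'way/2' };

console.log(preferByOsmType(oldNode, newWay).source_id); // way/2
```

A check like this would only make sense as a tie-breaker after the existing preferences, since it says nothing about which record has better data.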

bboure avatar Jul 09 '20 12:07 bboure