hyphenated names

Open missinglink opened this issue 9 years ago • 10 comments

names such as 51 Friedrich-Richter-Straße (address-osmnode-2967205513) should be searchable using the tokens ['friedrich','richter','strasse'] as well as ['friedrichrichterstrasse'] and ['friedrich-richter-strasse']
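As a rough illustration of the three token forms requested above, here is a minimal Python sketch. The function name `token_variants` is hypothetical and not part of Pelias; it relies on `str.casefold()` to lowercase the name and fold "ß" to "ss".

```python
def token_variants(street_name):
    """Return the three searchable forms of a hyphenated street name."""
    # casefold() lowercases and maps the German sharp s to "ss"
    folded = street_name.casefold()
    parts = [p for p in folded.split("-") if p]
    return {
        "split": parts,                  # one token per hyphen-separated part
        "joined": ["".join(parts)],      # single compound token
        "hyphenated": ["-".join(parts)]  # single token, hyphens kept
    }


variants = token_variants("Friedrich-Richter-Straße")
print(variants)
```

This only shows the desired outputs; how an analyzer would produce them at index and query time is the open question in this issue.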

missinglink avatar Jul 02 '15 08:07 missinglink

see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-compound-word-tokenfilter.html

missinglink avatar Jul 02 '15 08:07 missinglink

this is how the peliasTwoEdgeGram currently tokenizes that address: [ '51', 'fr', 'fri', 'frie', 'fried', 'friedr', 'friedri', 'friedric', 'friedrich', 'friedrich-' ]
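The output above can be reproduced with a simple model of an edge n-gram analyzer. This is a sketch only: whitespace tokenization, `min_gram=2` and `max_gram=10` are assumptions chosen because they match the listed tokens, not values taken from the Pelias schema.

```python
def edge_ngrams(text, min_gram=2, max_gram=10):
    """Emit prefix n-grams for each whitespace-separated token."""
    grams = []
    for token in text.casefold().split():
        # Prefixes never exceed max_gram, so long tokens are cut short
        upper = min(len(token), max_gram)
        for n in range(min_gram, upper + 1):
            grams.append(token[:n])
    return grams


grams = edge_ngrams("51 Friedrich-Richter-Straße")
print(grams)
```

Under these assumptions the whole street name is one token, so no n-gram ever starts at "richter" or "strasse", which is why those words cannot be matched on their own.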

missinglink avatar Jul 02 '15 08:07 missinglink

Leonardo da Vinci–Fiumicino Airport should be searchable by Fiumicino Airport http://pelias.mapzen.com/doc?id=geoname:6299619

missinglink avatar Jul 02 '15 14:07 missinglink

Add acceptance-tests in order to gauge impact.

dianashk avatar Jul 22 '15 15:07 dianashk

Just checked, this is still an area we could improve. Something to think about for the near-ish future.

orangejulius avatar Jan 28 '16 14:01 orangejulius

This feature will require alt-names as the street name above can have 3 forms:

Friedrich-Richter-Straße
Friedrich Richter Straße
FriedrichRichterStraße

moving to alt-names milestone as it can only be solved for a maximum of 2 cases before then.

missinglink avatar Aug 03 '16 13:08 missinglink

I am facing a similar (maybe simpler?) issue with French names. A search for stade roland-garros should return results similar to those for stade roland garros.

Would it help to add a hyphen (-) to the tokenizer's pattern (see https://github.com/pelias/schema/blob/master/settings.js#L18)? Or would that cause serious regressions with other languages?
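To make the question concrete, here is a small Python sketch comparing a whitespace-only split pattern with one that also splits on hyphens. Both patterns are illustrative assumptions; the actual pattern in settings.js is not reproduced here.

```python
import re

# Hypothetical patterns, for illustration only
WHITESPACE_ONLY = re.compile(r"\s+")
WHITESPACE_OR_HYPHEN = re.compile(r"[\s\-]+")


def tokenize(text, pattern):
    """Split lowercased text on the given separator pattern."""
    return [t for t in pattern.split(text.casefold()) if t]


print(tokenize("stade roland-garros", WHITESPACE_ONLY))
print(tokenize("stade roland-garros", WHITESPACE_OR_HYPHEN))
print(tokenize("stade roland garros", WHITESPACE_OR_HYPHEN))
```

With the hyphen in the pattern, both spellings produce the same tokens. Whether that causes regressions elsewhere would still need to be checked with acceptance tests.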

amatissart avatar Dec 07 '17 14:12 amatissart

As of the last time we checked in, we were waiting for good alt-names support before tackling this feature. We now have that functionality, and it's worth looking at this again.

My guess is that we would want to parse any street names coming in with formats like "Friedrich-Richter-Straße" or "Friedrich Richter Straße" and store an alt-name of "FriedrichRichterStraße". This, combined with proper hyphen handling, would allow us to handle all 3 cases.

Some questions: 1.) would we want to tokenize on hyphens, or handle them in some different way? 2.) Where would we put the code to always take, say, street names and convert them to compound word form? My guess is pelias/model, so that it can be used by all importers. We probably want to start building up a common core of importer functionality anyway.
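A minimal Python sketch of the alt-name idea described above. The helper name `compound_alt_name` is hypothetical, and the rule (strip spaces and hyphens, keep the original casing) is an assumption about what "compound word form" would mean in practice.

```python
import re


def compound_alt_name(street_name):
    """Join hyphen- or space-separated parts into one compound word.

    Returns None when the name has a single part, since no alt-name
    is needed in that case.
    """
    parts = [p for p in re.split(r"[\s\-]+", street_name.strip()) if p]
    if len(parts) < 2:
        return None
    return "".join(parts)


print(compound_alt_name("Friedrich-Richter-Straße"))
print(compound_alt_name("Friedrich Richter Straße"))
```

Note that this naive rule would also turn "Main Street" into "MainStreet", so it would likely need to be limited to languages where compounding is conventional.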

orangejulius avatar Aug 27 '18 01:08 orangejulius

My guess is that we would want to parse any street names coming in with formats like "Friedrich-Richter-Straße" or "Friedrich Richter Straße" and store an alt-name of "FriedrichRichterStraße". This, combined with proper hyphen handling, would allow us to handle all 3 cases.

Yes, that sounds correct

1.) would we want to tokenize on hyphens, or handle them in some different way?

I think tokenizing on hyphens would work, so long as we can handle the issues that tokenizing brings with it (such as not matching main st with main ave but at the same time matching E main st with W main st).

2.) Where would we put the code to always take, say, street names and convert them to compound word form? My guess is pelias/model, so that it can be used by all importers. We probably want to start building up a common core of importer functionality anyway.

I would be hesitant to put this logic in pelias/model. It's clearly super convenient, but it might be better to have the code closer to the data (in the importer), so the importer could make data-specific decisions about its data conventions and optionally apply locale-aware logic that is specific to certain languages or geographies.

The other option would be to pass the locale information down to the pelias/model code so that it is able to work with that metadata.
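As a sketch of what passing locale information down could look like, the Python example below gates compound alt-name generation on a locale code. Everything here is hypothetical: the function name, the `locale` parameter, and the contents of `COMPOUNDING_LOCALES` are illustrative assumptions, not an existing pelias/model API.

```python
import re

# Hypothetical set of locales where compound street names are conventional
COMPOUNDING_LOCALES = {"de"}


def alt_names_for(street_name, locale):
    """Return compound alt-names only for locales that use compounding."""
    if locale not in COMPOUNDING_LOCALES:
        return []
    parts = [p for p in re.split(r"[\s\-]+", street_name.strip()) if p]
    if len(parts) < 2:
        return []
    return ["".join(parts)]


print(alt_names_for("Friedrich-Richter-Straße", "de"))
print(alt_names_for("Main Street", "en"))
```

The same gate could live in each importer instead; the trade-off discussed above is between sharing the logic and keeping data-specific decisions close to the data.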

missinglink avatar Aug 30 '18 17:08 missinglink

Hi @Joxit, Yes, it's long past time we merge this change or something like it. Let us run a quick full planet build with this branch and take a look. Pretty sure it will be something we can merge right away.

I'll let you know tomorrow :)

edit: oops, this was supposed to be a comment on #375

orangejulius avatar Sep 09 '19 16:09 orangejulius