Right single quotation mark in node name causes it to be unsearchable in autocomplete

Describe the bug

When searching via autocomplete for this place (В’ячеслава Чорновола вулиця 8), no results are returned. However, reverse geocoding does return the place.

In the autocomplete request I see that the parser discards everything before the right single quotation mark (’), so the subject becomes 8 ячеслава Чорновола вулиця (street: ячеслава Чорновола вулиця, housenumber: 8).

I tested a different place that has an apostrophe (') in its name instead, and it works as expected:

searched for this place (П'ятихатки Вулиця 11)
autocomplete works as expected, the place is returned
the parser has the subject 11 П'ятихатки Вулиця (street: П'ятихатки Вулиця, housenumber: 11).
reverse geocode also works

Is this a parser issue, or is the schema also affected? I see in this file that this character is not included.

Steps to Reproduce

Using autocomplete, search for a place that has a right single quotation mark in its name.

Expected behavior

The place should be returned, since it exists in the database.

Sep 14 '22 06:09 BrindusaN

Is it one of these quotes? https://github.com/pelias/parser/blob/master/tokenization/split_funcs.js#L10

The Pelias parser treats those quotes as word boundaries, although there is a code comment below noting that this should only be for quote pairs.
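To illustrate the effect described above, here is a minimal sketch of a splitter that treats certain quote characters as word boundaries. It is not the actual Pelias code: `QUOTE_BOUNDARIES` and `splitOnBoundaries` are hypothetical names, and the character set is an assumption chosen for the example rather than the real list in split_funcs.js.

```javascript
// Hypothetical boundary set for illustration only;
// NOT the actual list from split_funcs.js.
const QUOTE_BOUNDARIES = new Set(['\u2018', '\u2019', '\u201C', '\u201D']);

// Split text on whitespace and on any character in `boundaries`.
function splitOnBoundaries(text, boundaries = QUOTE_BOUNDARIES) {
  const tokens = [];
  let current = '';
  for (const ch of text) {
    if (/\s/.test(ch) || boundaries.has(ch)) {
      if (current) tokens.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

// U+2019 is in the boundary set, so the leading "В" is cut off from the street name.
console.log(splitOnBoundaries('В\u2019ячеслава Чорновола вулиця 8'));

// The ASCII apostrophe (U+0027) is not in the set, so the name stays in one token.
console.log(splitOnBoundaries("П'ятихатки Вулиця 11"));
```

In this sketch the first query yields a separate one-letter token for "В", which is consistent with the street being reported as ячеслава Чорновола вулиця in the issue.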

Sep 14 '22 08:09 missinglink

I'm not sure whether this is a data error or a code error. Surely 'apostrophe' is the correct character to use? A dictionary defines an apostrophe as:

a mark ' used to indicate the omission of letters or figures

The same dictionary describes a quotation mark as:

used chiefly to indicate the beginning and the end of a quotation in which the exact phraseology of another or of a text is directly cited

Sep 14 '22 08:09 missinglink

Hi,

Yes, it is one of the characters in the split_funcs.

AFAIK the right single quotation mark can be used in some languages to alter the sound of a letter (a diacritical mark). [Wikipedia](https://en.wikipedia.org/wiki/Right_single_quotation_mark#:~:text=The%20Unicode%20character%20'%20(U%2B,right%20(closing)%20quotation%20mark.) describes a right single quotation mark as:

The Unicode character ’ (U+2019 right single quotation mark) is used both for a typographic apostrophe and a single right (closing) quotation mark.

Both the apostrophe and the right single quotation mark act as modifier letters in the Ukrainian language.

Sep 14 '22 15:09 BrindusaN

Agh ok, thanks for posting that link, we're definitely in this situation of "difficulty of software distinguishing which character is intended by a user's typing".

I don't have the time to work on this right now, but I'd be fine with removing it from the quotes array. The question is: will that break anything?

A more robust solution would involve splitting these quotes into opening/closing pairs and only considering them as word boundaries when both exist in the text, although this may cause issues with autocomplete.
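The pair-based idea could be sketched roughly as follows. This is an assumption-laden illustration, not a proposed patch: `QUOTE_PAIRS`, `pairedQuoteIndexes`, and `tokenize` are hypothetical names, and the pair list is an example rather than the set the parser uses.

```javascript
// Hypothetical opening -> closing pairs; an assumption for this sketch only.
const QUOTE_PAIRS = new Map([
  ['\u2018', '\u2019'], // ‘ ’
  ['\u201C', '\u201D'], // “ ”
  ['\u00AB', '\u00BB'], // « »
]);

// Return the indexes (into `chars`) of quotes that belong to a matched
// opening/closing pair. Unmatched quotes are not reported.
function pairedQuoteIndexes(chars) {
  const boundaries = new Set();
  const open = []; // stack of { ch, index } for unmatched opening quotes
  chars.forEach((ch, index) => {
    if (QUOTE_PAIRS.has(ch)) {
      open.push({ ch, index });
      return;
    }
    // If `ch` closes the nearest compatible opener, mark both as boundaries.
    for (let i = open.length - 1; i >= 0; i--) {
      if (QUOTE_PAIRS.get(open[i].ch) === ch) {
        boundaries.add(open[i].index);
        boundaries.add(index);
        open.length = i; // drop this opener and any unmatched ones nested inside it
        break;
      }
    }
  });
  return boundaries;
}

// Split on whitespace and on paired quotes only.
function tokenize(text) {
  const chars = Array.from(text);
  const boundaries = pairedQuoteIndexes(chars);
  const tokens = [];
  let current = '';
  chars.forEach((ch, index) => {
    if (/\s/.test(ch) || boundaries.has(index)) {
      if (current) tokens.push(current);
      current = '';
    } else {
      current += ch;
    }
  });
  if (current) tokens.push(current);
  return tokens;
}

// A lone U+2019 has no opening partner, so it stays inside the word.
console.log(tokenize('В\u2019ячеслава Чорновола вулиця 8'));

// A matched ‘ ’ pair is still treated as a pair of word boundaries.
console.log(tokenize('\u2018Main\u2019 Street'));
```

Two limitations of this sketch are worth noting. During autocomplete the closing quote may not have been typed yet, so a quoted phrase is only split once the pair is complete. A typographic apostrophe inside a quoted phrase would also be matched as the closing quote.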

Sep 15 '22 07:09 missinglink
