Right single quotation mark in node name causes it to be unsearchable in autocomplete

Open BrindusaN opened this issue 2 years ago • 4 comments

Describe the bug

When searching via autocomplete for this place (В’ячеслава Чорновола вулиця 8), no results are returned. However, reverse geocoding does return the place.

In the autocomplete request I see that the parser discards everything before the right single quotation mark (’), so the subject becomes 8 ячеслава Чорновола вулиця (street: ячеслава Чорновола вулиця, housenumber: 8).

I have tested with a different place that has an apostrophe in its name instead, and it works as expected:

  • searched for this place (П'ятихатки Вулиця 11)
  • autocomplete works as expected, the place is returned
  • the parser has the subject 11 П'ятихатки Вулиця (street: П'ятихатки Вулиця, housenumber: 11).
  • reverse geocode also works

Is this a parser issue, or is the schema also affected? I see in this file that this character is not included.

Steps to Reproduce

Search for a place that has a right single quotation mark in its name using autocomplete.

Expected behavior

The place should be returned, since it exists in the database.

Environment (please complete the following information)

NA

Pastebin/Screenshots

NA

Additional context

NA

References

NA

BrindusaN avatar Sep 14 '22 06:09 BrindusaN

Is it one of these quotes? https://github.com/pelias/parser/blob/master/tokenization/split_funcs.js#L10

The Pelias parser treats those quotes as word boundaries, although there is a code comment below noting that this should only be for quote pairs.
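
To illustrate the effect, here is a minimal sketch. It is not the actual Pelias code: `QUOTES` and `splitOnBoundaries` are hypothetical names, and the character list is an assumed subset. The only things taken from this thread are that U+2019 is among the boundary characters and that the plain apostrophe (U+0027) example parses intact.

```javascript
// Hypothetical subset of quote characters treated as word boundaries.
// U+2019 (right single quotation mark) is included; U+0027 (apostrophe) is not.
const QUOTES = '"\u00AB\u00BB\u2018\u2019\u201C\u201D';

// Split text into tokens wherever whitespace or a "quote" character occurs.
function splitOnBoundaries(text) {
  const tokens = [];
  let current = '';
  for (const ch of text) {
    if (/\s/.test(ch) || QUOTES.includes(ch)) {
      // Boundary reached: close the current token, drop the boundary character.
      if (current) tokens.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

// With U+2019, the leading "В" is cut off into its own one-letter token.
console.log(splitOnBoundaries('В\u2019ячеслава Чорновола вулиця 8'));

// With U+0027, "П'ятихатки" stays a single token.
console.log(splitOnBoundaries("П'ятихатки Вулиця 11"));
```

If a later step drops or ignores the stray one-letter token, the street name would end up as "ячеслава Чорновола вулиця", which matches what the issue reports.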

missinglink avatar Sep 14 '22 08:09 missinglink

I'm not sure if this is a data error or a code error; surely 'apostrophe' is the correct character to use?

a mark ' used to indicate the omission of letters or figures

The same dictionary describes a quotation mark as:

used chiefly to indicate the beginning and the end of a quotation in which the exact phraseology of another or of a text is directly cited

missinglink avatar Sep 14 '22 08:09 missinglink

Hi,

Yes, it is one of the characters in the split_funcs.

AFAIK the right single quotation mark can be used in some languages to alter the sound of a letter (a diacritical mark). [Wikipedia](https://en.wikipedia.org/wiki/Right_single_quotation_mark#:~:text=The%20Unicode%20character%20'%20(U%2B,right%20(closing)%20quotation%20mark.) describes a right single quotation mark as:

The Unicode character ’ (U+2019 right single quotation mark) is used both for a typographic apostrophe and a single right (closing) quotation mark.

Both the apostrophe and the right single quotation mark are used as modifier letters, for example in the Ukrainian language.
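
Since these characters look nearly identical on screen, it can help to inspect the code points in a place name directly. A small sketch follows; `LOOKALIKES` and `describeApostrophes` are names made up for illustration. Note that Unicode also defines a separate character, U+02BC MODIFIER LETTER APOSTROPHE, which is included here for comparison.

```javascript
// Apostrophe-like characters and their Unicode names, keyed by code point.
const LOOKALIKES = new Map([
  [0x0027, 'APOSTROPHE'],
  [0x2019, 'RIGHT SINGLE QUOTATION MARK'],
  [0x02bc, 'MODIFIER LETTER APOSTROPHE'],
]);

// List every apostrophe-like character found in the text, e.g. "U+2019 ...".
function describeApostrophes(text) {
  const found = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (LOOKALIKES.has(cp)) {
      const hex = cp.toString(16).toUpperCase().padStart(4, '0');
      found.push('U+' + hex + ' ' + LOOKALIKES.get(cp));
    }
  }
  return found;
}

console.log(describeApostrophes('В\u2019ячеслава')); // reports U+2019
console.log(describeApostrophes("П'ятихатки"));      // reports U+0027
```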

BrindusaN avatar Sep 14 '22 15:09 BrindusaN

Agh ok, thanks for posting that link, we're definitely in this situation of "difficulty of software distinguishing which character is intended by a user's typing".

I don't have the time to work on this right now, but I'd be fine with removing it from the quotes array. The question is, will that break anything?

A more robust solution would involve splitting these quotes into opening/closing pairs and only considering them as word boundaries when both exist in the text, although this may cause issues with autocomplete.
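
One possible reading of the pair-based idea, sketched under assumptions: the `PAIRS` table and `quoteBoundaryIndexes` are hypothetical and do not come from the Pelias codebase. A quote character counts as a boundary only when it is part of a complete opening/closing pair.

```javascript
// Hypothetical opening -> closing quote pairs (an assumed subset).
const PAIRS = new Map([
  ['\u2018', '\u2019'], // single curly quotes
  ['\u201C', '\u201D'], // double curly quotes
  ['\u00AB', '\u00BB'], // guillemets
]);

// Return the code-point indexes of quote characters that form a complete pair.
// Unpaired quotes, such as a lone U+2019 used as an apostrophe, are ignored.
function quoteBoundaryIndexes(text) {
  const chars = Array.from(text);
  const boundaries = [];
  const open = []; // stack of { closer, index } for quotes awaiting a match
  chars.forEach((ch, i) => {
    const top = open[open.length - 1];
    if (top && ch === top.closer) {
      // The closing quote matches the most recent opener: both are boundaries.
      boundaries.push(top.index, i);
      open.pop();
    } else if (PAIRS.has(ch)) {
      open.push({ closer: PAIRS.get(ch), index: i });
    }
  });
  return boundaries.sort((a, b) => a - b);
}

// A lone U+2019 has no opener, so no boundary is reported.
console.log(quoteBoundaryIndexes('В\u2019ячеслава Чорновола вулиця 8'));

// A complete pair is reported as two boundary positions.
console.log(quoteBoundaryIndexes('\u2018Main\u2019 St'));
```

With partial input such as `‘Main`, the closing quote has not been typed yet, so the opening quote is not treated as a boundary until the pair completes. That change in tokenization between keystrokes may be one way the autocomplete concern shows up.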

missinglink avatar Sep 15 '22 07:09 missinglink