parser
Right single quotation mark in node name causes it to be unsearchable in autocomplete
Describe the bug
When searching via autocomplete for this place (В’ячеслава Чорновола вулиця 8), no results are returned. However, reverse geocoding does return the place.
In the autocomplete request I see that the parser disregards everything in front of the right single quotation mark (’), causing the subject to be 8 ячеслава Чорновола вулиця (street: ячеслава Чорновола вулиця, housenumber: 8).
I have tested with a different place that has an apostrophe in its name instead, and it works as expected:
- searched for this place (П'ятихатки Вулиця 11): autocomplete works as expected, the place is returned
- the parser has the subject 11 П'ятихатки Вулиця (street: П'ятихатки Вулиця, housenumber: 11)
- reverse geocode also works
Is this a parser issue, or is the schema also affected? I see in this file that this character is not included.
Steps to Reproduce
Search for a place that has a right single quotation mark in its name using autocomplete
Expected behavior
The place should be returned, since it exists in the database.
Environment (please complete the following information): NA
Pastebin/Screenshots NA
Additional context NA
References
NA
Is it one of these quotes? https://github.com/pelias/parser/blob/master/tokenization/split_funcs.js#L10
The Pelias parser treats those quotes as word boundaries, although there is a code comment below noting that this should only be for quote pairs.
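To illustrate the effect described above, here is a minimal sketch of a tokenizer that treats a set of quote characters as word boundaries. It is not the Pelias implementation: `QUOTES` and `splitOnQuotes` are hypothetical names, and the character list is an assumption (the real list is in `tokenization/split_funcs.js`). It only shows why U+2019 splits the street name while the ASCII apostrophe does not.

```javascript
// Assumed boundary characters for illustration only; the actual list
// lives in tokenization/split_funcs.js and may differ.
const QUOTES = ['"', '\u00AB', '\u00BB', '\u2018', '\u2019', '\u201C', '\u201D'];

// Hypothetical helper: split on whitespace and on any character in `quotes`.
function splitOnQuotes(text, quotes = QUOTES) {
  const tokens = [];
  let current = '';
  for (const ch of text) {
    if (/\s/.test(ch) || quotes.includes(ch)) {
      // Boundary reached: flush the token collected so far.
      if (current) tokens.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

// U+2019 (right single quotation mark) is in the boundary set, so the name is cut in two.
const curly = splitOnQuotes('В\u2019ячеслава Чорновола вулиця 8');
console.log(curly); // [ 'В', 'ячеслава', 'Чорновола', 'вулиця', '8' ]

// U+0027 (ASCII apostrophe) is not in the set, so the name stays intact.
const straight = splitOnQuotes("П'ятихатки Вулиця 11");
console.log(straight); // [ "П'ятихатки", 'Вулиця', '11' ]
```

In this sketch the leading `В` becomes a separate one-letter token, which is consistent with the reported subject losing everything before the quotation mark.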
I'm not sure if this is a data error or a code error. Surely 'apostrophe' is the correct character to use? A dictionary definition of an apostrophe:
a mark ' used to indicate the omission of letters or figures
The same dictionary describes a quotation mark as:
used chiefly to indicate the beginning and the end of a quotation in which the exact phraseology of another or of a text is directly cited
Hi,
Yes, it is one of the characters in the split_funcs.
AFAIK the right single quotation mark can be used in some languages to alter the sound of a letter (a diacritical mark). [Wikipedia](https://en.wikipedia.org/wiki/Right_single_quotation_mark#:~:text=The%20Unicode%20character%20'%20(U%2B,right%20(closing)%20quotation%20mark.) describes a right single quotation mark as:
The Unicode character ’ (U+2019 right single quotation mark) is used both for a typographic apostrophe and a single right (closing) quotation mark.
Both the apostrophe and the right single quotation mark can function like modifier letters, and both are used this way in Ukrainian.
Agh ok, thanks for posting that link, we're definitely in this situation of "difficulty of software distinguishing which character is intended by a user's typing".
I don't have the time to work on this right now, but I'd be fine with removing it from the quotes array. The question is, will that break anything?
A more robust solution would involve splitting these quotes into opening/closing pairs and only considering them as word boundaries when both exist in the text, although this may cause issues with autocomplete.
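A rough sketch of that pairing idea, under stated assumptions: `QUOTE_PAIRS`, `pairedQuoteBoundaries` and `tokenize` are hypothetical names, the pair list is illustrative, and none of this is taken from the Pelias source. A quote character only counts as a word boundary when its opening character and a later closing character both appear in the text.

```javascript
// Assumed opening/closing pairs, for illustration only.
const QUOTE_PAIRS = [
  ['\u2018', '\u2019'], // single curly quotes
  ['\u201C', '\u201D'], // double curly quotes
  ['\u00AB', '\u00BB'], // guillemets
];

// Collect the quote characters that occur as a complete open...close pair.
function pairedQuoteBoundaries(text) {
  const boundaries = new Set();
  for (const [open, close] of QUOTE_PAIRS) {
    const openIdx = text.indexOf(open);
    if (openIdx === -1) continue;
    if (text.indexOf(close, openIdx + 1) !== -1) {
      boundaries.add(open);
      boundaries.add(close);
    }
  }
  return boundaries;
}

// Split on whitespace, and on quotes only when they form a pair in this text.
function tokenize(text) {
  const boundaries = pairedQuoteBoundaries(text);
  const tokens = [];
  let current = '';
  for (const ch of text) {
    if (/\s/.test(ch) || boundaries.has(ch)) {
      if (current) tokens.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

// A lone U+2019 has no opening partner, so the street name stays whole.
console.log(tokenize('В\u2019ячеслава Чорновола вулиця 8').length); // 4

// A complete pair is still treated as a boundary.
console.log(tokenize('\u2018Main\u2019 Street')); // [ 'Main', 'Street' ]

// Partial autocomplete input: the closing quote has not been typed yet,
// so the opening quote stays attached to the token.
console.log(tokenize('\u2018Main').length); // 1
```

The last call shows the autocomplete concern: while the user is still typing, the closing quote is missing, so the same text tokenizes differently before and after the pair is completed.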