placeholder
placeholder copied to clipboard
Disambiguation removal logic truncates remaining search text
Tokenization has a regex that removes disambiguation markers. This can be helpful, but it's currently truncating remaining text after the first occurrence of any disambiguation character.
So: "Portland (Oregon) USA" becomes "portland" "Borivali (West), Mumbai, India" becomes "borivali" etc
Should it just remove the disambiguation part and leave the rest of the text as it is? (I understand that's going to be impossible where we don't have a "closing" disambiguation marker - e.g. just a simple "-")
Or is it just easier to remove the marker characters only and leave the disambiguation text as it is?
For example, change the regex to
input.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ');
With this: "Portland (Oregon) USA" becomes "portland oregon usa" "Borivali (West), Mumbai, India" becomes "borivali west india"
Referring to this: https://github.com/pelias/placeholder/pull/49/commits/d7de4c9bcedfc4a1670480f339b5b4ff246ae373#diff-b1c9f1b1a4d867ea6fd37744bd1b38e5
I believe this is correct as-is. The intention is to remove all parts of the text which aren't the 'subject'.
So in the case of "Borivali (West), Mumbai, India" we are only looking for 'Borivali', the additional tokens which help localize it to Mumbai India shouldn't be included in the index.
The associations to Mumbai & India should be made via their hierarchical links instead, so that we understand the parent-child relationship of these tokens.
Can you provide an example of a query which is currently failing due to this?
How we have it currently allows us to show a clear hierarchy of the tokens:
Borivali neighbourhood 85933015
└ Mumbai locality 102030609
└ Mumbai City MU county 890503073
└ Maharashtra MH region 85672171
└ India IND country 85632469
└ Asia continent 102191569
Because of the truncation, searching for "Portland (Oregon) USA" yields match from Jamaica as well.

And searching for "Borivali (East), MH, India" yields Borivali West as the first match.

"3 Store, 311-318 High Holborn, London, WC1V 7BN, UK" returns no matches. Instead of returning the following (screenshot taken from a modified instance where I removed the disambiguation regex)

Similarly, "1313 1/2 Railroad Ave Bellingham WA 98225-4729" returns no matches.
"St. Judes & St. Pauls C of E (Va) Primary School, 10 Kingsbury Road, London, N1 4AZ" returns a wrong result.
"〒100-8994, 東京都 中央区 八重洲一丁目 5番3号 東京中央郵便局, Japan" returns no result.
There may be some more examples. I took some here from Falsehoods.
The main problem I see is that truncating at a disambiguation character removes all trailing address information - the lineage - which is crucial in determining the location.
BTW, what I did was replace these characters with a space.
text = text.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ').trim();
My guess is that giving more tokens to Placeholder, will allow it to perform a better match. And it seems to be working well with it.
Oh I see we were talking about slightly different topics.
The original intention of the regex was to fix erroneous data at import-time.
It seems we are using the same analysis at query-time that we're using at index-time and so maybe you're right, we might consider making them separate analyzers so they can have different functions.
Thanks for the examples, they are certainly helpful, although I don't expect us to be able to handle all the edge cases from that Falsehoods post because this library doesn't have any awareness of addresses.
Awesome.
And sure, I wouldn't expect Placeholder to handle different address oddities. Placeholder should stay focused on "last line parsing".