placeholder Disambiguation removal logic truncates remaining search text

Tokenization has a regex that removes disambiguation markers. This can be helpful, but it's currently truncating remaining text after the first occurrence of any disambiguation character.

So "Portland (Oregon) USA" becomes "portland", "Borivali (West), Mumbai, India" becomes "borivali", etc.

Should it just remove the disambiguation part and leave the rest of the text as it is? (I understand that's going to be impossible where we don't have a "closing" disambiguation marker - e.g. just a simple "-")

Or is it just easier to remove the marker characters only and leave the disambiguation text as it is?

For example, change the regex to input.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ');

With this, "Portland (Oregon) USA" becomes "portland oregon usa" and "Borivali (West), Mumbai, India" becomes "borivali west india".
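
To make the difference concrete, here is a rough sketch of the two behaviours side by side. It is not the actual tokenizer code: `truncateAtMarker` is a hypothetical stand-in for what happens today, and `replaceMarkers` applies the regex suggested above. Both only lowercase and trim, whereas the real tokenizer does more.

```javascript
// Marker characters taken from the regex suggested in this issue.
const MARKERS = /[-֊־‐‑﹣\(\)\[\]]/;

// Hypothetical stand-in for the current behaviour:
// everything from the first marker onward is dropped.
function truncateAtMarker(input) {
  const idx = input.search(MARKERS);
  const kept = idx === -1 ? input : input.slice(0, idx);
  return kept.trim().toLowerCase();
}

// Suggested behaviour: replace each marker character with a space
// and keep the surrounding text.
function replaceMarkers(input) {
  return input
    .replace(new RegExp(MARKERS.source, 'g'), ' ')
    .replace(/\s+/g, ' ') // collapse the doubled spaces left behind
    .trim()
    .toLowerCase();
}

console.log(truncateAtMarker('Portland (Oregon) USA')); // portland
console.log(replaceMarkers('Portland (Oregon) USA'));   // portland oregon usa
```

With a marker that has no closing counterpart, such as the hyphen in "98225-4729", the first function drops "4729" while the second keeps it as a separate token.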

Referring to this: https://github.com/pelias/placeholder/pull/49/commits/d7de4c9bcedfc4a1670480f339b5b4ff246ae373#diff-b1c9f1b1a4d867ea6fd37744bd1b38e5

Oct 03 '19 06:10 niravmehta

I believe this is correct as-is. The intention is to remove all parts of the text which aren't the 'subject'.

So in the case of "Borivali (West), Mumbai, India" we are only looking for 'Borivali'; the additional tokens which help localize it to Mumbai, India shouldn't be included in the index.

The associations to Mumbai & India should be made via their hierarchical links instead, so that we understand the parent-child relationship of these tokens.

Can you provide an example of a query which is currently failing due to this?

Oct 03 '19 10:10 missinglink

How we have it currently allows us to show a clear hierarchy of the tokens:

Borivali neighbourhood 85933015
└ Mumbai locality 102030609
   └ Mumbai City MU county 890503073
      └ Maharashtra MH region 85672171
         └ India IND country 85632469
            └ Asia continent 102191569

Oct 03 '19 10:10 missinglink

Because of the truncation, searching for "Portland (Oregon) USA" yields a match from Jamaica as well.

And searching for "Borivali (East), MH, India" yields Borivali West as the first match.

"3 Store, 311-318 High Holborn, London, WC1V 7BN, UK" returns no matches, although it does return matches on a modified instance where I removed the disambiguation regex.

Similarly, "1313 1/2 Railroad Ave Bellingham WA 98225-4729" returns no matches.

"St. Judes & St. Pauls C of E (Va) Primary School, 10 Kingsbury Road, London, N1 4AZ" returns a wrong result.

"〒100-8994, 東京都中央区八重洲一丁目 5番3号東京中央郵便局, Japan" returns no result.

There may be some more examples. I took some of these from Falsehoods.

The main problem I see is that truncating at a disambiguation character removes all trailing address information - the lineage - which is crucial in determining the location.

Oct 04 '19 05:10 niravmehta

BTW, what I did was replace these characters with a space.

text = text.replace(/[-֊־‐‑﹣\(\)\[\]]/g, ' ').trim();

My guess is that giving more tokens to Placeholder will allow it to perform a better match. And it seems to be working well with it.

Oct 04 '19 05:10 niravmehta

Oh I see we were talking about slightly different topics.

The original intention of the regex was to fix erroneous data at import-time.

It seems we are using the same analysis at query-time that we're using at index-time, so maybe you're right: we might consider making them separate analyzers so they can have different functions.
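
A rough sketch of what that split might look like is below. The names `analyzeForIndex` and `analyzeForQuery` are made up for illustration and are not part of placeholder's API, and the whitespace/comma token split is a simplifying assumption rather than the real tokenizer.

```javascript
// Marker characters taken from the regex discussed in this issue.
const MARKERS = /[-֊־‐‑﹣\(\)\[\]]/g;

// Index-time (hypothetical): keep only the 'subject' of a name,
// dropping everything from the first disambiguation marker onward.
function analyzeForIndex(name) {
  return name.split(MARKERS)[0].trim().toLowerCase();
}

// Query-time (hypothetical): keep the trailing text and only
// neutralize the marker characters, so the lineage tokens survive.
function analyzeForQuery(text) {
  return text
    .replace(MARKERS, ' ')
    .toLowerCase()
    .split(/[\s,]+/) // simplified split on whitespace and commas
    .filter(Boolean);
}

console.log(analyzeForIndex('Borivali (West)'));
console.log(analyzeForQuery('Borivali (East), MH, India'));
```

The index keeps a clean subject token per record, while a query such as "Borivali (East), MH, India" still carries "east", "mh" and "india" through to matching.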

Thanks for the examples, they are certainly helpful, although I don't expect us to be able to handle all the edge cases from that Falsehoods post because this library doesn't have any awareness of addresses.

Oct 04 '19 17:10 missinglink

Awesome.

And sure, I wouldn't expect Placeholder to handle different address oddities. Placeholder should stay focused on "last line parsing".

Oct 05 '19 05:10 niravmehta
