use libpostal parses for venue queries where available

Open missinglink opened this issue 6 years ago • 6 comments

We're currently not using libpostal parses for venues: if we see a venue parse, we fall back to the native parser. I don't remember the history of this, but it seems wrong to me 🤷‍♂️

I noticed this when looking into some bug reports, one example being "Café Pelias". There are two things currently going wrong with this query:

  • libpostal parses it correctly, but the query fails with the message "No query to call ES with. Skipping"
  • the native parser, used as the fallback, then parses it incorrectly (I'll open a separate issue for this on that repo)

So regarding the first point, I don't see why we would throw away the venue parse here from libpostal:

"parsed_text": {
  "query": "Café Pelias"
}

The query label has actually been mapped from the libpostal house field in controller/libpostal, but this field indicates a venue name.
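To illustrate the mapping described above, here is a minimal sketch, not the actual controller/libpostal code. Only the house → query mapping is taken from this thread; the other two entries, the function name toParsedText, and the { label, value } input shape are assumptions made for illustration.

```javascript
// Hypothetical label mapping. Only house -> query is confirmed by this thread;
// the remaining entries are assumed for illustration.
const FIELD_MAPPING = new Map([
  ['house', 'query'],          // libpostal uses "house" for venue/building names
  ['house_number', 'number'],  // assumption
  ['road', 'street'],          // assumption
]);

// Assumed input shape: [{ label: 'house', value: 'Café Pelias' }, ...]
// Labels without a mapping are dropped.
function toParsedText(components) {
  const parsedText = {};
  for (const { label, value } of components) {
    const field = FIELD_MAPPING.get(label);
    if (field !== undefined) {
      parsedText[field] = value;
    }
  }
  return parsedText;
}

console.log(toParsedText([{ label: 'house', value: 'Café Pelias' }]));
// prints { query: 'Café Pelias' }
```

Under these assumptions, a parse containing only a house component yields a parsed_text with a single query field, which matches the output shown above.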

As you can see from the PR edits, we don't currently consider these venue parses for query generation, and I'm not sure why. I believe libpostal is still superior to the native parser when it comes to venue queries, and that it has always been better than addressit was.
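The difference between the existing and proposed behaviour can be sketched roughly as follows. This is a simplified illustration, not the real Pelias code: the helper names, the useVenueParses flag, and the assumption that a "venue parse" is any parse carrying a query field are all hypothetical.

```javascript
// Hypothetical: treat any parse that carries a `query` field as a venue parse.
function isVenueParse(parsedText) {
  return Object.prototype.hasOwnProperty.call(parsedText, 'query');
}

// Returns the name of the parser whose result would be used for query generation.
// useVenueParses = false models the existing behaviour described in this issue;
// useVenueParses = true models the proposed behaviour.
function chooseParser(parsedText, useVenueParses) {
  if (isVenueParse(parsedText) && !useVenueParses) {
    return 'native'; // venue parse is discarded, fall back to the native parser
  }
  return 'libpostal';
}

const parsed = { query: 'Café Pelias' };
console.log(chooseParser(parsed, false)); // prints native
console.log(chooseParser(parsed, true));  // prints libpostal
```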

Thoughts?

missinglink avatar Nov 08 '19 10:11 missinglink

The only reason I can think of for the existing behaviour is that libpostal sometimes erroneously identifies things as house, and we were trying to guard against that?

missinglink avatar Nov 08 '19 10:11 missinglink

I rebased this and put it up on dev today. It fixes the "vanity addresses" issue we've been discussing:

[3 screenshots, 2020-09-23 at 11:33:39, 11:33:04 and 11:35:50]

cc/ @blackmad

missinglink avatar Sep 23 '20 09:09 missinglink

linked https://github.com/pelias/acceptance-tests/pull/533

missinglink avatar Sep 23 '20 10:09 missinglink

I ran the full acceptance test suite on this today and there were actually quite a few improvements, but at the same time it highlighted some issues.

diff of changes vs. production: https://www.diffchecker.com/5Faotyih (ignore any errors related to /v1/reverse)

screenshots of some issues inherited from libpostal:

[2 screenshots, 2020-09-23 at 15:26:06 and 15:21:23]

missinglink avatar Sep 23 '20 14:09 missinglink

Yeah, I suspect there are two reasons why this was never implemented in the past:

  • A lot of our early libpostal work was done with little concern for venues; we were really thinking mostly about addresses.
  • There are surely many cases where libpostal doesn't do a great job accurately detecting venues. Either false positives or false negatives would impact results in ways that are difficult to fix.

The first reason is obviously not a good one, but I imagine the hard part of actually merging this will be ensuring there aren't too many cases where, for example, something that is very much not a venue query, like one for an admin area or address, will be made worse.

orangejulius avatar Sep 23 '20 14:09 orangejulius

Right, so the question is "which parser does a better job of venues?" and the answer is "no" 😆

missinglink avatar Sep 23 '20 14:09 missinglink