
Consider using libpostal

Open otbutz opened this issue 6 years ago • 17 comments

Maybe it's worth evaluating whether an optional libpostal integration could improve search results. Talking to libpostal directly might be a bit too low-level, but we could use the same strategy as Pelias and call a REST wrapper: https://github.com/whosonfirst/go-whosonfirst-libpostal#wof-libpostal-server
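A minimal sketch of the REST-wrapper idea, for illustration only. It assumes a wof-libpostal-server instance is reachable at http://localhost:8080, that it exposes a /parse endpoint taking an "address" query parameter, and that it answers with a JSON list of {"label", "value"} objects. The host, port, and all function names below are assumptions, not part of Nominatim.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a wof-libpostal-server instance listens here and exposes /parse.
LIBPOSTAL_URL = "http://localhost:8080"


def build_parse_url(address, base_url=LIBPOSTAL_URL):
    """Return the URL that asks the libpostal service to parse `address`."""
    return f"{base_url}/parse?{urlencode({'address': address})}"


def decode_components(body):
    """Turn the JSON body (a list of {"label", "value"} objects) into a dict.

    If a label occurs more than once, its values are joined with a space.
    """
    components = {}
    for item in json.loads(body):
        label, value = item["label"], item["value"]
        if label in components:
            components[label] = f"{components[label]} {value}"
        else:
            components[label] = value
    return components


def parse_address(address, fetch=None):
    """Parse `address` via the HTTP service; `fetch` can be replaced in tests."""
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=5) as response:
                return response.read().decode("utf-8")
    return decode_components(fetch(build_parse_url(address)))
```

Keeping the HTTP call behind `fetch` means the caller never loads the libpostal data itself, which is the point of the Pelias-style setup.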

This would probably solve issues like #759

otbutz avatar Jul 26 '18 12:07 otbutz

I tried libpostal a few weeks ago with some German input combinations.

Input "[street] [house_number] [city]" was parsed correctly, which I thought was amazing! But "[city] [street] [house_number]" was wrong: [city] was detected as "poi-name".

Alex2782 avatar Jul 31 '18 09:07 Alex2782

I'd certainly expect bugs, but those should be fixed in the libpostal project rather than worked around in Nominatim.

otbutz avatar Jul 31 '18 09:07 otbutz

https://github.com/openvenues/php-postal Maybe that would work too, like the Go service, since Nominatim is written in PHP? (I will try it in 2-3 weeks, maybe.)

https://pelias.io/index.html

Libpostal: Pelias uses the libpostal project for parsing addresses using the power of machine learning. Originally we loaded the 2GB of libpostal data directly in the API service, but this makes scaling harder and causes the API to take about 30 seconds to start, instead of a few milliseconds. We use a Go service built by the Who's on First team to make this happen quickly and efficiently.

https://github.com/openvenues/libpostal/issues/314

In general, customization/retraining is not a major goal of the project. The focus is on improving the common library for everyone instead of having lots of custom-trained models.

For me this is a big disadvantage if you want to use Nominatim only with certain countries.

Alex2782 avatar Jul 31 '18 10:07 Alex2782

https://github.com/openvenues/php-postal Maybe that would work too, like the Go service, since Nominatim is written in PHP? (I will try it in 2-3 weeks, maybe.)

True, but you would be limited to a libpostal installation on the same server.

Possible problems:

  • you want to run both Nominatim and Pelias with libpostal -> 2x memory usage
  • Slow (re)start of Nominatim
  • a complex Docker container build if you want to use libpostal

In general, customization/retraining is not a major goal of the project. The focus is on improving the common library for everyone instead of having lots of custom-trained models.

For me this is a big disadvantage if you want to use Nominatim only with certain countries.

Apart from memory consumption, I don't see a problem here. It's better to rely on a general, properly tested model than on an error-prone specialized one.

otbutz avatar Jul 31 '18 11:07 otbutz

Neither "wof-libpostal-server" nor "a complex Docker" build is required, only 2 GB more RAM on the same server.

  1. https://github.com/openvenues/libpostal
  2. https://github.com/openvenues/php-postal

I had problems with pkg-config under CentOS 7 when running:

pkg-config --cflags --libs libpostal

An environment variable was necessary:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

Then activate the extension in /etc/php.ini or /etc/php.d/postal.ini and restart Apache:

sudo systemctl restart httpd

Only the httpd start takes longer than usual; php-postal answers instantly.


some tests with Postal\Parser::parse_address( {string_input} )

#1132

Nordstraße 5, 27476 Cuxhaven
Nordstraße 3, 27476, Cuxhaven
Nordstraße 3 27476, Cuxhaven

Output:

1: road = nordstraße, house_number = 5, postcode = 27476, city = cuxhaven

2 and 3: road = nordstraße, house_number = 3, postcode = 27476, city = cuxhaven

for structured search ? https://nominatim.openstreetmap.org/search.php?&street=Nordstra%C3%9Fe+3&city=Cuxhaven&postalcode=27476
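As a rough sketch of how parsed components could feed a structured search: the function below maps libpostal-style labels (road, house_number, postcode, city, ...) onto Nominatim's structured parameters (street, city, state, country, postalcode). The label-to-parameter table and the function name are assumptions made for illustration. The house number is placed before the street name, following the "[housenumber] [streetname]" form given in the OSM wiki.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search.php"

# libpostal label -> Nominatim structured-search parameter (assumed mapping).
LABEL_TO_PARAM = {
    "city": "city",
    "state": "state",
    "country": "country",
    "postcode": "postalcode",
}


def structured_search_url(components, base_url=NOMINATIM_SEARCH):
    """Build a structured search URL from libpostal-style components."""
    params = {}
    # House number and street name travel together in the `street` parameter.
    street = " ".join(
        components[key] for key in ("house_number", "road") if key in components
    )
    if street:
        params["street"] = street
    for label, param in LABEL_TO_PARAM.items():
        if label in components:
            params[param] = components[label]
    return f"{base_url}?{urlencode(params)}"
```

Labels without an obvious structured counterpart (for example suburb or unit) are simply dropped in this sketch; a real integration would have to decide what to do with them.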

but i dont know yet how it can solve issues like #759

Alex2782 avatar Aug 21 '18 20:08 Alex2782

The libpostal php library doesn't scale well because each fpm-worker / apache mod-php instance will load all the libpostal data; so it's 2gb per worker. Typically servers have 10 to 100s of workers.
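The arithmetic behind this concern, as a small sketch. It assumes the roughly 2 GB data size quoted in this thread and that every worker really holds its own copy; the function name and the `shared` switch are illustrative only.

```python
def libpostal_memory_gb(workers, data_gb=2.0, shared=False):
    """Estimate RAM needed for libpostal data across PHP workers.

    shared=False: every worker holds its own copy (the worst case described above).
    shared=True: one copy in total, e.g. a single HTTP service that all workers call.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return data_gb if shared else workers * data_gb


for workers in (10, 100):
    print(workers, "workers:", libpostal_memory_gb(workers), "GB per-worker,",
          libpostal_memory_gb(workers, shared=True), "GB shared")
```

With 10 workers this gives 20 GB, with 100 workers 200 GB, against a constant 2 GB when a single external service holds the data.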

Instead you should look at an HTTP call to the golang libpostal worker.

However, structured search in Nominatim is still experimental and in most cases ?q= fares better so libpostal's value addition is limited.

gopi-ar avatar Aug 22 '18 07:08 gopi-ar

for structured search ?

That was my intention.

but i dont know yet how it can solve issues like #759

Maybe not spelling issues, but it could help with certain omissions, abbreviations, and additions that Nominatim itself handles poorly or not at all.

These two articles explain the benefits of libpostal quite well: https://machinelearnings.co/statistical-nlp-on-openstreetmap-b9d573e6cc86 https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718

otbutz avatar Aug 22 '18 07:08 otbutz

The libpostal php library doesn't scale well because each fpm-worker / apache mod-php instance will load all the libpostal data; so it's 2gb per worker. Typically servers have 10 to 100s of workers.

OK, thanks. This is on my CentOS 7 VM with 16 GB RAM:

free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        3,0G        1,8G        2,1G         10G         10G
Swap:          127M        127M          8K

and htop (idle)

I think this is no problem for us; we sometimes have maybe 10 users at the same time. See the load tests below.

[screenshot: htop]

SoapUI load test of php-postal / httpd: 100 requests per second (screenshots below).

Also no problems at 5000 requests per second:

  1. avg = 50 ms response time (+30 ms longer)
  2. more httpd processes (workers?)
  3. CPU at 10-50 % load
  4. same RAM usage, 5G / 15G; I don't know why there is no difference with more workers / requests

[screenshots: sopaui-test, htop_2]

Alex2782 avatar Aug 22 '18 14:08 Alex2782

It is designed as a shared PHP extension. Maybe it's only loaded once on apache startup? You should check if and how much it affects apache service startup time.

otbutz avatar Aug 22 '18 16:08 otbutz

Yes, the startup time is longer; see my post from yesterday:

.... and restart sudo systemctl restart httpd only the Httpd start takes longer than usual, php-postal answers instantly

Alex2782 avatar Aug 22 '18 16:08 Alex2782

That would be acceptable IMHO. @lonvia, what do you think about an optional libpostal integration via php-postal (https://github.com/openvenues/php-postal)?

otbutz avatar Aug 23 '18 07:08 otbutz

We will try libpostal with Nominatim (German OSM data only), and I can post our experience later.

gopi-ar

However, structured search in Nominatim is still experimental and in most cases ?q= fares better so libpostal's value addition is limited.

https://wiki.openstreetmap.org/wiki/Nominatim

(Commas are optional, but improve performance by reducing the complexity of the search.)

street= [housenumber] [streetname] city=[city] county=[county] state=[state] country=[country] postalcode=[postalcode]

https://nominatim.openstreetmap.org/search.php?&street=3%20Nordstra%C3%9Fe&city=Cuxhaven&postalcode=27476&country=Deutschland

Input = "3 Nordstraße, Cuxhaven, 27476, Deutschland": the structured search parameters initialized the "q" parameter to "[housenumber] [streetname], [city], [postalcode], [country]".

Alex2782 avatar Aug 23 '18 10:08 Alex2782

If you want to use libpostal with Nominatim in this way, you should replace the entire mechanism that creates interpretations of the search query. That means creating one or more SearchDescription objects from the libpostal output, calling query() on each of them, and then filtering and ranking the results appropriately.
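The shape of that flow can be sketched as follows. This is purely hypothetical: `Interpretation` and `Result` are stand-ins invented for illustration and are not Nominatim's SearchDescription class or its query() method, the "drop the house number" fallback reading is just one example of a second interpretation, and `run_query` represents whatever actually hits the database.

```python
from dataclasses import dataclass, replace


@dataclass
class Interpretation:
    """Hypothetical stand-in for one reading of the query."""
    components: dict
    penalty: float = 0.0


@dataclass
class Result:
    """Hypothetical search hit; lower penalty and higher importance rank first."""
    place_id: int
    importance: float
    penalty: float = 0.0


def interpretations_from_libpostal(components):
    """Create one or more interpretations from libpostal output."""
    readings = [Interpretation(dict(components))]
    # Illustrative fallback: the same address without the house number, penalised.
    if "house_number" in components:
        relaxed = {k: v for k, v in components.items() if k != "house_number"}
        readings.append(Interpretation(relaxed, penalty=0.5))
    return readings


def search(components, run_query, limit=10):
    """Query every interpretation, drop duplicate places, and rank the rest."""
    best = {}
    for reading in interpretations_from_libpostal(components):
        for result in run_query(reading):
            candidate = replace(result, penalty=reading.penalty)
            current = best.get(candidate.place_id)
            # Keep the hit that came from the least-penalised interpretation.
            if current is None or candidate.penalty < current.penalty:
                best[candidate.place_id] = candidate
    ranked = sorted(best.values(), key=lambda r: (r.penalty, -r.importance))
    return ranked[:limit]
```

The point of the sketch is only the order of operations: interpretations first, one query per interpretation, then filtering and ranking over the combined results.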

lonvia avatar Aug 23 '18 21:08 lonvia

Pelias uses libpostal and it doesn't work right all the time. They are currently investigating how to bypass it in some cases.

https://github.com/pelias/pelias/issues/766

powerbilayeredmap avatar Jan 21 '19 21:01 powerbilayeredmap

https://info.crunchydata.com/blog/quick-and-dirty-address-matching-with-libpostal https://github.com/pramsey/pgsql-postal

Libpostal as a postgres extension.

arungowtham avatar May 16 '19 04:05 arungowtham

Libpostal integration would be a very good addition, hopefully it can make it to 5.0.0.

ghost avatar Oct 10 '22 12:10 ghost