Nominatim
Consider using libpostal
It may be worth evaluating whether an optional libpostal integration could improve search results. Talking to libpostal directly might be a bit too low-level, but we could use the same strategy as Pelias and call a REST wrapper: https://github.com/whosonfirst/go-whosonfirst-libpostal#wof-libpostal-server
This would probably solve issues like #759
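A minimal sketch of the REST wrapper idea, only to illustrate the shape of such an integration. The host and port, the /parse endpoint, the address parameter and the label/value JSON payload are all assumptions about the wrapper and should be checked against its README; the function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a libpostal HTTP wrapper listens here; host and port are placeholders.
LIBPOSTAL_URL = "http://localhost:8080"


def build_parse_url(address, base_url=LIBPOSTAL_URL):
    """Build the request URL (endpoint and parameter name are assumed)."""
    return f"{base_url}/parse?{urlencode({'address': address})}"


def components_from_response(body):
    """Turn an assumed [{"label": ..., "value": ...}] payload into (label, value) tuples."""
    return [(item["label"], item["value"]) for item in json.loads(body)]


def parse_address(address, base_url=LIBPOSTAL_URL, timeout=2.0):
    """Ask the wrapper to parse an address; return [] if it is unavailable."""
    try:
        with urlopen(build_parse_url(address, base_url), timeout=timeout) as resp:
            return components_from_response(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # The integration is meant to be optional, so the caller can
        # fall back to the normal query handling when this returns [].
        return []
```

Returning an empty list on failure keeps the integration optional: a missing or broken service does not break the search.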
I tried libpostal a few weeks ago with some German input combinations.
The input "[street] [house_number] [city]" was parsed correctly, which I found amazing. But "[city] [street] [house_number]" was parsed wrongly: [city] was detected as "poi-name".
I'd certainly expect bugs, but those should be fixed by the libpostal project rather than worked around in Nominatim.
https://github.com/openvenues/php-postal Maybe that would work too, like the Go service, since Nominatim is written in PHP? (I will try it in 2-3 weeks, maybe.)
https://pelias.io/index.html
Libpostal: Pelias uses the libpostal project for parsing addresses using the power of machine learning. Originally we loaded the 2GB of libpostal data directly in the API service, but this makes scaling harder and causes the API to take about 30 seconds to start, instead of a few milliseconds. We use a Go service built by the Who's on First team to make this happen quickly and efficiently.
https://github.com/openvenues/libpostal/issues/314
In general, customization/retraining is not a major goal of the project. The focus is on improving the common library for everyone instead of having lots of custom-trained models.
For me this is a big disadvantage if you want to use Nominatim with only certain countries.
https://github.com/openvenues/php-postal Maybe that would work too, like the Go service, since Nominatim is written in PHP? (I will try it in 2-3 weeks, maybe.)
True, but you would be limited to a libpostal installation on the same server.
Possible problems:
- you want to run both Nominatim and Pelias with libpostal -> 2x memory usage
- Slow (re)start of Nominatim
- a complex Docker container build if you want to use libpostal
In general, customization/retraining is not a major goal of the project. The focus is on improving the common library for everyone instead of having lots of custom-trained models.
For me this is a big disadvantage if you want to use Nominatim with only certain countries.
Apart from memory consumption, I don't see a problem here. It's better to rely on a general model that is properly tested than on an error-prone specialized one.
Neither "wof-libpostal-server" nor "a complex Docker" build is required, only 2 GB more RAM on the same server.
- https://github.com/openvenues/libpostal
- https://github.com/openvenues/php-postal
I had problems with pkg-config under CentOS 7:
pkg-config --cflags --libs libpostal
Setting an environment variable was necessary:
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
Then activate the extension in /etc/php.ini or /etc/php.d/postal.ini
and restart:
sudo systemctl restart httpd
Only the httpd start takes longer than usual; php-postal answers instantly.
Some tests with
Postal\Parser::parse_address( {string_input} )
#1132
Nordstraße 5, 27476 Cuxhaven
Nordstraße 3, 27476, Cuxhaven
Nordstraße 3 27476, Cuxhaven
Output:
1: road = nordstraße, house_number = 5, postcode = 27476, city = cuxhaven
2 and 3: road = nordstraße, house_number = 3, postcode = 27476, city = cuxhaven
Could this be used for structured search? https://nominatim.openstreetmap.org/search.php?&street=Nordstra%C3%9Fe+3&city=Cuxhaven&postalcode=27476
But I don't know yet how it could solve issues like #759.
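As a rough sketch of the structured search idea: the parsed components from the test above could be mapped onto Nominatim's structured parameters like this. The label-to-parameter mapping and the function name are my own assumptions for illustration, not something libpostal or Nominatim provides.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search.php"

# Assumed mapping: libpostal label -> Nominatim structured search parameter
LABEL_TO_PARAM = {
    "city": "city",
    "postcode": "postalcode",
    "state": "state",
    "country": "country",
}


def structured_search_url(components, base_url=NOMINATIM_SEARCH):
    """Build a structured search URL from a dict of libpostal components."""
    params = {}
    # Street name followed by house number, as in the example URL above.
    street = " ".join(
        components[key] for key in ("road", "house_number") if key in components
    )
    if street:
        params["street"] = street
    for label, param in LABEL_TO_PARAM.items():
        if label in components:
            params[param] = components[label]
    return f"{base_url}?{urlencode(params)}"


parsed = {"road": "nordstraße", "house_number": "3",
          "postcode": "27476", "city": "cuxhaven"}
print(structured_search_url(parsed))
```

Labels without a mapping are simply dropped here; a real integration would have to decide what to do with them.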
The libpostal PHP library doesn't scale well because each FPM worker / Apache mod_php instance will load all the libpostal data, so it's 2 GB per worker. Servers typically have tens to hundreds of workers.
Instead you should look at an HTTP call to the golang libpostal worker.
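To put numbers on that, here is a back-of-the-envelope sketch. It assumes the 2 GB figure from above and that each worker really holds its own copy; the function name is made up for illustration.

```python
def libpostal_memory_gb(workers, data_gb=2.0, shared=False):
    """Estimate the RAM needed for libpostal data.

    shared=False: every worker loads its own copy (in-process extension).
    shared=True: one separate service holds a single copy for all workers.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return data_gb if shared else workers * data_gb


for workers in (10, 100):
    print(workers, libpostal_memory_gb(workers), libpostal_memory_gb(workers, shared=True))
```

Under these assumptions 10 workers need 20 GB and 100 workers need 200 GB, against 2 GB for one shared service.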
However, structured search in Nominatim is still experimental and in most cases ?q= fares better so libpostal's value addition is limited.
Could this be used for structured search?
That was my intention.
But I don't know yet how it could solve issues like #759.
Maybe not with spelling issues, but it could help with certain omissions, abbreviations and additions that Nominatim itself handles poorly or not at all.
These two articles explain the benefits of libpostal quite well: https://machinelearnings.co/statistical-nlp-on-openstreetmap-b9d573e6cc86 https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718
The libpostal PHP library doesn't scale well because each FPM worker / Apache mod_php instance will load all the libpostal data, so it's 2 GB per worker. Servers typically have tens to hundreds of workers.
OK, thanks. On my CentOS 7 VM with 16 GB RAM:
free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        3,0G        1,8G        2,1G         10G         10G
Swap:          127M        127M          8K
and
htop
(idle)
I think this is no problem for us; we sometimes have maybe 10 users at the same time. See the load tests below.
SoapUI load test of php-postal / httpd: 100 requests per second (screenshots below)
Also no problems at 5000 requests per second:
- average response time = 50 ms (+30 ms)
- more httpd processes (workers?)
- CPU at 10-50 % load
- same RAM usage, 5G / 15G; I don't know why there is no difference with more workers / requests
It is designed as a shared PHP extension. Maybe it's only loaded once on apache startup? You should check if and how much it affects apache service startup time.
Yes, the startup time is longer; see my post from yesterday:
.... and restart
sudo systemctl restart httpd
Only the httpd start takes longer than usual; php-postal answers instantly.
That would be acceptable IMHO. @lonvia, what do you think about an optional libpostal integration via php-postal (https://github.com/openvenues/php-postal)?
We will try libpostal with Nominatim (German OSM data only), and I can post our experience later.
gopi-ar
However, structured search in Nominatim is still experimental and in most cases ?q= fares better so libpostal's value addition is limited.
https://wiki.openstreetmap.org/wiki/Nominatim
(Commas are optional, but improve performance by reducing the complexity of the search.)
street= [housenumber] [streetname] city=[city] county=[county] state=[state] country=[country] postalcode=[postalcode]
https://nominatim.openstreetmap.org/search.php?&street=3%20Nordstra%C3%9Fe&city=Cuxhaven&postalcode=27476&country=Deutschland
Input = "3 Nordstraße, Cuxhaven, 27476, Deutschland"
The structured search parameters initialize the "q" parameter to "[housenumber] [streetname], [city], [postalcode], [country]".
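The composition described above can be sketched as follows. This only illustrates the observed pattern and is not Nominatim's actual implementation; the function name and its parameters are hypothetical.

```python
def free_form_query(housenumber="", streetname="", city="", postalcode="", country=""):
    """Join structured parts into "[housenumber] [streetname], [city], [postalcode], [country]"."""
    # House number and street name share one comma-separated segment.
    street = " ".join(part for part in (housenumber, streetname) if part)
    # Empty parts are skipped so no dangling commas appear.
    return ", ".join(part for part in (street, city, postalcode, country) if part)


print(free_form_query("3", "Nordstraße", "Cuxhaven", "27476", "Deutschland"))
# prints: 3 Nordstraße, Cuxhaven, 27476, Deutschland
```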
If you want to use libpostal with Nominatim in this way, you should replace the entire mechanism that creates interpretations of the search query. That means creating one or more SearchDescription objects from the libpostal output, calling query() on each of them, and then filtering and ranking the results appropriately.
Pelias uses libpostal, and it doesn't always work correctly. They are currently investigating how to bypass it in some cases.
https://github.com/pelias/pelias/issues/766
https://info.crunchydata.com/blog/quick-and-dirty-address-matching-with-libpostal https://github.com/pramsey/pgsql-postal
libpostal as a PostgreSQL extension.
Libpostal integration would be a very good addition, hopefully it can make it to 5.0.0.