libpostal

How to train our own address

Open pasupulaphani opened this issue 8 years ago • 9 comments

Hi all,

Sorry to raise this as an issue. We are planning to train on our own addresses (in OSM format). Is there any documentation on how to run osm_address_training_data.py?

Any information would be useful.

Thank you.

pasupulaphani avatar Sep 22 '16 20:09 pasupulaphani

See: https://github.com/openvenues/libpostal/issues/100#issuecomment-238887583.

osm_address_training_data.py is only one piece of the preprocessing pipeline. Building the training set currently takes more than 32 GB of RAM, more than 100 GB of disk space, and about 6 days of compute time, and will take more with the upcoming addition of OpenAddresses. From there, training the actual model is fast: it takes less than a day to do 5 iterations on 100M examples. I'm not going to document the version in master, though, as it's already obsolete. When parser-data is ready to merge, and the process is automated and wrapped up into a script, it will be better documented, though I'm still not encouraging people to train models on proprietary data (and will only provide support for the standard models, as I really have no way to diagnose what went wrong otherwise).

There is an early version of the new model that's backward-compatible with the code in master (far better at place names, etc. and handles unit phrases like "Flat 25" in English and Spanish). This can be found at: https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz, if you want to try it out. To use that (doesn't require switching branches or anything, it's the same model in master trained on new data), just unpack the contents of the tarball into $DATA_DIR/libpostal/address_parser where $DATA_DIR is whatever you passed in during configure, default is /usr/local/share I believe.
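A minimal sketch of the unpacking step described above, in Python. The function name `install_parser_model` is hypothetical, and the sketch assumes the tarball has already been downloaded and that its members are meant to land directly inside the `address_parser` directory; check the archive's layout before relying on it.

```python
import os
import tarfile


def install_parser_model(tarball_path, data_dir="/usr/local/share"):
    """Unpack a downloaded parser tarball into libpostal's address_parser directory.

    data_dir should match the value passed in during configure; the default
    here mirrors the "/usr/local/share" guess from the comment above.
    """
    target = os.path.join(data_dir, "libpostal", "address_parser")
    os.makedirs(target, exist_ok=True)
    with tarfile.open(tarball_path, "r:gz") as archive:
        # Assumption: archive members belong directly under the target directory.
        archive.extractall(target)
    return target
```

Calling `install_parser_model("parser_full.tar.gz", data_dir="/usr/local/share")` would then mirror the manual step, provided the process has write access to the data directory.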

albarrentine avatar Sep 22 '16 21:09 albarrentine

Thank you for the info.

pasupulaphani avatar Sep 23 '16 01:09 pasupulaphani

@thatdatabaseguy thank you for the new parser. Just tried it and it is looking very promising.

The venue and address information we were trying to train on is not proprietary. We bundled it into OSM format to use for training libpostal. However, I can see that it is not that simple to train it ourselves. Any alternatives would be appreciated.

pasupulaphani avatar Sep 23 '16 02:09 pasupulaphani

Ah, are they currently published, or able to be published, somewhere? If so, and especially if they're already in OSM format, it might be simple to add them to the model.

albarrentine avatar Sep 23 '16 03:09 albarrentine

We haven't published it yet, but we should be able to. I was looking at http://wiki.openstreetmap.org/wiki/Beginners_Guide_1.4.2, but there doesn't seem to be an easy way to do bulk updates directly.

pasupulaphani avatar Sep 26 '16 09:09 pasupulaphani

@pasupulaphani it doesn't have to be through OSM. If you publish the file temporarily to somewhere like S3 (you have to set --acl=public-read), I can copy it over to libpostal's S3 bucket and have a look at it for the current/next batch of training data. We can always use more venues.
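For anyone scripting the upload rather than using a command-line tool, here is a rough Python sketch of the same idea. The helper name `publish_public` is hypothetical, and the bucket, key, and file names are placeholders. It assumes an S3 client object shaped like the one returned by `boto3.client("s3")`, whose `upload_file` accepts `ExtraArgs={"ACL": "public-read"}`, and it assumes the bucket permits public ACLs.

```python
from urllib.parse import quote


def publish_public(s3_client, local_path, bucket, key):
    """Upload local_path to s3://bucket/key with a public-read ACL and return its URL.

    s3_client is expected to behave like boto3.client("s3").
    """
    # ExtraArgs={"ACL": "public-read"} plays the role of --acl=public-read.
    s3_client.upload_file(local_path, bucket, key, ExtraArgs={"ACL": "public-read"})
    # Virtual-hosted-style URL; the exact host can differ by region or bucket name.
    return "https://{}.s3.amazonaws.com/{}".format(bucket, quote(key))
```

With boto3 installed and credentials configured, this could be called as `publish_public(boto3.client("s3"), "venues.osm", "example-bucket", "uploads/venues.osm")`, and the returned URL shared in the thread.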

albarrentine avatar Jan 18 '17 22:01 albarrentine

@pasupulaphani libpostal 1.0 is now merged into master, which is a massive improvement both on the original and the intermediate sample. No need to download special models, just pull latest and run make.

Let me know if you still have some venues you'd like to add.

albarrentine avatar Apr 07 '17 01:04 albarrentine

For Chinese addresses that are precise down to the room, libpostal's accuracy seems to be poor.

xwhsky avatar Apr 15 '19 12:04 xwhsky

Did the pipeline script ever get published or the process documented?

brianmacy avatar Jun 20 '20 18:06 brianmacy