libpostal
How to train our own address
Hi all,
Sorry to raise this as an issue. We are planning to train on our own addresses (in OSM format). Is there any documentation on how to run osm_address_training_data.py?
Any information would be useful.
Thank you.
See: https://github.com/openvenues/libpostal/issues/100#issuecomment-238887583.
osm_address_training_data.py is only one piece of the preprocessing pipeline. Building the training set currently takes > 32G of RAM, > 100G of disk space and ~6 days of compute time, and will take more with the upcoming addition of OpenAddresses. From there, training the actual model is fast: it takes less than a day to do 5 iterations on 100M examples. I'm not going to document the version in master, though, as it's already obsolete. When parser-data is ready to merge, and the process is automated and wrapped up into a script, it will be better documented. I'm still not encouraging people to train models on proprietary data, and will only provide support for the standard models, as I really have no way to diagnose what went wrong otherwise.
There is an early version of the new model that's backward-compatible with the code in master (far better at place names, etc. and handles unit phrases like "Flat 25" in English and Spanish). This can be found at: https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz, if you want to try it out. To use that (doesn't require switching branches or anything, it's the same model in master trained on new data), just unpack the contents of the tarball into $DATA_DIR/libpostal/address_parser where $DATA_DIR is whatever you passed in during configure, default is /usr/local/share I believe.
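For anyone who prefers to script the unpacking step, here is a minimal Python sketch. It assumes the tarball has already been downloaded locally and that its contents belong directly under the address_parser directory, as described above. The function name `install_parser_model` and the `data_dir` default are illustrative, not part of libpostal.

```python
import os
import tarfile


def install_parser_model(tarball_path, data_dir="/usr/local/share"):
    """Unpack a parser model tarball into <data_dir>/libpostal/address_parser.

    data_dir should match whatever was passed to configure; the default
    used here is only an assumption.
    """
    target = os.path.join(data_dir, "libpostal", "address_parser")
    os.makedirs(target, exist_ok=True)
    # Extract every member of the gzip-compressed tarball into the target directory.
    with tarfile.open(tarball_path, "r:gz") as archive:
        archive.extractall(target)
    return target
```

Calling `install_parser_model("parser_full.tar.gz", data_dir="/usr/local/share")` would return the directory the files were written to. Writing under /usr/local/share usually needs elevated permissions.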
Thank you for the info.
@thatdatabaseguy thank you for the new parser. Just tried it and it is looking very promising.
The venue and address information we were trying to train on is not proprietary. We bundled it into OSM format to use it for training libpostal. However, I can see that it is not that simple to train ourselves. Any alternatives are appreciated.
Ah, are they currently published, or able to be published, somewhere? If so, and especially if they're already in OSM format, it might be simple to add them to the model.
We haven't published them yet, but we should be able to. I was looking at http://wiki.openstreetmap.org/wiki/Beginners_Guide_1.4.2 but it doesn't seem easy to do bulk updates directly.
@pasupulaphani it doesn't have to be through OSM. If you publish the file temporarily to somewhere like S3 (have to set --acl=public-read), I can copy it over to libpostal's S3 bucket and have a look at them for the current/next batch of training data. We can always use more venues.
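As a rough illustration of the upload step, the sketch below builds an AWS CLI command that copies a file to S3 with a public-read ACL. It assumes the AWS CLI is installed and configured, where the flag is written `--acl public-read`. The helper name, bucket, and key are hypothetical placeholders.

```python
import shlex


def public_upload_command(local_path, bucket, key):
    """Build an `aws s3 cp` command that uploads a file with a public-read ACL.

    The bucket and key are supplied by the caller; nothing is uploaded here.
    """
    return [
        "aws", "s3", "cp",
        local_path,
        f"s3://{bucket}/{key}",
        "--acl", "public-read",
    ]


if __name__ == "__main__":
    # Hypothetical file, bucket, and key, shown only to illustrate the command shape.
    cmd = public_upload_command("venues.osm", "my-temp-bucket", "venues.osm")
    print(shlex.join(cmd))
```

The command list can be passed to `subprocess.run(cmd, check=True)` once the bucket and credentials are in place.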
@pasupulaphani libpostal 1.0 is now merged into master, which is a massive improvement both on the original and the intermediate sample. No need to download special models, just pull latest and run make.
Let me know if you still have some venues you'd like to add.
For Chinese addresses that are specified down to the room level, libpostal seems to perform badly.
Did the pipeline script ever get published or the process documented?