Aaron Straup Cope
That _should_ be accounted for in `go-whosonfirst-export` but it's possible there are outstanding bugs or edge-cases:

* https://github.com/whosonfirst/go-whosonfirst-export/blob/0d4f48214076d8c4d2cf1e9a21dc2c042a781d22/export.go#L46
* https://github.com/whosonfirst/go-whosonfirst-export/blob/0d4f48214076d8c4d2cf1e9a21dc2c042a781d22/export_test.go#L377
For dry-run testing if nothing else:

- https://github.com/rainycape/unidecode
- https://pypi.python.org/pypi/Unidecode/
The (Go) unidecode package seems to do okay, until it doesn't...

```
85667819,Gjirokastër,Gjirokaster
85667821,Dibër,Diber
85667829,Lezhë,Lezhe
85667831,Durrës,Durres
85667835,Finström,Finstrom
85667849,Eckerö,Eckero
85667857,Föglö,Foglo
85667783,Durrës,Durres
85667793,Shkodër,Shkoder
85667797,Kukës,Kukes
85667945,Sant Julià de Lòria,Sant Julia De Loria
85667885,Vårdö,Vardo...
```
For example: https://en.wikipedia.org/wiki/Kalbajar is not `K@Lb@C@R` ...
Also-er: http://search.cpan.org/~sburke/Text-Unidecode/lib/Text/Unidecode.pm Because do you really trust Python to do a better job of Unicode than Perl... ?
```
$> wc -l go-unidecode-20160108.csv
55151 go-unidecode-20160108.csv
```

Which is to say: There are 55K records whose `wof:name` doesn't equal the unidecode-ed version. Assuming that it's possible to safely assume...
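For reference, a minimal sketch of how a mismatch list like that could be produced. This is not the code that generated the CSV above. It assumes input rows of `id,name`, and `Transliterator`, `Mismatch`, `FindMismatches`, `findInString` and `toyTransliterate` are all hypothetical names made up for illustration. The toy function only knows three letters; in practice you would plug in a real transliterator, for example the Go unidecode package, assuming it exposes something shaped like `func(string) string`.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// Transliterator is a hypothetical stand-in for whatever does the real
// work (a unidecode-style function, for example).
type Transliterator func(string) string

// Mismatch is one record whose name changed under transliteration.
type Mismatch struct {
	ID, Name, ASCII string
}

// FindMismatches reads "id,name" rows from r and returns only the rows
// whose name differs from its transliterated form.
func FindMismatches(r io.Reader, tr Transliterator) ([]Mismatch, error) {
	cr := csv.NewReader(r)
	cr.FieldsPerRecord = 2 // reject rows that are not exactly id,name

	var out []Mismatch
	for {
		row, err := cr.Read()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		if ascii := tr(row[1]); ascii != row[1] {
			out = append(out, Mismatch{row[0], row[1], ascii})
		}
	}
}

// findInString is a convenience wrapper for small in-memory samples.
func findInString(data string, tr Transliterator) ([]Mismatch, error) {
	return FindMismatches(strings.NewReader(data), tr)
}

// toyTransliterate is illustrative only: it handles three letters and
// nothing else. It is NOT a substitute for a real transliteration table.
func toyTransliterate(s string) string {
	return strings.NewReplacer("ë", "e", "ö", "o", "å", "a").Replace(s)
}

func main() {
	// The first two rows come from the listing earlier in this thread;
	// the last row (ID 0) is made up to show a name that passes through.
	sample := "85667819,Gjirokastër\n85667849,Eckerö\n0,Plain\n"

	mismatches, err := findInString(sample, toyTransliterate)
	if err != nil {
		panic(err)
	}
	for _, m := range mismatches {
		fmt.Printf("%s,%s,%s\n", m.ID, m.Name, m.ASCII)
	}
	fmt.Printf("%d mismatched record(s)\n", len(mismatches))
}
```

Passing the transliterator in as a function keeps the comparison logic separate from the choice of library, which matters here since the whole question is which library (if any) to trust.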
See also: https://github.com/mapzen/vector-datasource/issues/418
Of possible use, there is a truncated and un-concordified version of the GeoNames `allCountries.txt` file with `asciiname` values:

http://whosonfirst.mapzen.com.s3.amazonaws.com/misc/geonames-20160219.csv.bz2

See also: https://github.com/whosonfirst/go-whosonfirst-csvdb
Notes to self: I've had a conversation with Al Barrentine to see whether the transliteration code in libpostal can be exposed as a discrete piece of functionality. It can't be...
Basically use this (to determine what can be assumed to be Latin-1 or ASCII): https://github.com/openvenues/libpostal/blob/master/resources/language/countries/country_language.tsv
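As a complement to the per-country table, here is a rough sketch of a per-name check. It does not read `country_language.tsv` at all; it only looks at the code points in a single name and reports whether they all fit in ASCII, in Latin-1, or neither. `Classify` and its return values are hypothetical names, and the check assumes precomposed characters (a base letter followed by a combining mark would be reported as `other`).

```go
package main

import (
	"fmt"
	"unicode"
)

// Classify returns the narrowest of "ascii", "latin1" or "other" that
// covers every rune in s. It assumes s uses precomposed characters;
// combining marks fall outside Latin-1 and yield "other".
func Classify(s string) string {
	class := "ascii"
	for _, r := range s {
		switch {
		case r > unicode.MaxLatin1:
			return "other" // one rune outside Latin-1 settles it
		case r > unicode.MaxASCII:
			class = "latin1"
		}
	}
	return class
}

func main() {
	// Durrës appears in the listing earlier in this thread; Kəlbəcər is
	// the Azerbaijani spelling of the Kalbajar example.
	for _, name := range []string{"Durres", "Durrës", "Kəlbəcər"} {
		fmt.Printf("%s\t%s\n", name, Classify(name))
	}
}
```

A check like this says nothing about whether a transliteration is *good*, only whether a name stays inside the range where simple accent-stripping is at least plausible.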