ncov
Fill in more lat/long data from the OpenStreetMap name-to-location API
Increase coverage for default location to latlong mapping
After running all GISAID samples through ncov's location translation pipeline, our system tries to associate a lat/long with each location in the resulting list. The original lat_longs.tsv file maps approximately 50% of locations to a lat/long.
Hoping to increase coverage, I wrote a script to fetch a lat/long from the OpenStreetMap name-to-location search API and this is the resulting file. It gets us up to approximately 75% of locations with a valid lat/long.
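For context, a simplified sketch of the kind of lookup involved, using Nominatim's structured search endpoint (the field mapping of ncov's country/division/location onto OSM's country/state/city is an assumption here, and error handling and rate limiting are omitted):

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

def build_params(country, division, location):
    # Assumed field mapping: ncov "division" -> OSM "state",
    # ncov "location" -> OSM "city".
    return {
        "country": country,
        "state": division,
        "city": location,
        "format": "jsonv2",
        "limit": "1",
    }

def fetch_latlong(country, division, location):
    """Return (lat, long) for the first Nominatim hit, or None."""
    url = NOMINATIM_URL + "?" + urllib.parse.urlencode(
        build_params(country, division, location))
    # A descriptive User-Agent is required by the OSM usage policy.
    req = urllib.request.Request(
        url, headers={"User-Agent": "ncov-latlong-backfill"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        hits = json.load(resp)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])
```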
These changes are entirely additive; no existing rows have been removed or modified, so they shouldn't introduce any backwards incompatibilities.
Related issue(s)
Fixes # Related to #
Testing
What steps should be taken to test the changes you've proposed? If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?
Cool! How did you handle issues with different names in GISAID sequences? We'd been attempting for a while to standardize locations, which tend to be quite messy. Just as a couple of examples from https://github.com/nextstrain/ncov-ingest/blob/master/source-data/gisaid_annotations.tsv:
Germany/NW-HHU-1083/2021 EPI_ISL_1346721 location Duesseldorf # previously (Dusseldorf Health department)
Germany/NW-HHU-3865/2021 EPI_ISL_1990663 location Duesseldorf # previously (Düsseldorf Health department (interpreted as patient residence))
USA/ID-BVAMC-740558/2021 EPI_ISL_3156490 location Ada County # additional_location_info: Ada
USA/ID-BVAMC-740611/2021 EPI_ISL_3156499 location Bonneville County # additional_location_info: Bonneville
USA/IL-C21WGS0591/2021 EPI_ISL_1965028 location Kenosha County # previously (Kenosha County)
Lots of issues with US counties sometimes having "County" and sometimes not.
However, we stopped trying to standardize, as there was just too much human labor involved relative to the payoff.
Here, do you end up with a bunch of the same lat/longs for the slightly different spellings of the same location?
I've updated this PR against the latest locations db - the first one was based on an older file, sorry.
Cool! How did you handle issues with different names in GISAID sequences? We'd been attempting for a while to standardize locations, which tend to be quite messy.
Honestly, I was unaware of the additional_location_info metadata until you mentioned it. The additional locations in this PR are currently just a brute-force mapping of country/division/location into the OSM country/state/city search fields, recording whether we have a match.
Lots of issues with US counties sometimes having "County" and sometimes not.
Yes, and that's not even half of it - many locations in the post-ncov-filtered db have non-standard formats like `Prague 1` or `Butler County AL`. I have a much more aggressive version of the import script that just keeps dropping words off the end of the location field.
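Roughly, that fallback amounts to something like this (a sketch, not the exact script):

```python
def candidate_queries(location):
    """Yield the raw location string, then progressively drop
    trailing words until a single word remains, e.g.
    'Butler County AL' -> 'Butler County' -> 'Butler'."""
    words = location.split()
    for n in range(len(words), 0, -1):
        yield " ".join(words[:n])

# Each candidate would be tried against the OSM search in turn,
# stopping at the first one that returns a hit.
```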
I'm not sure what's preferred here - I can update the PR with the output of the more aggressive inclusion script, update it to use the `additional_location_info` data, or really anything else.
However, we stopped trying to standardize, as there was just too much human labor involved relative to the payoff.
Here, do you end up with a bunch of the same lat/longs for the slightly different spellings of the same location?
That happens somewhat, but it doesn't tend to be a major problem, since the OSM search tool is fairly fussy and doesn't fix/hide/handle misspellings.
The much bigger problem here is actually the format of this TSV file. I know there's a fair amount of work put into the location translation scripts to try to make location names globally unique, but they don't wind up being that unique in the end, so we'll have multiple rows in our DB with the same location name but different countries (e.g. `USA/Mississippi/Union` vs. `Argentina/San Luis/Union`). So if we're importing locations from this TSV file, we have to decide whether representing `location\tUnion` as the Mississippi one or the Argentina one makes sense for us. I think we'd be better served by including country, division, and location on every line in this file.