ncov
Fill in more lat/long data from the OpenStreetMap name-to-location API
Increase coverage for default location to latlong mapping
After running all GISAID samples through ncov's location translation pipeline, our system tries to associate a lat/long with each location in the resulting list. The original lat_longs.tsv file maps approximately 50% of locations to a lat/long.
Hoping to increase coverage, I wrote a script to fetch a lat/long from the OpenStreetMap name-to-location search API and this is the resulting file. It gets us up to approximately 75% of locations with a valid lat/long.
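For context, a simplified sketch of the kind of lookup involved, using Nominatim's structured search endpoint (the field mapping of ncov's country/division/location onto OSM's country/state/city is an assumption here, and error handling and rate limiting are omitted):

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

def build_params(country, division, location):
    # Assumed field mapping: ncov "division" -> OSM "state",
    # ncov "location" -> OSM "city".
    return {
        "country": country,
        "state": division,
        "city": location,
        "format": "jsonv2",
        "limit": "1",
    }

def fetch_latlong(country, division, location):
    """Return (lat, long) for the first Nominatim hit, or None."""
    url = NOMINATIM_URL + "?" + urllib.parse.urlencode(
        build_params(country, division, location))
    # A descriptive User-Agent is required by the OSM usage policy.
    req = urllib.request.Request(
        url, headers={"User-Agent": "ncov-latlong-backfill"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        hits = json.load(resp)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])
```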
These changes are entirely additive; no existing rows have been removed or modified, so they shouldn't introduce any backwards incompatibilities.
Related issue(s)
Fixes # Related to #
Testing
What steps should be taken to test the changes you've proposed? If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?
Cool! How did you handle issues with different names in GISAID sequences? We'd been attempting for a while to standardize locations, which tend to be quite messy. Just as a couple of examples from https://github.com/nextstrain/ncov-ingest/blob/master/source-data/gisaid_annotations.tsv:
Germany/NW-HHU-1083/2021 EPI_ISL_1346721 location Duesseldorf # previously (Dusseldorf Health department)
Germany/NW-HHU-3865/2021 EPI_ISL_1990663 location Duesseldorf # previously (Düsseldorf Health department (interpreted as patient residence))
USA/ID-BVAMC-740558/2021 EPI_ISL_3156490 location Ada County # additional_location_info: Ada
USA/ID-BVAMC-740611/2021 EPI_ISL_3156499 location Bonneville County # additional_location_info: Bonneville
USA/IL-C21WGS0591/2021 EPI_ISL_1965028 location Kenosha County # previously (Kenosha County)
Lots of issues with US counties sometimes having "County" and sometimes not.
However, we stopped trying to standardize, as there was just too much human labor involved relative to the payoff.
Here, do you end up with a bunch of the same lat/longs for the slightly different spellings of the same location?
I've updated this PR against the latest locations db - the first one was based on an older file, sorry.
Cool! How did you handle issues with different names in GISAID sequences? We'd been attempting for a while to standardize locations, which tend to be quite messy.
Honestly, I was unaware of the additional_location_info metadata until you mentioned it. The additional locations in this PR are currently just a brute-force mapping of country/division/location into the OSM country/state/city search fields, recording whether we have a match.
Lots of issues with US counties sometimes having "County" and sometimes not.
Yes, and that's not even half of it - many locations in the post-ncov-filtered db have non-standard formats like `Prague 1` or `Butler County AL`. I have a much more aggressive version of the import script that just keeps dropping words off the end of the location field.
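Roughly, that fallback amounts to something like this (a sketch, not the exact script):

```python
def candidate_queries(location):
    """Yield the raw location string, then progressively drop
    trailing words until a single word remains, e.g.
    'Butler County AL' -> 'Butler County' -> 'Butler'."""
    words = location.split()
    for n in range(len(words), 0, -1):
        yield " ".join(words[:n])

# Each candidate would be tried against the OSM search in turn,
# stopping at the first one that returns a hit.
```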
I'm not sure what's preferred here - I can update the PR with the output of the more aggressive inclusion script, update it to use the `additional_location_info` data, or really anything else.
However, we stopped trying to standardize, as there was just too much human labor involved relative to the payoff.
Here, do you end up with a bunch of the same lat/longs for the slightly different spellings of the same location?
That happens somewhat, but it doesn't tend to be a major problem, since the OSM search tool is fairly fussy and doesn't fix/hide/handle misspellings.
The much bigger problem here is actually the format of this TSV file. I know there's a fair amount of work put into the location translation scripts to try to make location names globally unique, but they don't wind up being that unique in the end, so we'll have multiple rows in our DB with the same location name but different countries (e.g. `USA/Mississippi/Union` vs. `Argentina/San Luis/Union`). So if we're importing locations from this TSV file, we have to decide whether representing `location\tUnion` as the Mississippi one or the Argentina one makes sense for us. I think we'd be better served by including country, division, and location on every line in this file.