pudl
Incorrect county FIPS code for Bedford, VA
Describe the bug
The `addfips` package is labeling Bedford, VA as '51515', which is the FIPS code for Bedford City. It should actually be '51019' (Bedford County). See their list of FIPS codes.
Bug Severity
How badly is this bug affecting you? Medium: I was able to identify and fix the bug in my own workflow but it might affect other people.
To Reproduce
I found the error in the `core_eia861__yearly_service_territory` table. Census population files do not have the FIPS code 51515.
https://github.com/fitnr/addfips/issues/8
It looks like this particular issue was flagged by @TrentonBush almost a year ago and addressed in the underlying package, but never released. So perhaps we just need to bug them to cut a new release.
Update just got pushed! Should be a simple matter of updating dependencies, I'll throw this issue into this sprint.
As far as I can tell, we're still waiting on the maintainer to merge their fix commit, which apparently didn't make it into the release. I'll bump them again.
I guess we could also pin to their `fix/8` branch, but we'll see if they respond to my nag first.
`addfips` exists to do one job and it fails to do it. Considering the whole package is like 300 lines and the maintainer doesn't maintain it, I think we should replace it. One option is to simply vendor it; another would be to replace it with something like Google's geocoder, which is much more powerful. I have used the Google geocoder in a client project for years with good results.
By Google's geocoder, you mean https://geocoder.readthedocs.io/index.html? Just poking around, it seems like you'd need a TAMU key to pull FIPS codes out of county names. But it also seems like there are some federal APIs we could hit to get the FIPS codes?
I meant Google Maps Platform's Geocoding API. IMO the primary advantages are that:
- they already implemented fuzzy matching (good for manually entered data with misspellings)
- it can handle any granularity from street address or lat/lon up to country name.
The disadvantages I am aware of are:
- I don't think you can select a historical map to reference
- it will update the reference maps on its schedule, not yours
- if you're running it on every automated build, you'll need to make a caching layer or suffer network latency and per-call costs.
I use a cache layer and my usage always fits in the (generous) free tier. Occasionally cache invalidation issues cause minor annoyance, but it is easy to fix with a refresh.
Ah sweet! What do you do for a caching layer?
I also just spent a few minutes poking around at the documentation and couldn't see where the FIPS code would get returned - unless that gets returned as the `short_name` of an `administrative_area_level_2` address component. Has that been your experience?
Ah ya I use this as a cleaning/standardization function to convert dirty inputs to the official county names. Then you can do a simple join against the official Census data to get FIPS codes. But you need both!
Also I now realize the work I was referencing is actually public, so I'll just link to it. Sorry in advance for the data scientist quality code 😇
- geocoder API calls and response parsing, with memory cache at the row level.
- disk cache (via joblib) applied at the dataframe level. The main API function is the next one down.
The row-level memory cache saves duplicate API calls per session (eg looking up the same county 1000 times), and the dataframe-level disk cache saves duplicate calls between runs (when a source dataset is unchanged).
I didn't automate the cache invalidation; I just do it manually because updates are infrequent. But the free tier resets each month, so a monthly clear could make sense.