python-us
python-us copied to clipboard
Easier way to apply to DataFrames
In mapping a pandas DataFrame's numeric fips index to state name, I currently have to do this:
state['state'] = pd.Series(state.index).apply(
lambda x: us.states.lookup(str(x).zfill(2)).name).tolist()
There are a couple things going on here that could be other issues*, but might also be nice to have the vectorization built-in for pandas. e.g. I'd like to be able to just do:
state['state'] = us.states.lookup(state.index)
* This one was verbose because things like us.states.lookup(1)
and us.states.lookup('1')
fail.
Thanks for the input, Max! Can you elaborate a bit on what the content of what your state.index
column looks like? State FIPS codes must be strings and must be zero-padded to a length of two (federal government examples: eg, eg), similar to ZIP codes, which is why you're getting those lookup failures.
state.index
was an int, but abstracting out the zero-padded string piece, passing that also requires using a map
or apply
statement, e.g. if fips_char
is a zero-padded string pandas Series
, getting a new Series
with the state would require:
state.fips_char.map(us.states.lookup)
It'd be cool if it could instead be:
us.states.lookup(state.fips_char)
That is, if a Series
(or numpy
array) is passed, do the map
/apply
as needed and return an object of the same input size/shape.
Separately, since CSVs from sources like IPUMS provide fips codes as ints, saving that step could be useful, but I could see it also making it more error-prone, especially for ZIPs, and it sounds like you've thought about that and made a design decision to require preliminary padding.
This would be great to have—any updates on the status?
Hey there! Great library, I've been using it in several projects at work. This issue caught my eye since we use Pandas a lot in my projects too, so I thought I'd weigh in.
Since this library is pretty lightweight (the only dependency I see is on jellyfish
), I think it would be better to have guidance on how to quickly create a DataFrame from it:
import pandas as pd
from us.states import STATES
us_df = pd.DataFrame([vars(st) for st in STATES])
From there, I assume you wouldn't be doing a 'lookup' unless you had another dataset you were trying to merge with, but if that's the case then you would use something like df.merge
now that you have this library's records as a DataFrame.
If it so happens to be that your dataset has numeric FIPS codes (this would be 'dirty' data though, since as @mileswwatkins stated, FIPS codes are officially zero-padded strings), you can either:
- convert the padded-string FIPS codes to numeric prior to merging:
us_df['fips'] = pd.to_numeric(us_df['fips'])
or 2) convert your data to padded-string FIPS codes prior to merging:
my_df['fips'] = my_df['fips'].str.pad(width=2, fillchar='0')
Alternatively, if you know you just need to convert from one representation to another and don't need all of the information from this library as a DataFrame, rather than doing a lookup one at a time, you can use this library's mapping
function to quickly convert from one representation to another:
from us.states import mapping
my_df["abbr"] = my_df["fips"].map(mapping('fips', 'abbr'))
In my opinion: since the zero-padded version is the official designation, that's what this library should maintain, and it shouldn't make assumptions about what libraries downstream consumers are using (consider if someone else were using PySpark or Dask instead), but documentation/examples are always good to have (perhaps in a 'Recipes' guide)
Hope this helps!
I've also had this problem and converted from int to a string with leading zeros as a workaround.
The issue with this, however, is that I caught the fact that some of the lookups didn't return any values only later on. There was no error, passing
person["statefip"].astype(str).apply(lambda x: us.states.lookup(x))
in this case returned a series with 44 unique states values, instead of 51, with no warning/error about the failed lookups