python-us icon indicating copy to clipboard operation
python-us copied to clipboard

Easier way to apply to DataFrames

Open MaxGhenis opened this issue 5 years ago • 5 comments

In mapping a pandas DataFrame's numeric fips index to state name, I currently have to do this:

state['state'] = pd.Series(state.index).apply(
    lambda x: us.states.lookup(str(x).zfill(2)).name).tolist()

There are a couple things going on here that could be other issues*, but might also be nice to have the vectorization built-in for pandas. e.g. I'd like to be able to just do: state['state'] = us.states.lookup(state.index)

* This one was verbose because things like us.states.lookup(1) and us.states.lookup('1') fail.

MaxGhenis avatar Apr 28 '19 06:04 MaxGhenis

Thanks for the input, Max! Can you elaborate a bit on what the content of what your state.index column looks like? State FIPS codes must be strings and must be zero-padded to a length of two (federal government examples: eg, eg), similar to ZIP codes, which is why you're getting those lookup failures.

mileswwatkins avatar May 01 '19 20:05 mileswwatkins

state.index was an int, but abstracting out the zero-padded string piece, passing that also requires using a map or apply statement, e.g. if fips_char is a zero-padded string pandas Series, getting a new Series with the state would require:

state.fips_char.map(us.states.lookup)

It'd be cool if it could instead be:

us.states.lookup(state.fips_char)

That is, if a Series (or numpy array) is passed, do the map/apply as needed and return an object of the same input size/shape.

Separately, since CSVs from sources like IPUMS provide fips codes as ints, saving that step could be useful, but I could see it also making it more error-prone, especially for ZIPs, and it sounds like you've thought about that and made a design decision to require preliminary padding.

MaxGhenis avatar May 01 '19 21:05 MaxGhenis

This would be great to have—any updates on the status?

rchurt avatar Jul 12 '20 02:07 rchurt

Hey there! Great library, I've been using it in several projects at work. This issue caught my eye since we use Pandas a lot in my projects too, so I thought I'd weigh in.

Since this library is pretty lightweight (the only dependency I see is on jellyfish), I think it would be better to have guidance on how to quickly create a DataFrame from it:

import pandas as pd
from us.states import STATES
us_df = pd.DataFrame([vars(st) for st in STATES])

From there, I assume you wouldn't be doing a 'lookup' unless you had another dataset you were trying to merge with, but if that's the case then you would use something like df.merge now that you have this library's records as a DataFrame.

If it so happens to be that your dataset has numeric FIPS codes (this would be 'dirty' data though, since as @mileswwatkins stated, FIPS codes are officially zero-padded strings), you can either:

  1. convert the padded-string FIPS codes to numeric prior to merging:
us_df['fips'] = pd.to_numeric(us_df['fips'])

or 2) convert your data to padded-string FIPS codes prior to merging:

my_df['fips'] = my_df['fips'].str.pad(width=2, fillchar='0')

Alternatively, if you know you just need to convert from one representation to another and don't need all of the information from this library as a DataFrame, rather than doing a lookup one at a time, you can use this library's mapping function to quickly convert from one representation to another:

from us.states import mapping
my_df["abbr"] = my_df["fips"].map(mapping('fips', 'abbr'))

In my opinion: since the zero-padded version is the official designation, that's what this library should maintain, and it shouldn't make assumptions about what libraries downstream consumers are using (consider if someone else were using PySpark or Dask instead), but documentation/examples are always good to have (perhaps in a 'Recipes' guide)

Hope this helps!

wolfc86 avatar Feb 07 '21 19:02 wolfc86

I've also had this problem and converted from int to a string with leading zeros as a workaround.

The issue with this, however, is that I caught the fact that some of the lookups didn't return any values only later on. There was no error, passing person["statefip"].astype(str).apply(lambda x: us.states.lookup(x)) in this case returned a series with 44 unique states values, instead of 51, with no warning/error about the failed lookups

fedderw avatar Apr 30 '21 19:04 fedderw