pygris
pygris copied to clipboard
batch_geocode fails to parse the csv if the first address is not matched
Problem
Because the US Census Batch Geocode API returns a csv with no headers, pandas infers the number of columns from the first row. Unfortunately only three columns are returned for addresses that aren't matched. So if the first row in the returned csv is a no match, the csv will fail to parse with this error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 26, saw 12
Minimum reproducible example addresses_with_bad_address_first.csv
:
1,1 bad place,Washington,DC,20001
2,1600 Pennsylvania Avenue NW,Washington,DC,20500
Trying to batch geocode:
import pandas as pd
from pygris.geocode import batch_geocode
df_with_problem = pd.read_csv('addresses_with_bad_address_first.csv')
batch_geocode(df_with_problem, address='address', city='city', state='state', zip='zip', id_column = 'unique_id')
That fails with the ParserError
error shown above.
Proposed Solution
We could solve this problem by passing the expected columnnames
to the pd.read_csv
function. See a minimum example below:
Generate the csv via curl:
curl --form addressFile=@addresses_with_bad_address_first.csv --form benchmark=2020 --form vintage=2020 https://geocoding.geo.census.gov/geocoder/geographies/addressbatch --output geocoderesult.csv
This is the output from the API:
"1","1 bad place, Washington, DC, 20001","No_Match"
"2","1600 Pennsylvania Avenue NW, Washington, DC, 20500","Match","Exact","1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500","-77.03518753691037,38.89869893251995","76225813","L","11","001","980000","1034"
Read the csv into python:
names = ['id', 'address', 'status', 'match_quality', 'matched_address', 'coordinates', 'tiger_line_id', 'tiger_side',
'state', 'county', 'tract', 'block']
df_with_problem = pd.read_csv('geocoderesult.csv', sep = ",", header = None, quoting = csv.QUOTE_ALL, names = names)
This is what df_with_problem
looks like:
id | address | status | match_quality | matched_address | coordinates | tiger_line_id | tiger_side | state | county | tract | block |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 bad place, Washington, DC, 20001 | No_Match | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1600 Pennsylvania Avenue NW, Washington, DC, 2... | Match | Exact | 1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500 | -77.03518753691037,38.89869893251995 | 76225813.0 | L | 11.0 | 1.0 | 980000.0 | 1034.0 |
The relevant code is: https://github.com/walkerke/pygris/blob/67ffcfc69a4f899cc668b1b9cf6426ee211822a4/pygris/geocode.py#L307-L311