pygris icon indicating copy to clipboard operation
pygris copied to clipboard

batch_geocode fails to parse the csv if the first address is not matched

Open seamus-mckinsey opened this issue 11 months ago • 0 comments

Problem

Because the US Census Batch Geocode API returns a csv with no headers, pandas infers the number of columns from the first row. Unfortunately only three columns are returned for addresses that aren't matched. So if the first row in the returned csv is a no match, the csv will fail to parse with this error:

ParserError: Error tokenizing data. C error: Expected 3 fields in line 26, saw 12

Minimum reproducible example addresses_with_bad_address_first.csv:

1,1 bad place,Washington,DC,20001
2,1600 Pennsylvania Avenue NW,Washington,DC,20500

Trying to batch geocode:

import pandas as pd
from pygris.geocode import batch_geocode

df_with_problem = pd.read_csv('addresses_with_bad_address_first.csv')
batch_geocode(df_with_problem, address='address', city='city', state='state', zip='zip', id_column = 'unique_id')

That fails with the ParserError error shown above.

Proposed Solution

We could solve this problem by passing the expected columnnames to the pd.read_csv function. See a minimum example below: Generate the csv via curl:

curl --form addressFile=@addresses_with_bad_address_first.csv --form benchmark=2020 --form vintage=2020 https://geocoding.geo.census.gov/geocoder/geographies/addressbatch --output geocoderesult.csv

This is the output from the API:

"1","1 bad place, Washington, DC, 20001","No_Match"
"2","1600 Pennsylvania Avenue NW, Washington, DC, 20500","Match","Exact","1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500","-77.03518753691037,38.89869893251995","76225813","L","11","001","980000","1034"

Read the csv into python:

names = ['id', 'address', 'status', 'match_quality', 'matched_address', 'coordinates', 'tiger_line_id', 'tiger_side', 
                      'state', 'county', 'tract', 'block']
df_with_problem = pd.read_csv('geocoderesult.csv', sep = ",", header = None, quoting = csv.QUOTE_ALL, names = names)

This is what df_with_problem looks like:

id address status match_quality matched_address coordinates tiger_line_id tiger_side state county tract block
1 1 bad place, Washington, DC, 20001 No_Match NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1600 Pennsylvania Avenue NW, Washington, DC, 2... Match Exact 1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500 -77.03518753691037,38.89869893251995 76225813.0 L 11.0 1.0 980000.0 1034.0

The relevant code is: https://github.com/walkerke/pygris/blob/67ffcfc69a4f899cc668b1b9cf6426ee211822a4/pygris/geocode.py#L307-L311

seamus-mckinsey avatar Mar 01 '24 16:03 seamus-mckinsey