pygris batch_geocode fails to parse the csv if the first address is not matched

batch_geocode fails to parse the csv if the first address is not matched

Open seamus-mckinsey opened this issue 11 months ago • 0 comments

Problem

Because the US Census Batch Geocode API returns a csv with no headers, pandas infers the number of columns from the first row. Unfortunately only three columns are returned for addresses that aren't matched. So if the first row in the returned csv is a no match, the csv will fail to parse with this error:

ParserError: Error tokenizing data. C error: Expected 3 fields in line 26, saw 12

Minimum reproducible example addresses_with_bad_address_first.csv:

1,1 bad place,Washington,DC,20001
2,1600 Pennsylvania Avenue NW,Washington,DC,20500

Trying to batch geocode:

import pandas as pd
from pygris.geocode import batch_geocode

df_with_problem = pd.read_csv('addresses_with_bad_address_first.csv')
batch_geocode(df_with_problem, address='address', city='city', state='state', zip='zip', id_column = 'unique_id')

That fails with the ParserError error shown above.

Proposed Solution

We could solve this problem by passing the expected columnnames to the pd.read_csv function. See a minimum example below: Generate the csv via curl:

curl --form addressFile=@addresses_with_bad_address_first.csv --form benchmark=2020 --form vintage=2020 https://geocoding.geo.census.gov/geocoder/geographies/addressbatch --output geocoderesult.csv

This is the output from the API:

"1","1 bad place, Washington, DC, 20001","No_Match"
"2","1600 Pennsylvania Avenue NW, Washington, DC, 20500","Match","Exact","1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500","-77.03518753691037,38.89869893251995","76225813","L","11","001","980000","1034"

Read the csv into python:

names = ['id', 'address', 'status', 'match_quality', 'matched_address', 'coordinates', 'tiger_line_id', 'tiger_side', 
                      'state', 'county', 'tract', 'block']
df_with_problem = pd.read_csv('geocoderesult.csv', sep = ",", header = None, quoting = csv.QUOTE_ALL, names = names)

This is what df_with_problem looks like:

id	address	status	match_quality	matched_address	coordinates	tiger_line_id	tiger_side	state	county	tract	block
1	1 bad place, Washington, DC, 20001	No_Match	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1600 Pennsylvania Avenue NW, Washington, DC, 2...	Match	Exact	1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500	-77.03518753691037,38.89869893251995	76225813.0	L	11.0	1.0	980000.0	1034.0

The relevant code is: https://github.com/walkerke/pygris/blob/67ffcfc69a4f899cc668b1b9cf6426ee211822a4/pygris/geocode.py#L307-L311

Mar 01 '24 16:03 seamus-mckinsey

pygris pygris copied to clipboard

batch_geocode fails to parse the csv if the first address is not matched

Problem

Proposed Solution

pygris
pygris copied to clipboard