dataprep
clean: clean_address cannot recognize addresses with a single building name
Describe the bug
Some addresses that consist of a single building name are not recognized by clean_address, even when must_contain is set to an empty tuple.
To reproduce, use this dataset or see the following demo:
import pandas as pd
from dataprep.clean import clean_address, clean_headers
df = pd.read_csv("./address/Chinese Delivery Drive/Chopsticks_Data_Raw.csv", sep=",", encoding="ISO-8859-1")
clean_headers(df)
# Pass an empty must_contain so that no address component is required
df2 = clean_address(df, "Street_Address", must_contain=())
df2 = df2.loc[:, ["Street_Address", "Street_Address_clean"]]
# Keep the rows that clean_address changed, then show those that became NaN
df_diff = df2[df2["Street_Address"] != df2["Street_Address_clean"]].drop_duplicates()
df_diff[df_diff.Street_Address_clean.isnull()]
For example, "Roper St. Francis Hospital" is set to NaN, but I think it is a reasonable building name.
How do clean_address and validate_address decide whether an address is valid? I couldn't find an explanation in the documentation.
Thank you, Jingxuan @NoirTree, for finding this bug. Hi Ryan @ryanwdale, could you please help Jingxuan look into it?
Hi @NoirTree, thanks for making this issue. clean_address() and validate_address() use the usaddress library for parsing addresses. First, usaddress parses the address into attributes; then some additional cleaning is performed and the address is converted into the desired output format. The usaddress library definitely does most of the heavy lifting here.
import usaddress
usaddress.tag("Roper St. Francis Hospital")
This is what clean_address() does, and it gives us (OrderedDict([('Recipient', 'Roper St. Francis Hospital')]), 'Ambiguous'). Only one address attribute is returned, and it is labelled 'Recipient'. After cleaning the address's attributes and converting to the output format, we are left with an empty string, so NaN is returned. I definitely agree that it would be nice if the function had the behaviour you're describing, but since usaddress.tag() returns the wrong result in this case, I'm not sure what we could do. Any suggestions?
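To make the failure mode concrete, here is a minimal sketch of the formatting step described above. It is not dataprep's actual implementation: the function name format_tagged, the ADDRESS_LABELS subset, and the keep_recipient flag are all hypothetical, and the input is simply the tagged result quoted above rather than a live usaddress call. It shows why a parse containing only 'Recipient' ends up as NaN, and one possible fallback that keeps the recipient text instead.

```python
import math
from collections import OrderedDict

# Assumption: a subset of usaddress labels that this sketch treats as
# address components. dataprep's real list and formatting rules may differ.
ADDRESS_LABELS = (
    "AddressNumber",
    "StreetNamePreDirectional",
    "StreetName",
    "StreetNamePostType",
    "OccupancyType",
    "OccupancyIdentifier",
    "PlaceName",
    "StateName",
    "ZipCode",
)


def format_tagged(tagged, keep_recipient=False):
    """Join the recognized components; return NaN when nothing is left.

    keep_recipient is a hypothetical option: when no address component
    was found, fall back to the 'Recipient' text instead of NaN.
    """
    parts = [tagged[label] for label in ADDRESS_LABELS if label in tagged]
    if not parts and keep_recipient and "Recipient" in tagged:
        parts = [tagged["Recipient"]]
    return " ".join(parts) if parts else math.nan


# The tagged result quoted above for "Roper St. Francis Hospital".
tagged = OrderedDict([("Recipient", "Roper St. Francis Hospital")])

default_result = format_tagged(tagged)                        # no address parts
fallback_result = format_tagged(tagged, keep_recipient=True)  # recipient kept
```

Whether such a fallback is desirable is a design question: it would also let through strings that are not addresses at all, so it would probably need to be opt-in.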
Hi, @ryanwdale! Thanks for your explanation! It seems that we need to find other methods (such as other libraries) to make up for these deficiencies, but I'm not familiar with any library that provides this functionality. Do you have any ideas?
libpostal is a library we were considering using to clean addresses from countries other than the US. I'm not sure what its behaviour is in this case, though. I don't think there are many other good, free options.
The library looks really promising! Let me try it out. Thanks for your suggestions!