
clean: clean_address cannot recognize addresses with a single building name

Open NoirTree opened this issue 3 years ago • 5 comments

Describe the bug: Some addresses that consist of a single building name are not recognized by clean_address, even when must_contain is set to an empty tuple.

To reproduce: use this dataset, or see the following demo:

import pandas as pd
from dataprep.clean import clean_address, clean_headers

df = pd.read_csv("./address/Chinese Delivery Drive/Chopsticks_Data_Raw.csv", sep=',', encoding="ISO-8859-1")
clean_headers(df)
df2 = clean_address(df, 'Street_Address', must_contain=())  # empty tuple: require no particular attribute
df2 = df2.loc[:, ['Street_Address', 'Street_Address_clean']]
# keep the rows where cleaning changed the value, then show those that became NaN
df_diff = df2[df2['Street_Address'] != df2['Street_Address_clean']].drop_duplicates()
df_diff[df_diff.Street_Address_clean.isnull()]

For example, "Roper St. Francis Hospital" is set to NaN, but I think it is a reasonable building name. How do clean_address and validate_address decide whether an address is valid? I couldn't find an explanation in the documentation.

NoirTree, Apr 27 '21 07:04

Thank you Jingxuan @NoirTree for finding this bug. Hi Ryan @ryanwdale, could you please help Jingxuan look into it?

qidanrui, Apr 27 '21 08:04

Hi @NoirTree, thanks for making this issue. clean_address() and validate_address() use the usaddress library for parsing addresses. First, usaddress parses the address into attributes; then some additional cleaning is performed and the address is converted into the desired output format. The usaddress library definitely does most of the heavy lifting here.

import usaddress
usaddress.tag("Roper St. Francis Hospital")

This is the call clean_address() makes, and it gives us (OrderedDict([('Recipient', 'Roper St. Francis Hospital')]), 'Ambiguous'). Only one address attribute is returned, and it is labeled 'Recipient'. After cleaning the address's attributes and converting to the output format, we're left with an empty string, so NaN is returned. I definitely agree that it would be nice if the function had the behaviour you're describing, but since usaddress.tag() returns the wrong result in this case, I'm not sure what we could do. Any suggestions?
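One possible workaround, sketched here only as an illustration: when the tagger finds nothing but a 'Recipient', treat that text as a building name instead of discarding it. The helper name building_name_fallback is hypothetical and is not part of dataprep or usaddress; it only inspects the two values that usaddress.tag() returns, so the sketch runs without either library installed.

```python
from collections import OrderedDict
from typing import Mapping, Optional


def building_name_fallback(tagged: Mapping[str, str], address_type: str) -> Optional[str]:
    """Return the 'Recipient' text when it is the only attribute that was tagged.

    `tagged` and `address_type` are the two items returned by usaddress.tag().
    Hypothetical helper for illustration; not part of dataprep or usaddress.
    """
    if address_type == "Ambiguous" and list(tagged) == ["Recipient"]:
        return tagged["Recipient"]
    # Anything else is left to the normal cleaning path.
    return None


# The result quoted above for "Roper St. Francis Hospital".
hospital = OrderedDict([("Recipient", "Roper St. Francis Hospital")])
print(building_name_fallback(hospital, "Ambiguous"))

# An ordinary street address (labels as usaddress names them) is not affected.
street = OrderedDict([
    ("AddressNumber", "123"),
    ("StreetName", "Main"),
    ("StreetNamePostType", "St"),
])
print(building_name_fallback(street, "Street Address"))
```

Whether a bare recipient should count as a valid address is a design decision; a caller could apply this fallback only when must_contain is empty, so that the default behaviour stays unchanged.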

ryanwdale, Apr 27 '21 08:04

Hi, @ryanwdale! Thanks for your explanation! It seems that we need to find other methods (such as other libraries) to make up for these deficiencies, but I'm not familiar with any libraries that provide this functionality. Do you have any ideas?

NoirTree, Apr 27 '21 09:04

libpostal is a library we were considering using to clean addresses from countries other than the US. I'm not sure what its behaviour is in this case, though. I don't think there are many other good, free options.

ryanwdale, Apr 27 '21 21:04

The library looks really promising! Let me try it out. Thanks for your suggestions!

NoirTree, Apr 29 '21 09:04