
Address incorrectly parsed

Open waynebrantley opened this issue 7 years ago • 30 comments

This is a valid address - you can confirm it on any address verification site.

3360 County Road F  Tekamah NE 68061

When AddressParser takes this, it parses the CITY as F Tekamah. Of course, the CITY should be Tekamah and the STREET should be County Road F.

Additionally, it makes the STREET County Rd instead of County Road.
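
For reference, a minimal repro using the parser API that appears later in this thread; the commented results paraphrase the behavior described above rather than exact library output.

    var parser = new AddressParser();
    var result = parser.ParseAddress("3360 County Road F Tekamah NE 68061", true);

    // result.City comes back as "F Tekamah" instead of "Tekamah", and
    // result.StreetLine drops the trailing "F" and abbreviates "Road" to "Rd".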

Thoughts?

TODOs:

  • [x] Add Bulk Testing Method
  • [x] Add Failing Examples
  • [x] Add 10k samples
  • [ ] Solve failing examples (possibly with bigger refactorings)
    • [x] 3360 County Road F Tekamah NE 68061
    • [ ] 623 NE 5th AVE Fort Lauderdale FL 33304
    • [ ] Route 1 Gouldsboro ME 04607
    • [x] 3419 Avenue C Council Bluffs IA 51501
    • [ ] 65 Ginger Woods Valley NE 68064
    • [ ] Rte 175 Blue Hill Falls Rd Blue Hill ME 04614
    • [ ] 1302 LUCERNE AVENUE Lake Worth FL 33460
    • [ ] RR 1 Box 3145 Sedgwick ME 04676
  • [ ] Create plugin to allow city to be looked up from an external source (think this should be its own issue)

waynebrantley avatar Jun 14 '17 21:06 waynebrantley

My thoughts? I usually recall a coworker who once said something like "Regular expressions can't parse every address, because addresses aren't a regular language."

Nevertheless, my goal in keeping this library alive (I'm not the original author) is to have a regular expression that can parse a useful plurality of addresses. We just did a release in the last couple of months. Did this case work for you in the past but break recently?

I don't doubt that your address is valid, and I think the problem is that "Road" is considered a street suffix and that the city field is considered to be everything up to the state abbreviation. I think Brooklyn also had this problem a lot with "Avenue A", "Avenue B", and so on. I know there is a solution to the problem because we solved it at that company (I don't work there anymore and can't remember all the details).
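
A toy pattern (not this library's actual regex) makes the failure mode easy to see: once "Road" is accepted as the street suffix, everything between it and the state abbreviation lands in the city group.

    // Toy illustration only; the real regex is far more involved.
    using System.Text.RegularExpressions;

    var toy = new Regex(
        @"^(?<number>\d+)\s+(?<street>.+?\b(?:Road|Rd|Avenue|Ave|Street|St))\s+" +
        @"(?<city>.+?)\s+(?<state>[A-Z]{2})\s+(?<zip>\d{5})$");

    var m = toy.Match("3360 County Road F Tekamah NE 68061");
    // m.Groups["street"].Value -> "County Road"  (the suffix list ends the street here)
    // m.Groups["city"].Value   -> "F Tekamah"    (the stray "F" lands in the city)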

So, please submit a unit test that includes this address example. If you know regex, please propose a solution that doesn't break any existing tests. If you aren't comfortable with regex, we may be able to reach out to the community for a solution.

jamesrcounts avatar Jun 14 '17 22:06 jamesrcounts

I am a new user of this library...because parsing addresses is HARD. :-)

waynebrantley avatar Jun 14 '17 22:06 waynebrantley

Note that 3197 COUNTY ROAD 31 is parsed properly FYI.

waynebrantley avatar Jun 14 '17 22:06 waynebrantley

Think the regex assumes cities don't have numbers in them. (Of course, some do in Puerto Rico, again--not regular).

jamesrcounts avatar Jun 14 '17 22:06 jamesrcounts

I do have a database of every city for every state (paid subscription). One possible enhancement I thought of was using that to help identify where the city stops. I am preparing some unit tests.

waynebrantley avatar Jun 14 '17 22:06 waynebrantley

Pretty sure the solution we (I) came up with for Brooklyn was to match an "Avenue \w" pattern before trying to parse the address, then use a customized version of the regex from this library to handle that special case.

You can make custom regexes pretty easily because the "sub" regexes that make up the big regex are all publicly accessible. So if State and Zip are working for you then you can just use them, while customizing the street portion, or vice versa.
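
A rough, self-contained sketch of that approach; the lettered-street pattern below is a stand-in written for this example, not one of the library's published sub-expressions.

    using System;
    using System.Text.RegularExpressions;

    // Stand-in pattern for "<number> ... Avenue X" / "<number> ... County Road X" shapes.
    var letteredStreet = new Regex(
        @"^(?<street>\d+\s+.*?\b(?:Avenue|Ave|County Road|Co Rd)\s+[A-Z])\s+" +
        @"(?<city>.+?)\s+(?<state>[A-Z]{2})\s+(?<zip>\d{5}(?:-\d{4})?)$",
        RegexOptions.IgnoreCase);

    var input = "3419 Avenue C Council Bluffs IA 51501";
    var special = letteredStreet.Match(input);

    if (special.Success)
    {
        // Handle the lettered-street special case ourselves.
        Console.WriteLine(special.Groups["street"].Value); // "3419 Avenue C"
        Console.WriteLine(special.Groups["city"].Value);   // "Council Bluffs"
    }
    else
    {
        // Otherwise fall back to the general-purpose parser.
        var result = new AddressParser().ParseAddress(input, true);
        Console.WriteLine(result.StreetLine);
    }

The same pre-check also catches "3360 County Road F Tekamah NE 68061".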

Of course, it would be nice to use your database, but I would be surprised if your subscription allowed that.

jamesrcounts avatar Jun 14 '17 22:06 jamesrcounts

Here are my current failing test cases.

        [TestCase("3360 County Road F Tekamah NE 68061", "3360 COUNTY ROAD F")]
        [TestCase("623 NE 5th AVE Fort Lauderdale FL 33304", "623 NE 5TH AVE")]
        [TestCase("Route 1  Gouldsboro ME 04607", "ROUTE 1")]
        [TestCase("3419 Avenue C Council Bluffs IA 51501", "3419 AVENUE C")]
        [TestCase("65 Ginger Woods  Valley NE 68064", "65 GINGER WOODS")]
        [TestCase("Rte 175 Blue Hill Falls Rd  Blue Hill ME 04614", "RTE 175 BLUE HILL FALLS RD")]
        [TestCase("1302 LUCERNE AVENUE  Lake Worth FL 33460", "1302 LUCERNE AVENUE")]
        [TestCase("RR 1 Box 3145  Sedgwick ME 04676", "RR 1 BOX 3145")]  //regex expects address to start with a number?
        //[TestCase("")]
       
        public void StreetLineTests(string address, string streetLine)
        {
            var parser = new AddressParser();
            var result = parser.ParseAddress(address, true);
            Assert.That(result, Is.Not.Null);
            Assert.That(result.StreetLine, Is.EqualTo(streetLine));
        }

        [TestCase("3360 County Road F Tekamah NE 68061", "TEKAMAH")]
        [TestCase("623 NE 5th AVE Fort Lauderdale FL 33304", "FORT LAUDERDALE")]
        [TestCase("Route 1  Gouldsboro ME 04607", "GOULDSBORO")]
        [TestCase("3419 Avenue C Council Bluffs IA 51501", "COUNCIL BLUFFS")]
        [TestCase("65 Ginger Woods  Valley NE 68064", "VALLEY")]
        [TestCase("1302 LUCERNE AVENUE  Lake Worth FL 33460", "LAKE WORTH")]
        //[TestCase("")]

        public void CityTests(string address, string city)
        {
            var parser = new AddressParser();
            var result = parser.ParseAddress(address, true);
            Assert.That(result, Is.Not.Null);
            Assert.That(result.City, Is.EqualTo(city));
        }

waynebrantley avatar Jun 14 '17 22:06 waynebrantley

Would you be open to some 'larger' changes to this library? Instead of using RegEx to parse everything, we could simplify the problem space. To start with, we can move zip and state out of the regex. These are easy to find and extract with a simple tokenizer because we know what they look like. That would leave us with just the address and city. This would open the opportunity for a 'hook' where the city could be looked up in a 'known city' list for the state/zip. If that happens to match, you quickly end up with just an address line to parse.
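
A rough sketch of what peeling zip and state off the end could look like (not code from this library; a real version would also validate the two letters against the list of state abbreviations):

    using System.Text.RegularExpressions;

    var input = "3360 County Road F Tekamah NE 68061";

    // Take the zip off the end first...
    var zipMatch = Regex.Match(input, @"\s+(?<zip>\d{5}(?:-\d{4})?)\s*$");
    var zip = zipMatch.Groups["zip"].Value;                  // "68061"
    var rest = input.Substring(0, zipMatch.Index);           // "3360 County Road F Tekamah NE"

    // ...then the two-letter state.
    var stateMatch = Regex.Match(rest, @"\s+(?<state>[A-Z]{2})\s*$");
    var state = stateMatch.Groups["state"].Value;            // "NE"
    var streetAndCity = rest.Substring(0, stateMatch.Index); // "3360 County Road F Tekamah"

    // What remains is just street + city, which a 'known city' lookup could then split.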

Additionally, each 'regex' piece in the code should be extracted out into its own tests, so you can test just that regex with unit tests. (For example, POBOX has its own part of the regex; that part should have tests that exercise just POBOX, not the entire address.) This would make finding and modifying the individual parts of the expression easier.
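
For instance, if the PO Box sub-expression were exposed, a focused test could look something like this (same fixture and usings as the tests above; the pattern here is a stand-in, not the library's actual sub-regex):

    [TestCase("PO Box 123")]
    [TestCase("P.O. Box 123")]
    [TestCase("POBOX 123")]
    public void PostalBoxPatternMatches(string input)
    {
        // Stand-in for the library's PO Box sub-expression.
        var postalBox = new Regex(@"^P\.?\s?O\.?\s?BOX\s+\d+$", RegexOptions.IgnoreCase);
        Assert.That(postalBox.IsMatch(input), Is.True);
    }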

Just some thoughts to try and simplify what is being searched for.

waynebrantley avatar Jun 14 '17 23:06 waynebrantley

We can certainly add some tests at a more fine-grained unit level, but the integrated performance matters more. The addresses come in as a blob, so it's the parsing of that blob that matters, which is what the big regex does. Even if we go with your ideas about tokenizing first, it's still the integrated performance of breaking up the address that ultimately matters.

So that we can track improvements/regressions, before starting on large changes I would want to build a significant locking test, perhaps including the edge cases you and others have reported and using a large sample from here: https://results.openaddresses.io/. I have some experience/ideas about building this kind of test and may have some time to work on it; then we would have a basis for making larger changes.

jamesrcounts avatar Jun 14 '17 23:06 jamesrcounts

Performance is a feature, but I would rather have it 2x slower and smarter...than fast and making mistakes. Let me know how you want to proceed and how I can help.

waynebrantley avatar Jun 15 '17 00:06 waynebrantley

Sorry, I didn't mean computational performance. I meant "ability to correctly extract addresses" as the measure of performance. So we are on the same page.

I will scaffold out the locking test, then we need to gather around 10K addresses I think.

jamesrcounts avatar Jun 15 '17 05:06 jamesrcounts

Concerning being able to use known cities to improve the results... I found this rather cheap resource. We could have an interface that checks whether the parsed city is a valid city for the state. If it is, the parser could confidently move on; if it is not, it would have to make its best guess. Users of the library could buy this data, load it into SQL or whatever, and implement the interface. Or they could skip it and let the parser keep making its best guess.

https://www.uscitieslist.org/pricing/

Just an option/thought.
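
A hypothetical shape for that hook (the names here are made up, not part of the library):

    public interface ICityLookup
    {
        // True when the candidate city is a known city for the given state.
        bool IsKnownCity(string city, string stateAbbreviation);
    }

The parser would only consult IsKnownCity to decide where the street line stops; with no implementation registered it would keep its current best-guess behavior.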

waynebrantley avatar Jun 15 '17 12:06 waynebrantley

I added some TODOs to the OP and set up the first two items; I just need to see why it is failing on AppVeyor.

Gathering the 10k sample addresses would be the next step if you want to take a crack at it. See feature/bulk-testing for what I've done so far.

UPDATE: Fixed the AppVeyor builds and merged the bulk testing update. You should be able to add addresses to samples.txt to include them in the test set for ParseExampleAddresses. This test uses ApprovalTests, so once you insert new addresses, the test should fail and your difftool will launch so that you can add them to the approved file (see link).

I added the failing addresses you have reported and approved their current (incorrect) results. The goal of ParseExampleAddresses is to lock down the current behavior (good or bad) so that we can track the effect of changes (fix or regression). So the next step is to add a large sample of addresses from around the 50 states, and possibly territories like PR and military bases overseas if we can find examples. I know we can get PR; not sure how easy military addresses will be (we can always add them later; states and territories should be the priority for now).
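
For anyone following along, the locking test has roughly this shape; file names and the output format here are illustrative, the real test lives on the branch.

    using System.IO;
    using System.Linq;
    using ApprovalTests;
    using ApprovalTests.Reporters;
    using NUnit.Framework;

    [UseReporter(typeof(DiffReporter))]
    public class BulkAddressTests
    {
        [Test]
        public void ParseExampleAddresses()
        {
            var parser = new AddressParser();
            var results = File.ReadAllLines("samples.txt")
                .Select(a => $"{a} => {parser.ParseAddress(a, true)?.StreetLine}");

            // ApprovalTests diffs the received output against the checked-in
            // .approved.txt file, so any behavior change shows up as a test failure.
            Approvals.VerifyAll(results, "parsed");
        }
    }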

If you want to gather the addresses, great. Otherwise, I will probably be able to do some work on that tomorrow morning.

jamesrcounts avatar Jun 15 '17 15:06 jamesrcounts

Regarding verifying cities, there is a free database of valid city, state, and zipcode combinations here:

http://federalgovernmentzipcodes.us/

The data is from 2012, but it's a good starting point if you don't want to pay the subscription fee, or if there is a licensing issue preventing you from using the data in an open-source library. I used this data when testing the pull request I made.

Jcw87 avatar Jun 15 '17 21:06 Jcw87

I did spend a minute looking at this last night and it appears to only contain "City/State/Zip". Is that correct or did I just not look deeply enough?

jamesrcounts avatar Jun 15 '17 21:06 jamesrcounts

You are correct.

Jcw87 avatar Jun 15 '17 21:06 Jcw87

Not sure where we can get the 10k addresses from?

waynebrantley avatar Jun 16 '17 13:06 waynebrantley

As mentioned previously in this thread: https://results.openaddresses.io/

Don't worry about it though. I've already downloaded the US datasets from that site and I'm looking at the data format now.

jamesrcounts avatar Jun 16 '17 14:06 jamesrcounts

I am not familiar with approvals code. Can you explain why you would put the address [3360 County Road F Tekamah NE 68061,True] => into the file with the wrong city and street line as approved? Shouldn't the correct result be in the file, and the test then fail?

waynebrantley avatar Jun 16 '17 14:06 waynebrantley

The concept of a locking test in legacy code isn't specific to approvals; the idea is that we just want to use the test to document the code's current behavior. As long as that test does not fail, it means we have not made any changes to the code's behavior. In other words, this test supports the process of pure refactoring.

Next comes the idea of fixing things. If it were me, I would feel comfortable fixing the "County Road F" example at this point, as this case is covered by a test. When I fix the implementation, the locking test will fail, because the behavior of the code has changed. Once it fails, we will see that it failed due to an improvement. Since improvements are good, we will update the approval file to incorporate the improved results. Run the test again, and it will pass. Now the improvement is locked in, and we can be confident that subsequent fixes/refactorings will not break existing behavior.

Another technique, specific to approvals, would be to update the approval file first to see it fail, then start making changes. Either is fine, except I wouldn't merge a failing approval to master; it should be kept local or on a branch.

Again, this locking test concept is not specific to approvals, it can be done with hundreds of individual unit tests, or it can be done with other data-driven techniques like Theories.

jamesrcounts avatar Jun 16 '17 14:06 jamesrcounts

The approvals stuff is pretty neat. See my PR.

waynebrantley avatar Jun 16 '17 14:06 waynebrantley

Accepted your PR, thanks. "Avenue C" seems like low-hanging fruit. Your special case just needs to generalize out to "County Road" OR "Avenue".

jamesrcounts avatar Jun 19 '17 14:06 jamesrcounts

Thanks for getting the PR done and for getting these 10k addresses in! Since the 10k addresses just 'document current behavior', do we know whether any of these 10k addresses are incorrect?

waynebrantley avatar Jun 20 '17 12:06 waynebrantley

Also, did you notice my link to http://regexstorm.net/tester in the code? I propose we do this for all pieces of the RegEx. It makes it easy to see a test of the regex and to work with it. I do this for all regex items in my code.

waynebrantley avatar Jun 20 '17 12:06 waynebrantley

I submitted the avenue one...but I made it a completely separate case. The reason is that I think the avenue case may end up being a bit more complex, as there are other cases I think it already misses.

waynebrantley avatar Jun 20 '17 13:06 waynebrantley

Do we know the addresses are correct?

They may not all be. The data set at openaddresses.io is very large, and in any large data set there are bound to be some data problems. One thing I didn't like about using that data set is that the addresses were already separated into fields, and I formatted them to simulate user input. It would be great to have a database of junk that users actually wrote. For now, these are better than nothing. If we find suspicious-looking examples, we can remove them from the sample.

Link to regexstorm:

I did notice you did that, and I'm ambivalent about it. I think it is neat, but I don't want to make it a requirement. I think it helps developers come up with the right regex, but the unit tests will be key to actually ensuring quality, so that's what I'm going to focus on requiring people to do. Maybe we can add a link to the readme suggesting this tool and other similar tools for people who want to contribute?

jamesrcounts avatar Jun 20 '17 21:06 jamesrcounts

I am good with that...it is just a good practice I have found over the years.

waynebrantley avatar Jun 20 '17 22:06 waynebrantley

I think it's a little ambitious to attempt corrections on the junk that users input. But, if that is what you are interested in, I have access to thousands of addresses entered by users. I cannot simply share these addresses, however, as I'm pretty sure that it would violate some confidentiality agreement. What I can do is share a list of fake addresses that are based on the most common user errors I have seen.

https://docs.google.com/spreadsheets/d/1UZsef9QrUOXspipcG-kdHwTQD_aFocLPWy0lEwQJxHM/edit?usp=sharing

The yellow rows are the acceptable addresses; the others are the various things I've seen users do. This doesn't even get into the typos I've seen.

Errors and corrections are based on my personal experience with Endicia address verification.

Jcw87 avatar Jun 21 '17 01:06 Jcw87

@Jcw87 thanks for the list. I am not really interested in correcting the junk users input, but rather, if they type a valid address, I want it to parse correctly. (Valid meaning the address has a valid format, not that the actual street/number exists.)

We can add the ones that are valid to the list and make sure the parser handles them correctly. What are your thoughts, @jamesrcounts?

waynebrantley avatar Jun 21 '17 11:06 waynebrantley

When I originally started using this library, the company I worked for was processing real estate appraisals, which have many, many addresses entered into them by appraisers. So I have a lot of experience trying to make sense of the junk users input, but like @Jcw87 I was not able to directly copy those addresses into the test suite here.

The main reason I think we would benefit from processing actual user input is that there are several 'correct' ways to enter an address, particularly when it comes to punctuation. Some people use no punctuation, some like to punctuate things that don't need it. In the end, the post office is still able to figure out what they mean and deliver it.

But since the address data I harvested was already parsed, I reassembled the examples with a simple format string, so all the examples are punctuated and arranged in exactly the same way. I also rejected samples that were missing key fields like the city, yet in my experience at my previous job, appraisers in Manhattan don't bother to put city/state/zip. So, in the end, I feel that the sample is a little "easier" to parse than it should be.

However, I don't think it's super pressing. For now, I'd rather the focus remain on fixing the examples that users have actually reported as wrong. I just wanted to throw my thoughts out there about the 10k samples in case someone knows about or comes across a large database of unparsed user input.

Once we've worked through the addresses @waynebrantley reported, we can add more examples from @Jcw87 if there is anyone who wants to work on the issues that may come up. The truth is that if anyone is using this library in a serious way, they will eventually report edge cases and broken examples and we will collect them that way.

jamesrcounts avatar Jun 21 '17 13:06 jamesrcounts