usaddress
Adding various patterns to training data
Hi,
I've been working with the usaddress library and have added some patterns that I have seen fail in my datasets. This commit includes the XML files for the training set (training/dealstat_addresses_v1.xml) and the test set (measure_performance/test_data/dealstat_tests_v1.xml). The CSV files were excluded by the .gitignore file; let me know if you need them.
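For anyone reviewing the XML files, here is a minimal sketch of how labeled addresses can be read back as (token, label) pairs. It assumes the layout is an `AddressCollection` root holding `AddressString` elements whose child tags are the labels. The `SAMPLE` string and the `read_labeled` helper are hypothetical illustrations written for this description, not content copied from the committed files.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical sample in the assumed training layout: one <AddressString>
# per address, one child element per token, tag name = label.
SAMPLE = """<AddressCollection>
  <AddressString><AddressNumber>200</AddressNumber> <StreetNamePreDirectional>East</StreetNamePreDirectional> <StreetName>Main,</StreetName> <PlaceName>San</PlaceName> <PlaceName>Diego</PlaceName> <StateName>California</StateName></AddressString>
</AddressCollection>"""


def read_labeled(xml_text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token, label) pairs per <AddressString>."""
    root = ET.fromstring(xml_text)
    return [
        [(child.text or "", child.tag) for child in address]
        for address in root.iter("AddressString")
    ]


for token, label in read_labeled(SAMPLE)[0]:
    print(f"{label:26} {token}")
```

Reading the files this way makes it easy to spot-check that a new pattern is labeled the way you intended before retraining.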
Patterns
- Unknown Illinois pattern #221: see the referenced issue. I'm not sure why this was failing.
- No StreetNamePostType: Sometimes common streets are referenced without a StreetNamePostType, e.g. "200 East Main, San Diego California"
- StreetNamePostType = "Grade": I have only come across this once, so I don't think it is very common, but I included the specific example "19 Hargrove Grade, Palm Coast FL 32137" in the training data (without a corresponding test).
- Rhode Island: "Rhode Island" is occasionally labeled as a PlaceName rather than a StateName
- Direction in PlaceName: Sometimes a direction in the PlaceName is labeled as a StreetNamePostDirectional, e.g. "5548 Elmer Avenue, N. Hollywood, CA 91601"
- Fort Lauderdale: If the address does not have a StreetNamePostType, "Fort" is labeled as one rather than as part of the PlaceName, e.g. "225 West Elm, Fort Lauderdale, FL 33301"
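The patterns above can be spot-checked with a small script along these lines. It is only a sketch: `find_mislabeled`, `CASES`, and `check_with_usaddress` are hypothetical names made up for this description, and the expected labels are my reading of the patterns listed above. The only library call it relies on is `usaddress.parse`, which returns a list of (token, label) pairs.

```python
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[str, str]]


def _norm(token: str) -> str:
    # Drop surrounding commas/periods and case so "Elm," matches "Elm".
    return token.strip(",.").lower()


def find_mislabeled(
    parse: Callable[[str], Labeled], address: str, expected: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {token: (expected_label, actual_label)} for every mismatch.

    Note: repeated tokens in one address collapse to the last occurrence,
    which is good enough for the short examples here.
    """
    actual = {_norm(token): label for token, label in parse(address)}
    mismatches = {}
    for token, want in expected.items():
        got = actual.get(_norm(token), "<missing>")
        if got != want:
            mismatches[token] = (want, got)
    return mismatches


# Example addresses from the list above, with the labels I would expect.
CASES = [
    ("200 East Main, San Diego California",
     {"Main": "StreetName", "San": "PlaceName"}),
    ("19 Hargrove Grade, Palm Coast FL 32137",
     {"Grade": "StreetNamePostType"}),
    ("5548 Elmer Avenue, N. Hollywood, CA 91601",
     {"N.": "PlaceName", "Hollywood": "PlaceName"}),
    ("225 West Elm, Fort Lauderdale, FL 33301",
     {"Fort": "PlaceName", "Lauderdale": "PlaceName"}),
]


def check_with_usaddress() -> None:
    """Run every case through the installed usaddress model."""
    import usaddress  # third-party; imported lazily so the helper stays testable

    for address, expected in CASES:
        print(address, "->", find_mislabeled(usaddress.parse, address, expected))
```

Calling `check_with_usaddress()` prints an empty dict for each address the current model labels as expected, and the offending tokens otherwise. Passing the parser in as an argument keeps the comparison logic independent of which trained model is loaded.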
Both the nose tests and my tests are passing. Let me know how else I can be of assistance. I'm hoping to continue to add new patterns and make pull requests as I work through my datasets.