Tighter specification of service_area field
Leaving this open for interpretation is dangerous, especially when it is important to have well-structured service areas defined for purposes of searching/sorting/filtering the data that might serve a help-seeker. Rather than have this totally undefined, we should consider at least a recommendation, if not a requirement, of its structure. I propose a semicolon-delimited list of official entity names that goes as deep as necessary to characterize each (of possibly many) service_area entries.
[2-letter ISO country code]; [2-letter ISO code for state/province]; [Full name of "county" or equivalent]; [Full name of "city" or "town"]; [Postal code]
For example, a set of entries for a single service with multiple (ludicrous) entities it serves:
US; CA; Marin County; Fairfax; 94930;
US; NV;
US; FL; Seminole County;
US; IL; Cook County; Chicago;
If we follow this route we should still have a field where the service_area can be described in free text. Following on the example above, perhaps a new field of service_area_description = "For 94930, only serves residents on the north side of Bolinas Avenue.".
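A minimal sketch of how a consuming system might split one such entry into named levels. The level names (`country`, `region`, `county`, `locality`, `postal_code`) and the function `parse_service_area` are hypothetical labels chosen for illustration; nothing here is defined by HSDS.

```python
# Hypothetical names for the five proposed positions (not defined by HSDS).
LEVELS = ("country", "region", "county", "locality", "postal_code")


def parse_service_area(entry: str) -> dict:
    """Split one semicolon-delimited entry into named levels.

    Entries may stop early (e.g. "US; NV" covers a whole state), so only the
    levels actually present are returned.
    """
    parts = [p.strip() for p in entry.split(";")]
    parts = [p for p in parts if p]  # drop empties left by a trailing ';'
    if len(parts) > len(LEVELS):
        raise ValueError(f"too many levels in entry: {entry!r}")
    return dict(zip(LEVELS, parts))


print(parse_service_area("US; IL; Cook County; Chicago;"))
```

Because the format is positional, an entry that skips a level (say, a postal code with no city) cannot be expressed without some placeholder convention, which this sketch does not attempt to define.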
Upon further thinking, we should consider creating a new object type that is well-structured for these kinds of geographical hierarchies and can be a child of the service object.
@NeilMcKechnie Do you have a suggestion for the more structured object?
I've been thinking about this and I quite like your suggested pattern of 'US; IL; Cook County; Chicago' because it nicely balances human readability for systems that are not very geo-aware with something that more geo-aware systems can get meaning out of.
I'm concerned that a rigid 5-level structure doesn't account for variations in how communities are described (i.e. it might work well for the US, but doesn't do as well in Canada IMO). Additionally, in some areas (like Canada) a "Postal Code" may represent multiple communities and not fall nicely at the bottom of a community hierarchy.
I do agree that community precision is important, but would recommend assessing more flexible methods of adding precision that accommodate these variances. I have some things in mind that are probably not brief to describe, but would welcome a larger discussion on this. In general, my belief is that the community structure being used (including its precision detail) should be provided with the data to improve accuracy and provide opportunities for mapping regions as needed.
Excellent points @klambacher and you and I have had many great conversations on this topic. But we do need to find some approach that maximizes interoperability. For that reason I propose that we should stick with officially recognized entities that are also colloquially known.
For example, a Congressional District (US) or a Parliamentary Riding (Canada) are official entities but most people have no idea which one their home is in or the one they are currently in when away from home. That makes them not very useful in referral-making scenarios.
As well, a neighborhood that is not an official entity may only be known to people who live and work there, and also suffers from not having known boundaries.
Yes, Zip and Postal Codes can often span multiple entities, so I would propose the data should include an entry for each entity/code combination that is relevant. It won't solve all problems. I know your valid examples where a person might live 10km from the official entity but still in the same postal code, but perhaps that can be handled with Lat/Long coordinates and/or descriptive text in some way.
Relatedly, I have never found a good definitive source of geographic data for the US or Canada. Candidates have included the national postal service, census bureaus, Google Geo API, OpenStreetMap and other third parties. Shockingly this is a huge void in our world.
I think @klambacher does have a geographic taxonomy for Canada. @neilmckechnie what are you looking for that OpenStreetMap or the Google Geo API don't have?
Yes I've seen @klambacher's geographic taxonomy for Canada but it does have the "issues" (from my perspective, sorry Kate) I note above. Also I don't believe it is grounded in Lat/Long approximations, which makes it not usable for proximity searches. Kate correct me if I am wrong.
Some problems with Google Geo API: When a zip code crosses a county line, G isn't "aware" of it and only lists one county the zip code is in. Often that is not the county that has the majority of the zip code. It also does represent colloquial areas like neighborhoods but not in a well structured way that is needed for defined coverage areas (e.g. what are the actual boundaries?)
OpenStreetMap: Still pretty chaotic since it is crowdsourced. Some areas are richly populated because active users there invest in the data. Other areas are woefully sparse that shouldn't be.
Interesting. One would think that shapefiles for county lines and zip codes are openly available. I'ma ask some nerds.
Hrm, @NeilMcKechnie I found this https://www.census.gov/geo/maps-data/maps/2010tract.html but apparently zipcodes are "complicated."
This all may be a derail, happy to take it offline.
Thanks @greggish but this fails the "colloquially known" test. No one knows what census tract they live in, except the tiny percentage of people who work with them. Or am I missing some other point you're making?
It says the files include counties but without zipcodes I guess that doesn't help this particular challenge?
Wait - ok here! https://www.census.gov/geo/maps-data/data/cbf/cbf_counties.html
As for zipcodes - yikes what a mess. http://www.unitedstateszipcodes.org/
I haven't seen an ideal interpretation of Canadian communities yet (I'd certainly like to see more information built on what we are using now) but my concern is really that, for the standard to accommodate regional variances, the fixed 5 level structure can force a lot of gymnastics to accommodate the reality of diverse communities. Even within a single province like Ontario, the way you can easily describe Toronto vs. the way to describe the North is vastly different. It is really important to accommodate that "colloquially known" aspect and to understand some of the odd exceptions people require. Features we have found the most problematic:
- Communities that cross regional or provincial boundaries, or legally belong to a different territory/region than the one that surrounds them
- The need to accommodate multiple (alternate) names for the same region, due to name changes, traditional or band names, etc. It's important to be able to accommodate different ways the public might look for something.
- The need to cluster and name regions that fall outside the hierarchy, including statistical regions, health regions, wards, and other various planning/service delivery regional structures. These regions often do not fit nicely into a hierarchical structure, but some referral providers have to provision service or report against these regions. Having the ability to cluster and name communities outside the hierarchy is important.
- Having a hierarchy with flexible levels. There is a tremendous difference between the way someone finds services in urban vs. rural areas. The need to sub-divide a region to best find a service needs to be flexible to the needs of different areas. The population/service density variation means what's good for one isn't good for another, and postal codes in particular can be ridiculously unusable for finding services in rural areas in Canada, nor do postal codes fall at the bottom of the hierarchy in rural areas.
Bottom line was that I'd like to open the discussion so that there was more flexibility in how this could be captured/shared. I think that's very important if the long-term desire is for this format to work internationally.
All valid points Kate. I think the decision for HSDS is going to come down to "what maximizes interoperability between heterogeneous systems" and "what is the most flexible structure for searching for resources". The less structure and more flexibility we put into HSDS, the more open it is to different interpretations and hence the less interoperable it will be. My mind favors maximum interoperability because if the data can't be reliably transferred between systems, then there is no hope of searching for them effectively in destination systems.
I've had a go at summarising this issue below and working up a draft proposal which would introduce a couple of new fields into the mix.
Current situation
The service_areas.csv table allows one or more service areas to be specified for a service.
It is currently described as follows:
"The service_area table contains details of the geographic area for which a service is available."
It contains three fields:
- id - Each service area must have a unique identifier
- service_id - The identifier of the service for which this entry describes the service area
- service_area - The geographic area where a service is available. This is a free-text description, and so may be precise or indefinite as necessary.
Use cases
The concept of a service_area is used in a number of ways:
- (1) Search - allowing a user to put in geographic terms that make sense to them, and to find services accordingly. This might involve use of colloquial names for regions, rather than officially recognised administrative geographies.
- (2) Checking eligibility - allowing a user to check if they are in the catchment area for a given service. This may require use of clearly boundaried administrative geographies, that make it possible to check if a point location falls within the boundary covered by a service.
- (3) Reporting - allowing a provider to report on the services they operate within a given area - which may involve widely recognised, or funder-specific, geographies, such as health-service areas, or local neighbourhoods.
Requirements
It should be possible for applications to resolve service_areas to geometries that they can work with to check if individuals fall within the service area.
It should be possible for applications to display a service_area to users in a front-end interface in ways that are intelligible to the users.
It should be possible to map between different kinds of service area (reporting, eligibility etc.)
The data
There is wide variation in how address information is expressed in different countries (see wikipedia: Address (Geography) for examples of the diversity).
Many countries or forms of geography lack published shapefiles or boundary data.
Actual service areas may involve overlapping geographies, or may have fuzzy boundaries, or caveats (e.g. 'This area, but not South side of the road.')
Examples of real service_area data currently produced include:
- (Restricted to those lying in Cherokee Nation Tribal Boundaries) Adair County
- 30 miles outside the greater Tulsa area.
- Zip code area served 73107 and parts of 73106. the boundaries are from Drexel to Shartel and 12th to 30th
- Zip code 73099 or Yukon school district
- United States;NY;Chemung;Elmira;14901;
- Port Hardy and Area
A service area might ultimately resolve to:
- A point and radius - providing an estimate of the area a service covers;
- A geometry - providing the exact boundary in which the service is available;
- A geographic concept - providing a human recognisable idea of the area over which the service is delivered - but without a set boundary that would rule people in or out of eligibility;
OpenStreetMap provides some administrative geography, but data quality is mixed. E.g. compare the hierarchy for Stroud, UK, which gives good information on parent geography levels, with that for Chicago, USA, which does not.
The OSM data also does not get down to neighbourhood level.
Proposal
Based on the above discussion, and essentially coming back to Neil's second suggestion, here's a proposal for discussion.
We extend the service area table with a description field, and introduce an optional secondary table to contain hierarchy information.
"The service_area table contains details of the geographic area for which a service is available. There may be multiple entries in the service_area table for each service."
| field | definition |
|---|---|
| id | Each row in this table must have a unique identifier |
| service_id | The identifier of the service that covers this area. |
| service_area | A free-text label for this area. This is commonly displayed to users, and used in searches. |
| description | A more detailed description of this service area. Used to provide any additional information that cannot be communicated using the structured area and geometry fields. |
| geography_id | Where a geographic hierarchy is available in the area_reference table, the identifier of the area covered by the service can be given here. |
| point | A latitude, longitude pair giving the centre of the area served |
| radius | The distance in kilometres from the point to which service coverage extends |
We add guidance that either geography_id or point and radius need to be provided.
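As a rough sketch of what that guidance could mean for a consumer: the check below accepts a row if it has a `geography_id`, or both a `point` and a `radius`, and then uses the point and radius for a simple eligibility test. The function names and the use of the haversine formula with a mean Earth radius of 6371 km are assumptions made for illustration, not part of the proposal.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def has_required_geography(geography_id: Optional[str],
                           point: Optional[Point],
                           radius_km: Optional[float]) -> bool:
    """True if the row gives a geography_id, or both a point and a radius."""
    return bool(geography_id) or (point is not None and radius_km is not None)


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def within_service_area(user: Point, point: Point, radius_km: float) -> bool:
    """Eligibility check for a point-and-radius service area."""
    return haversine_km(user, point) <= radius_km
```

A point-and-radius check like this only approximates eligibility; services with an exact boundary would need a point-in-polygon test against the geometry instead.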
And we introduce the area_reference table, which can represent arbitrary hierarchies using an ID and parent ID, following the same pattern as the taxonomy table.
| field | definition |
|---|---|
| id | A unique identifier for this geography |
| parent_id | If this geography is at or below the second level in a hierarchy, provide the identifier for the parent level here. |
| name | The name of this area. |
| geometry | Where a defined boundary is available for this area this should be described using a 'Well Known Text' polygon string. |
| uri | A link to further human or machine-readable information about this geography. |
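To illustrate how the `id` / `parent_id` pattern could reproduce the semicolon-delimited form discussed earlier, here is a small sketch that walks from one area up to the top of its hierarchy. The identifiers in `AREAS` and the function `path_to_root` are made up for this example.

```python
# Hypothetical rows for an area_reference table (ids are illustrative only).
AREAS = {
    "us": {"parent_id": None, "name": "US"},
    "us-il": {"parent_id": "us", "name": "IL"},
    "us-il-cook": {"parent_id": "us-il", "name": "Cook County"},
    "us-il-cook-chi": {"parent_id": "us-il-cook", "name": "Chicago"},
}


def path_to_root(area_id, areas):
    """Return area names from the top of the hierarchy down to area_id."""
    names, seen = [], set()
    while area_id is not None:
        if area_id in seen:  # guard against a malformed parent_id loop
            raise ValueError("cycle in parent_id chain")
        seen.add(area_id)
        row = areas[area_id]
        names.append(row["name"])
        area_id = row["parent_id"]
    return list(reversed(names))


print("; ".join(path_to_root("us-il-cook-chi", AREAS)))
# prints: US; IL; Cook County; Chicago
```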
This would allow:
- A choice between giving a location and radius, or a specific boundary;
- Use of informal as well as formal descriptions of locations;
- Each application to specify their own geographic hierarchy;
- Caveats on locations to be provided in the descriptions;
It doesn't perfectly capture all concepts. For example, "30 miles outside the greater Tulsa area" would most likely need to be represented with a point at the centre of the Tulsa area, and then a radius based on the average distance from centre of the area to the boundary of the area, plus 30 miles.
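The arithmetic for that kind of approximation might look like the sketch below. The function name `radius_km` and the sample distances are invented for illustration (they are not measurements of the Tulsa area); the only fixed fact is the mile-to-kilometre conversion, since the proposed `radius` field is in kilometres.

```python
KM_PER_MILE = 1.609344


def radius_km(centre_to_boundary_km, extra_miles):
    """Average centre-to-boundary distance (km) plus an extra margin in miles."""
    if not centre_to_boundary_km:
        raise ValueError("need at least one distance")
    mean_km = sum(centre_to_boundary_km) / len(centre_to_boundary_km)
    return mean_km + extra_miles * KM_PER_MILE


# Placeholder distances from the centre to sampled boundary points, in km.
print(radius_km([10.0, 20.0], extra_miles=30))
```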
How important precision is here for different use-cases will, I anticipate, vary - although there is the option that an exact geometry could also be provided in the area_reference table.
Next steps
Views on the proposal above are welcome.
We would want to mock-up some examples of data represented in this way, and to work through:
- How easy it would be to convert existing data into this structure;
- How easy it would be to query data in this structure (e.g. can we construct reasonable SQL queries or import processes that would take data from this format into systems)
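As a first, hedged look at the querying question, the sketch below loads a few made-up rows into an in-memory SQLite database and uses a recursive query to find services whose area falls inside a given geography. Table and column names follow the proposal above; the sample rows and the `services_in` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE area_reference (id TEXT PRIMARY KEY, parent_id TEXT, name TEXT);
CREATE TABLE service_area (id TEXT PRIMARY KEY, service_id TEXT,
                           service_area TEXT, geography_id TEXT);
INSERT INTO area_reference VALUES
  ('us', NULL, 'US'),
  ('us-il', 'us', 'IL'),
  ('us-il-cook', 'us-il', 'Cook County'),
  ('us-il-cook-chi', 'us-il-cook', 'Chicago'),
  ('us-nv', 'us', 'NV');
INSERT INTO service_area VALUES
  ('sa1', 'svc-a', 'Chicago', 'us-il-cook-chi'),
  ('sa2', 'svc-b', 'Nevada', 'us-nv');
""")

# Collect the chosen area and everything beneath it, then match service areas.
QUERY = """
WITH RECURSIVE subtree(id) AS (
  SELECT id FROM area_reference WHERE id = ?
  UNION ALL
  SELECT a.id FROM area_reference a JOIN subtree s ON a.parent_id = s.id
)
SELECT service_id FROM service_area
WHERE geography_id IN (SELECT id FROM subtree)
ORDER BY service_id
"""


def services_in(area_id):
    """Services whose geography_id is the given area or one of its descendants."""
    return [row[0] for row in conn.execute(QUERY, (area_id,))]


print(services_in("us-il"))
```

This only walks down the hierarchy. Answering "which services cover Chicago?" would also need to walk up, since a service registered against all of IL serves Chicago too.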
References
Amongst other sources, I looked at the following for precedents and inspiration.
@timgdavies thank you for the very thoughtful response.
Your recommendation is certainly getting there but you've essentially chosen the approach the venerable @klambacher has put forward, which is a hierarchically recursive structure that otherwise can be implemented with a high degree of interpretation by each developer.
This will reduce interoperability and require that each entity consuming data needs to study and write an interpretive set of code for each data source. That is far from ideal. I can articulate at least 30 different software systems I can foresee needing to exchange data with. That means just for me and those partners we would need to write a staggering amount of 1:1 interpretations. (My math is rusty: is that 30^2 or 30!...? Either way, it is a very large number.)
We will already be stuck with the problem of inexact naming: Port Saint Lucie, Florida vs Pt. St. Lucie FL vs Port St. Lucie FL
Or Durham, ON (the large region) versus Durham ON (the tiny town 200 km away)
Having seen exported data from many dozens of peer systems in our Information and Referral space, the variation in how geography is expressed is quite vast and hugely problematic for interoperability. If you're not convinced I could give you a more systematic answer with real examples.
The problem space would be reduced significantly if we could at least overlay on your proposal some country-specific structure, with room for some interpretation that could be ignored as desired by the consuming entity.
@NeilMcKechnie This raises a really key issue - insofar as HSDS has, to date, tried to minimise specifying particular codelists or taxonomies to use.
However, I can see nothing that would militate against a particular community of publishers agreeing to use a common area_reference table - we just need to work out how this sits in HSDS docs etc. as a recommendation.
The idea of including options for geometry and point was that this should allow systems to use reverse geo-coding techniques to reconcile heterogeneous geographies. E.g. if you know the point location and radius of a service, you can look at all the administrative boundaries you are concerned with that it overlaps.
It avoids the 30^2 mapping - as each publisher should be providing enough information to at least support mapping their data to physical geography, and each consumer would only need a mapping from physical geography to their systems (albeit recognising that this can be lossy, in terms of exact geographies...)
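For a rough sense of scale, here is the counting behind the two approaches, assuming every system both publishes and consumes data. The function names are illustrative only.

```python
def pairwise_mappings(n: int) -> int:
    """One mapping per ordered (source, destination) pair of distinct systems."""
    return n * (n - 1)


def hub_mappings(n: int) -> int:
    """Each system maps to and from one shared reference (physical geography)."""
    return 2 * n


print(pairwise_mappings(30), hub_mappings(30))  # prints: 870 60
```

So direct 1:1 interpretation grows roughly with the square of the number of systems (close to 30^2, not 30!), while mapping through a shared physical geography grows linearly.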
We discussed this on today's Open Referral Assembly call.
My takeaway from this is that there are two elements to a hierarchy: disambiguation and constraint.
(1) Disambiguation
Distinguishing, for example:
from
(2) Constraints
For example, when a postal code area crosses city or county boundaries.
E.g. US postal code 94608 falls between Emeryville and Oakland
In this case:
US; CA; Alameda County; Emeryville; 94608
could be interpreted to represent only those parts of 94608 that overlap with Emeryville.
I raised the idea that we could address the issue of differential hierarchies by having a template set in meta-data for the service_area text field.
This would provide substitution patterns to work out what is represented by each position in a ; separated list, and could also provide constants.
Using, for example, the Open Street Map admin_level values to provide a global mapping of what each position means:
AU; New South Wales; Mid-Coast Council; Stroud
can be clarified with the template:
"{OSM2}; {OSM4}; {OSM6}; {OSM9}"
or, to write-out the Australia specific terms:
"{Border}; {State or Territory Border}; {Local Government Authority}; {Suburb and Locality}"
Or, if I had a dataset that was 100% UK data, there could be a pattern to provide constants, such that my data might contain just:
Gloucestershire; Stroud
but the template would indicate something like "{OSM2=GB} {OSM4=England} > {{OSM6}}; {{OSM8}}"
This obviously needs some work - but I think offers a backwards compatible alternative to creating a separate area_reference table (which, in any case, only really meets the disambiguation and not the constraint use-case)
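To make the template idea a little more concrete, here is one possible reading of it as code. It assumes single-brace placeholders, with `{NAME=value}` marking a constant; that syntax, the regular expression, and the function `apply_template` are assumptions for this sketch rather than anything agreed. Separators in the template are ignored, so only the order of placeholders matters.

```python
import re

# Matches {NAME} or {NAME=constant}; the syntax is an assumption for this sketch.
PLACEHOLDER = re.compile(r"\{([^{}=]+)(?:=([^{}]*))?\}")


def apply_template(template: str, value: str) -> dict:
    """Label each ';'-separated item of value using the template's placeholders.

    Placeholders with a constant contribute a fixed value and do not consume
    an item from the data.
    """
    items = iter(v.strip() for v in value.split(";") if v.strip())
    result = {}
    for match in PLACEHOLDER.finditer(template):
        name, constant = match.group(1), match.group(2)
        if constant is not None:
            result[name] = constant
            continue
        try:
            result[name] = next(items)
        except StopIteration:
            break  # the data stops at a shallower level than the template
    return result


print(apply_template("{OSM2}; {OSM4}; {OSM6}; {OSM9}",
                     "AU; New South Wales; Mid-Coast Council; Stroud"))
print(apply_template("{OSM2=GB}; {OSM4=England}; {OSM6}; {OSM8}",
                     "Gloucestershire; Stroud"))
```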
If we take it that a hierarchical service_area is intended to both disambiguate and constrain the service_area, then we need to document the rules to interpret this, which, at a first attempt would be:
To get the service_area for a single service
For each service_area linked to the service:
(1) Split the service_area string on ';'
(2) Resolve each of the items in the resulting array to a boundary where possible (using information from the entire string to disambiguate where relevant);
(3) Take the area covered by the intersection of all these boundaries;
Take the union of all the boundaries resulting from the process above.
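The rules above can be sketched in a few lines if boundaries are modelled as plain sets. In this toy version each name resolves to a set of invented grid-cell ids standing in for a polygon; `GAZETTEER`, `resolve_entry` and `service_boundary` are hypothetical names, and a real implementation would use proper geometries and would also need the disambiguation part of step 2, which is not modelled here.

```python
# Toy gazetteer: each name maps to grid-cell ids standing in for a boundary.
# All cells are invented for illustration.
GAZETTEER = {
    "US": set(range(0, 100)),
    "CA": set(range(0, 40)),
    "Alameda County": set(range(10, 30)),
    "Emeryville": {12, 13, 14},
    "Oakland": {14, 15, 16, 17},
    "94608": {13, 14, 15},
}


def resolve_entry(entry):
    """Steps 1-3: split on ';', resolve each item, intersect the boundaries."""
    items = [i.strip() for i in entry.split(";") if i.strip()]
    # "Where possible": items with no known boundary are skipped.
    boundaries = [GAZETTEER[i] for i in items if i in GAZETTEER]
    if not boundaries:
        return set()
    return set.intersection(*boundaries)


def service_boundary(entries):
    """Final step: union of the areas from every service_area entry."""
    area = set()
    for entry in entries:
        area |= resolve_entry(entry)
    return area


print(sorted(resolve_entry("US; CA; Alameda County; Emeryville; 94608")))
# prints: [13, 14]
```

With this reading, "US; CA; Alameda County; Emeryville; 94608" yields only the cells shared by Emeryville and 94608, which matches the constraint interpretation described earlier.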
This didn't make it into 1.1, but on today's assembly call we briefly discussed:
(A) Taking forward the clarification of disambiguation and containment
(B) Continuing to work on this issue to get towards a good solution
Coming late to the discussion here, but on today's call, I think I heard a characterization of the choice for HSDS as either "free text plus description (to inform interpretation)" [for flexibility] or "developing an opinionated model" [for machine readability and interoperability].
Borrowing from the discussion on taxonomies, could a third option be to allow for multiple systems, but also include a (machine interpretable) field to identify the system being used (e.g., service_area_type) that could be the machine interpretable equivalent of "description"?
I am tentatively closing this, because my key takeaway from this thread was the need for a dedicated model (table or JSON Schema) for service_area. I'm not sure when it was added, but this is present in 3.0, which I believe addresses the initial motivation for the issue.
If we feel that service_area needs some rethinking or remodelling for future versions of the standard, please open a new issue :-)