sites: make key
- [ ] Verify proposed key
- [ ] Eliminate NULLs from geometry column
- [ ] Eliminate NULLs from sitename column
- [ ] Consolidate duplicate rows
- [ ] Add constraints for key
- [x] Add whitespace constraint on sitename column
Proposed key
Originally, we had proposed sitename as a key, but the latest documentation proposes (lat, lon, sitename). lat and lon have since been replaced by geometry, so we could use (st_y(st_centroid(geometry)), st_x(st_centroid(geometry)), sitename), but sites with different geometry values probably really are supposed to be distinct, even if the latitude and longitude of their centroids match, so we can probably just use (geometry, sitename).
It is worth considering requiring site names to be distinct even when geometries are not identical but only very close. (For now, I'm only considering POINT locations.) For example:
SELECT SUBSTRING(s1.sitename, 1, 25) AS "site 1", SUBSTRING(s2.sitename, 1, 25) AS "site 2", ST_Y(s1.geometry) || ', ' || ST_X(s1.geometry) AS "site 1 location", ST_Y(s2.geometry) || ', ' || ST_X(s2.geometry) AS "site 2 location" FROM sites s1, sites s2 WHERE NOT s1.geometry = s2.geometry AND ST_DISTANCE(s1.geometry, s2.geometry) < .01 AND GeometryType(s1.geometry) = 'POINT' AND GeometryType(s2.geometry) = 'POINT';
shows 111 pairs of sites that are very close, some with identical site names. Even in cases where the site names differ, there may be cases where the sites should be consolidated. (I ran this in psql, but if you run in Navicat, you can probably dispense with the SUBSTRING calls.)
In any case, even if we chose to require identical geometries before we consider two rows to be duplicates, it is still worth examining existing data to see if some nearby sites should be consolidated.
Existing NULLs in geometry column
SELECT id, sitename, city, state, country FROM sites WHERE geometry IS NULL;
yields 26 rows.
Existing NULLs in sitename column
SELECT id, city, state, country, ST_AsText(geometry) FROM sites WHERE sitename IS NULL ORDER BY country, state, city, id;
yields 58 rows. These NULL site names could be converted to the empty string unless a bona fide site name is required for the uniqueness constraint.
Duplicate rows
SELECT ARRAY_AGG(id), sitename, ST_X(geometry) AS lon, ST_Y(geometry) AS lat FROM sites GROUP BY sitename, geometry HAVING COUNT(*) > 1;
yields 27 pairs of duplicates.
SELECT id, CASE WHEN sitename != '' THEN 'sitename: '||sitename ELSE city||', '||state||', '||country END AS location, ST_X(geometry) AS lon, ST_Y(geometry) AS lat FROM sites s1 WHERE EXISTS(SELECT 1 FROM sites s2 WHERE s2.id != s1.id AND s2.geometry IS NOT DISTINCT FROM s1.geometry AND s2.sitename IS NOT DISTINCT FROM s1.sitename) ORDER BY geometry, sitename;
shows more details of the rows involved. (Using IS NOT DISTINCT FROM in place of = is needed to include cases where both values are NULL.)
If we were to consider using geometry alone as a key, the query
SELECT id, SUBSTRING(CASE WHEN sitename != '' THEN 'sitename: '||sitename ELSE city||', '||state||', '||country END, 1, 90) AS location, ST_X(geometry) AS lon, ST_Y(geometry) AS lat FROM sites s1 WHERE EXISTS(SELECT 1 FROM sites s2 WHERE s2.id != s1.id AND s2.geometry = s1.geometry AND s2.sitename IS DISTINCT FROM s1.sitename) ORDER BY geometry, sitename;
shows 162 rows where the geometry matches the geometry of some other row with a different sitename. This doesn't count the 26 rows where the geometry is NULL.
Conversely, if we were to consider using sitename alone as a key, the query
SELECT id, SUBSTRING(sitename, 1, 90) AS sitename, ST_X(geometry) AS lon, ST_Y(geometry) AS lat FROM sites s1 WHERE sitename != '' AND EXISTS(SELECT 1 FROM sites s2 WHERE s2.id != s1.id AND s2.geometry IS DISTINCT FROM s1.geometry AND s2.sitename = s1.sitename) ORDER BY sitename;
shows 45 rows where the geometry value is needed to distinguish rows with identical sitename values. If we include rows where the sitename is the empty string, this increases to 237 rows.
Constraints
Assuming we go with (geometry, sitename) as a key, the constraints are
ALTER TABLE sites ALTER COLUMN geometry ADD NOT NULL;
ALTER TABLE sites ALTER COLUMN sitename ADD NOT NULL;
ALTER TABLE sites ADD CONSTRAINT unique_sitename_per_location UNIQUE (geometry, sitename);
In addition, we should add a whitespace-normalization constraint on the sitename column.
We can import this gist https://gist.github.com/dlebauer/be9eedffeadbdb71bf82 where I have standardized the city, state, and country codes (using the ArcGIS standardize addresses toolbox)