managements: make (date, mgmttype) a key
- ~~Eliminate NULL date values~~ --Won't do (see below)
- [ ] Consolidate duplicate rows
- [ ] Add NOT NULL and UNIQUE constraints
- [ ] Add whitespace-normalization constraint
NULL date values
SELECT * FROM managements WHERE mgmttype IS NULL;
yields 200 rows!
Non-uniqueness of (date, mgmttype)
SELECT ARRAY_AGG(id), date, mgmttype FROM managements GROUP BY date, mgmttype HAVING COUNT(*) > 1;
yields 678 groups, some involving as many as two dozen or more rows! This will only increase after NULLs are eliminated. Is this really a viable key, or do we need to include another column?
To see the full rows of all of the duplicates, try running
SELECT * FROM managements m WHERE EXISTS(SELECT 1 FROM managements m2 WHERE m.id != m2.id AND m.date = m2.date AND m.mgmttype = m2.mgmttype) ORDER BY mgmttype, date;
Adding constraints
The code is
ALTER TABLE managements ALTER COLUMN date SET NOT NULL;
ALTER TABLE managements ALTER COLUMN mgmttype SET NOT NULL;
ALTER TABLE managements ADD CONSTRAINT unique_date_per_mgmttype UNIQUE (date, mgmttype);
In addition to constraints for implementing the key, we should ensure mgmttype is whitespace-normalized.
date can be null. Many of these are fertilization rates:
SELECT mgmttype, count(*) as n
FROM managements
WHERE date IS NULL group by mgmttype order by n desc;
and others are rates of irrigation etc. Where the date(s) of fertilization is not known and may have occurred repeatedly over many years. So we can allow NULL dates for managements.
For the duplicates, it is only meaningful to expect that (date, mgmttype) will be unique for the subset of records associated with a single treatment_id.
Even if you group by not only date and mgmttype but level and units as well, there are plenty of duplicates even among managements associated with a common treatment. Try running
select distinct count(*), array_agg(m.id) treatment_id, date, mgmttype, level, units from managements_treatments mt join managements m on m.id = mt.management_id group by treatment_id, date, mgmttype, level, units having count(*) > 1;
At least most of the groups of duplicates are only of size 2. Decide if you want to worry about this.