yields: make key
- [ ] Verify proposed key
- [ ] Clean up existing data
- [ ] Add constraints
Proposed key
The proposed key is ("citation_id", "site_id", "specie_id", "treatment_id", "cultivar_id", "method_id", "entity_id", "date_year", "date_month", "date_day"). The last four columns don't yet exist.
Note that the proposed key for traits includes both date and the date_year, date_month, and date_day columns. The proposed key here doesn't include date.
Not-NULL violations
SELECT * FROM yields WHERE citation_id IS NULL;
yields 12 rows.
SELECT * FROM yields WHERE site_id IS NULL;
yields 86 rows.
SELECT * FROM yields WHERE specie_id IS NULL;
yields 12 rows (the same rows as for citation_id).
SELECT COUNT(*) FROM yields WHERE treatment_id IS NULL;
shows there are 1030 rows with NULL treatment_id.
SELECT COUNT(*) FROM yields WHERE cultivar_id IS NULL;
shows there are 8390 rows with NULL cultivar_id. We either have to allow this column to be NULL or add a row (or rows) to cultivars to mean "unspecified cultivars" or "not applicable", etc.
SELECT COUNT(*) FROM yields WHERE method_id IS NULL;
shows there are 6835 rows with NULL method_id. Again, we have to decide how to handle these.
entity_id: Although the constraints spreadsheet and the GIST show entity_id as part of the proposed key, this has not yet been added to the yields table. If and when it is, we have to decide upon a value to use for existing rows unless we are prepared to give up the not-NULL constraint.
date_year, date_month, and date_day: Again, these are not yet in the yields table. We need to decide how to handle existing rows if we add them.
Uniqueness of existing data
The queries
SELECT COUNT(groups.count) FROM (SELECT COUNT(*) FROM yields GROUP BY site_id, specie_id, citation_id, cultivar_id, treatment_id, date, method_id HAVING COUNT(*) > 1) AS groups;
SELECT COUNT(*) FROM yields y1 WHERE EXISTS(SELECT 1 FROM yields y2 WHERE y2.id != y1.id AND y2.site_id IS NOT DISTINCT FROM y1.site_id AND y2.specie_id IS NOT DISTINCT FROM y1.specie_id AND y2.citation_id IS NOT DISTINCT FROM y1.citation_id AND y2.cultivar_id IS NOT DISTINCT FROM y1.cultivar_id AND y2.treatment_id IS NOT DISTINCT FROM y1.treatment_id AND y2.method_id IS NOT DISTINCT FROM y1.method_id AND y2.date IS NOT DISTINCT FROM y1.date);
show there are 1252 groups of duplicates comprising 5452 rows (out of a total of 8390). Here, I have used date in lieu of the not-yet-existing date_year, date_month, and date_day columns.
If add the mean column to the criterion for duplicates, this number drops to 1013, so as in the case for traits, much of the apparent duplication is simply a failure to record enough about a measurement to distinguish it from other measurements by anything other than its value.
(Here's the query:
SELECT COUNT(*) FROM yields y1 WHERE EXISTS(SELECT 1 FROM yields y2 WHERE y2.id != y1.id AND y2.site_id IS NOT DISTINCT FROM y1.site_id AND y2.specie_id IS NOT DISTINCT FROM y1.specie_id AND y2.citation_id IS NOT DISTINCT FROM y1.citation_id AND y2.cultivar_id IS NOT DISTINCT FROM y1.cultivar_id AND y2.treatment_id IS NOT DISTINCT FROM y1.treatment_id AND y2.method_id IS NOT DISTINCT FROM y1.method_id AND y2.date IS NOT DISTINCT FROM y1.date AND y2.mean = y1.mean);
```)
Update
- no yields with missing citation_id
- cultivar_id and method_id can be null
- can set treatment to a generic treatment that is observational and control = TRUE