priors: make (variable_id, phylogeny, notes) and (citation_id, variable_id, phylogeny) keys
- [ ] Eliminate NULLS
- [ ] Add not NULL constraints
- [ ] Consolidate or distinguish duplicates
- [ ] Add uniqueness constraints
- [ ] Ensure phylogeny column is whitespace-normalized
Eliminate existing NULLS
SELECT * FROM priors WHERE citation_id IS NULL;
yields 77 rows.
SELECT * FROM priors WHERE phylogeny IS NULL;
yields 14 rows.
SELECT * FROM priors WHERE notes IS NULL;
yields 13 rows.
variable_id is never NULL.
Add not NULL constraints
ALTER TABLE priors ALTER COLUMN citation_id SET NOT NULL;
ALTER TABLE priors ALTER COLUMN variable_id SET NOT NULL;
ALTER TABLE priors ALTER COLUMN phylogeny SET NOT NULL;
ALTER TABLE priors ALTER COLUMN notes SET NOT NULL;
Add uniqueness constraints
SELECT ARRAY_AGG(id), variable_id, phylogeny, notes FROM priors GROUP BY variable_id, phylogeny, notes HAVING COUNT(*) > 1;
yields 6 groups of duplicates of size 2 or 3.
SELECT ARRAY_AGG(id), citation_id, variable_id, phylogeny FROM priors GROUP BY variable_id, citation_id, phylogeny HAVING COUNT(*) > 1;
yields 10 groups of duplicates of size 2 or 3.
Uniqueness constraints
ALTER TABLE priors ADD CONSTRAINT unique_note_per_phylogeny_and_variable UNIQUE (variable_id, phylogeny, notes);
ALTER TABLE priors ADD CONSTRAINT unique_phylogeny_and_variable_per_citation UNIQUE (citation_id, variable_id, phylogeny);
COMMENT ON CONSTRAINT "unique_note_per_phylogeny_and_variable" ON "public"."priors" IS 'ensures that if there are two priors for the same group of plants, there will be an explanation of the difference in the notes; alternatively, there will be distinct citations that define the source';
COMMENT ON CONSTRAINT "unique_phylogeny_and_variable_per_citation" ON "public"."priors" IS 'phylogeny-variable is either unique b/c of citation or difference is written into the notes';
Whitespace normalization
Ensuring phylogeny is whitespace-normalized will strengthen the key containing it. We think notes, by its nature, should be allowed to be more free-form.
Bump! This may be related to the "uniqueness" aspect, but I think it should also be the case that each PFT can only have one prior per variable. I think this means putting a UNIQUE (pft_id, priors.variable_id) on the join of the pfts_priors and priors table.
As it is, PEcAn.DB::query.priors will fail with a duplicate rownames error if a PFT has more than one prior for a given variable.
This may be a separate issue -- if so, let me know and I'll open a separate one.