treatments: Should this have a key?
The constraints spreadsheet offers (citation_id, site_id, name), but currently neither citation_id nor site_id are columns in this table. The GIST doesn't have a uniqueness constraint on this table (other than the usual primary key constraint on id).
Thinking about this again, the bulk upload code assumes that for any citation, the value of treatments.name will be unique across all treatments associated with the citation via the citations_treatments table. To see cases where this does not hold, run
select count(*), name AS "treatment name", format('%s %s', c.author, c.year) AS citation from treatments t join citations_treatments ct on ct.treatment_id = t.id join citations c on c.id = ct.citation_id group by c.author, c.year, name having count(*) > 1 order by author, year;
Often, however, like-named treatments are distinguished by definition, as the following query shows:
select count(*), name AS "treatment name", array_agg(definition) AS "definitions", format('%s %s', c.author, c.year) AS citation from treatments t join citations_treatments ct on ct.treatment_id = t.id join citations c on c.id = ct.citation_id group by c.author, c.year, name having count(*) > 1 order by author, year;
We could perhaps relax the constraint to "the value of the pair (treatments.name, treatments.definition) will be unique across all treatments associated with the citation via the citations_treatments table". This reduces the number of duplicates from 103 to 47 groups.
sites is somewhat more remotely related to treatments, and it is questionable what sort of constraint involving related sites would be applicable to the treatments table.
is this a duplicate of #249? See that issue - we can query the site_id, citatoin_id, and treatment_id from the traits or yields table.
But perhaps the constraint should just be that each treatment name be unique within the associated sites it is associated with in either the traits or yields tables.