bety icon indicating copy to clipboard operation
bety copied to clipboard

species value constraints: needed cleanup and decisions

Open gsrohde opened this issue 9 years ago • 14 comments

Constraints

spcd

  • [x] Decide on range
    • spcd is a foreign key to FIA database. okay to set as integer, range 0-10000
  • [x] Implement constraint

genus

  • [x] Decide if proposed constraint is ok
    • hyphens are allowed
    • Capitalize all records genus.
  • [x] Clean up table
  • [x] Add constraint to migration

species

  • [x] Decide on a reasonable constraint
    • hyphens, spaces, periods, 'var', 'x', are all allowed
    • trigger conversion from 'x' to multiplication symbol
  • [x] Add constraint to migration

scientificname

  • [x] Decide on a reasonable constraint
    • should contain at least genus and species
  • [ ] Clean up table
  • [ ] Add constraint to migration [UPDATE 9/15/2015: I've begun a list of proposed constraints for inclusion in some future migration in the file db/migrate/PROPOSED_CONSTRAINTS. Included in this is a constraint ensuring uniqueness of scientificname except in the case where it is empty. (See discussion of "partial index" below.)]

Cleanups

  • [ ] Clean up table
    • [x] run
    • [x] delete_duplicate_species.sql
    • [ ] consolodate_species.sql
    • [ ] put cleanup scripts in PecanProject/bety-qaqc

Details

spcd

Current min and max are 0 and 1007.

genus

We would like to require genus ~ '^([A-Z][-a-z]*)?$'. Use

SELECT * FROM species WHERE NOT genus ~ '^([A-Z][-a-z]*)?$';

to view outliers.

species

We will use the following restriction: species should be zero or more space-or-hyphen-separated groups of capital letters followed by a period, sequences of two or more letters possibly followed by a period, ampersands, and times symbols. The check looks like this:

CHECK (species ~ '^(([A-Z]\.|[a-zA-Z]{2,}\.?|&|\u00d7)( |-|$))*$')

Note that the multiplication sign (×) is the preferred symbol to use for naming hybrids (see http://en.wikipedia.org/wiki/Multiplication_sign) and not the letter "x". Since standardizing the spelling (including symbols used and whitespace) of names will help prevent duplicates, we should use one or the other and not use them interchangeably when storing names in the database. And we may as well use the preferred symbol. Since users will often use the letter "x" in place of the preferred symbol, a trigger function can be used to automatically correct this during INSERTs and UPDATEs.

scientificname

Generally, we would like the scientificname to be the full binomial name of the species, and so usually it should match genus || ' ' || species, possibly followed by some qualification (var. ... or ssp. ..., for example). The query

SELECT scientificname, genus, species FROM species WHERE scientificname !~ FORMAT('^%s %s', genus, species) AND species != '';

shows 192 cases where this match fails for some reason other than that the species has been left blank. (Note that the species is blank in all but 1729 of the 50006 rows of the table!) Sometimes this is due to misspellings, but often the genus names are totally different. (Perhaps some of these are synonyms.)

It might be thought that at the very least we can require that the scientificname column value should textually include both the genus value and the species value, but even this fails in case like scientificname = "P. balsamifera x P. simonii", genus = "Populus", species = "balsamifera x simonii".

UPDATE 9/15/2015

This query:

SELECT scientificname, genus, species FROM species WHERE strpos(scientificname, genus) = 0 or strpos(scientificname, species) = 0;

shows there are 141 rows where scientificname does not textually include both genus and species. In some cases this is because the genus is abbreviated in scientificname. This should perhaps be allowed. Even so, the query

SELECT scientificname, genus, species FROM species WHERE (strpos(scientificname, genus) = 0 or strpos(scientificname, species) = 0) and strpos(scientificname, '.') = 0;

shows there are 91 violations for other reasons.

Other columns

Constraints on Kingdom, Division, Class, and Family names should probably be similar to that for genus, but we won't concern ourselves with any of the columns to the right of the updated_at column for now.

gsrohde avatar Feb 16 '15 23:02 gsrohde