citations value constraints: needed cleanup and decisions
author, title
- [x] Set not null and check whitespace normalized.
year
- [x] Set not null.
- [x] Add CHECK(year <= EXTRACT(year FROM NOW()) + 1).
- [ ] Decide on minimum for year and fix violations; then add constraint.
vol
- [x] Decide if we can add CHECK (vol > 0).
- no
If so:
- [x] Fix violations.
- [x] Add constraint to new migration.
pg
- [x] Decide if we can add CHECK (pg ~ '^([1-9]\d*(\u2013[1-9]\d*)?)?$').
If so:
- [x] Fix violations.
- [x] Make an issue to update the BETYdb Rails app to insert pg in the right form and/or write a trigger function to do so.
- [x] Add constraint to new migration.
url, pdf
- [x] Decide if we can constrain them to match to look like Web URLs or the empty string.
If so:
- [x] Fix violations.
- [x] Add constraints to new migration.
doi
- [x] Decide if we can add CHECK (doi ~ '^(|10.\d+(.\d+)?/.+)$').
If so:
- [x] Fix violations.
- [x] Add constraint to new migration.
Details
author, title
These will be part of a candidate key, but cleanup is required before a uniqueness constraint can be applied, so we only apply the value constraints now.
year
With two exceptions, the earliest year is 1951. The exception with year 0 looks like a test for. The exception with year 201 looks like a typo.
vol
SELECT * FROM citations WHERE NOT vol > 0;
shows the offending rows. I assume the row with "test" in it can be removed.
The other row is a book (whose title is in the journal column), so perhaps 0 here means vol is inapplicable. Note that in this case the citation is the whole book. Other cases where NULL is inapplicable are when the book is conference proceedings and the citation is thus to a small portion of the book. (In this case pg is usually non-empty.)
Options
- Easy option: just allow 0; use CHECK (vol >= 0) instead. Don't worry that no semantic distinction is being made between 0 and NULL (in fact we could go futher and replace all NULLs with 0 and add a NOT NULL constraint). Don't worry that some journal someday starts using 0 as a bona fide volume number.
- Easy option 2: Keep the constraint as is, continue to allow NULL, and change the 0 to NULL.
- Harder option: Allow both 0 and NULL, but make some semantic distinction in their uses.
pg
Violators have been fixed.
We will have to fix the Rails interface to change hyphen to n-dash when entries are added or updated.
url, pdf
These are now required to either be bona fide URLs with an http or https schema, be the empty string, or be some parenthesized non-empty string. The later accommodates entries such as "(paper copy available in blue folder)" and "(email:[email protected])".
doi
Violators have been fixed. The Rails interface should validate and/or automatically fix invalid entries.
@gsrohde is this ready to be implemented?
@dlebauer
All constraints have already been added except for minimum year.
select * from citations where year < 1900;
returns
id | author | year | title |
journal | vol | pg | url | pdf | created_at | updated_at | doi | user_i
d
------------+--------+------+---------------------------------------------------------------------------------------+------------------
---------------------------------------+-----+----+-----+-----+----------------------------+----------------------------+-----+--------
----
1000000001 | | 1800 | FACE EXPERIMENTS |
| | | | | 2014-09-25 00:24:42.949729 | 2014-09-25 00:24:42.949729 | | 1000000
003
1000000002 | | 1800 | MACROSYSTEMS SITES |
| | | | | 2015-01-07 14:41:39.61514 | 2015-01-07 14:41:39.61514 | | 1000000
005
619 | Wang | 201 | Predicted yields of short-rotation hybrid poplar (Populus spp.) for the contiguous US | Ecological Applic
ations (accepted with minor revisions) | | | | | 2012-09-20 00:22:20 | 2012-09-20 00:22:20 | |
2
780 | Osnas | 0 | test | test
| 0 | | | | 2015-01-06 21:03:58.869857 | 2015-03-12 15:20:39.276193 | |
162
(4 rows)
So two questions: Should some of these be amended or removed and should we have a minimum year constraint (and if so, what?—that's three questions).
I cleaned up Wang and Osnas. @mdietze can fix the other two at BU if necessary.
1440 would be a sensible minimum value for publication year because https://en.m.wikipedia.org/wiki/Printing_press