citations: add uniqueness constraint
- [x] Optional but recommended--see below. Normalize white space in title and author columns
- [x] Consolidate duplicate citation rows
- [ ] Add citation uniqueness constraint
~~Add not-NULL constraints - READY TO BE DONE~~ ~~Deferred to issue #230~~ DONE
- [x] fix records with year = NULL
Tabled for now ~~Optional--see below. Add a trigger function to ensure title and author columns are whitespace-normalized upon insertion or update~~
The minimal constraint we wish to add is
ALTER TABLE citations ADD CONSTRAINT unique_author_year_title UNIQUE (author, year, title);
To see violators, run
SELECT ARRAY_AGG(id), author, year, title FROM citations GROUP BY author, year, title HAVING count(*) > 1;
This currently produces 8 pairs of duplicates on ebi_production.
THE WHITESPACE NORMALIZATION OF AUTHORS AND TITLES MENTIONED BELOW HAS BEEN DONE. WE WILL NOT ATTEMPT TO NORMALIZE CAPITALIZATION.
Ideally, however, we would like to go beyond this minimal constraint and prevent the occurrence of two rows where the authors and/or titles don't exactly match but differ only minimally--for example, they differ only in whitespace or capitalization. To see the problem, run this variant of the above query:
SELECT
ARRAY_AGG(id),
REGEXP_REPLACE(TRIM(LOWER(author)), ' +', ' '),
year,
REGEXP_REPLACE(TRIM(LOWER(title)), ' +', ' ')
FROM
citations
GROUP BY
REGEXP_REPLACE(TRIM(LOWER(author)), ' +', ' '),
year,
REGEXP_REPLACE(TRIM(LOWER(title)), ' +', ' ')
HAVING
COUNT (*) > 1;
This produces 11 pairs of duplicates--three more than when we require exact matching.
We can go even further in this direction by using the PostgreSQL trigram extension:
CREATE EXTENSION pg_trgm; -- provides the 'similar' function
SELECT
c1.id,
c1.author,
c1.year,
c1.title,
c2.id,
c2.author,
c2.year,
c2.title
FROM
citations c1
JOIN citations c2 ON (
c1.id < c2.id
AND SIMILARITY(c1.author, c2.author) > 0.9
AND c1.year = c2.year
AND SIMILARITY(c1.title, c2.title) > 0.5
);
This finds a few more potential duplicates (compare citations with ids 12 and 13 and with ids 675 and 676 on ebi_forecast).
Options for preventing near matches.
Here are two possible sets of constraints to prevent near matches:
ALTER TABLE citations ADD CONSTRAINT normalized_titles CHECK (title = REGEXP_REPLACE(TRIM(title), ' +', ' '));
ALTER TABLE citations ADD CONSTRAINT normalized_authors CHECK (author = REGEXP_REPLACE(TRIM(author), ' +', ' '));
CREATE UNIQUE INDEX citations_natural_key ON citations (LOWER(author), year, LOWER(title));
Alternatively:
CREATE UNIQUE INDEX citations_natural_key ON citations (REGEXP_REPLACE(TRIM(LOWER(author)), ' +', ' '), year, REGEXP_REPLACE(TRIM(LOWER(title)), ' +', ' '));
The first set has the advantage of ensuring the title and author fields are always whitespace-normalized. This may be a useful thing for client programs to be able to rely upon for display, searching, matching, etc.
Note that we have to switch to using unique indices where we want our uniqueness constraint to be anything more complicated than a simple column list.
The second set has the advantage of requiring minimal cleanup. But this is a small advantage, since the normalization is easily done:
UPDATE citations SET title = REGEXP_REPLACE(TRIM(title), ' +', ' ');
UPDATE citations SET author = REGEXP_REPLACE(TRIM(author), ' +', ' ');
In either case, a trigger function could also be added to ensure that the author and title columns are automatically normalized when an insertion or update occurs.
Finally, in order that (author, year, title) truly constitute a natural key, we should impose non-null constraints:
ALTER TABLE citations ALTER COLUMN author SET NOT NULL;
ALTER TABLE citations ALTER COLUMN year SET NOT NULL;
ALTER TABLE citations ALTER COLUMN title SET NOT NULL;
Currently, in ebi_production, only year is ever null. If this is meant to mean "PUBLICATION DATE UNKNOWN" or "UNPUBLISHED", special integer values should be designated to represent these meanings. Alternatively, year should be changed to some user-defined type that comprises both numbers, representing year of publication, and special non-numerical values used for other meanings.
Fixed citations with null year
- added manually through interface
- removed one with no information (citation_id = 731)
delete from citations_sites where site_id = 1160 and citation_id = 731;
delete from citations where id = 731;
- removed citation only used on orphaned management
delete from citations_treatments where citation_id = 733;
delete from managements where citation_id = 733;
delete from citations where id = 733;
Constraint
CREATE UNIQUE INDEX author_year_title ON citations (LOWER(author), year, LOWER(title));
has been added to db/migrate/PROPOSED_CONSTRAINTS.
All necessary cleanup for this has been done.
@gsrohde is this ready to be deployed and the issue closed?
@dlebauer
There still two pairs of duplicates because of new citations created in April:
SELECT ARRAY_AGG(id), author, year, title FROM citations GROUP BY author, year, title HAVING count(*) > 1;
array_agg | author | year | title
-------------------------+-----------------+------+------------------------------------------------------------------------------------
---------------------------------------------
{2000000010,2000000005} | Serbin, Shawn | 2015 | Remotely estimating photosynthetic capacity, and its response to temperature, in ve
getation canopies using imaging spectroscopy
{2000000003,842} | Shawn P. Serbin | 2014 | Spectroscopic determination of leaf morphological and biochemical traits for northe
rn temperate and boreal tree species
(2 rows)
Once this is fixed, I can add the uniqueness constraint.
@serbinsh can you fix these duplicate records on your machine?
Do I just need to remove one of the records for each?
Done
@serbinsh did you make sure to update all referents?
here is a function for removing duplicates: https://github.com/PecanProject/bety/issues/185 it doesn't work across sites, but this could be solved if db managers for all of the sites run it.
@dlebauer What do you mean?
I removed both dups:
bety=> SELECT ARRAY_AGG(id), author, year, title FROM citations GROUP BY author, year, title HAVING count(*) > 1;
array_agg | author | year | title
-----------+--------+------+-------
(0 rows)
I then dumped my DB
@serbinsh did you update all of the tables with a citation_id field?
Probably not, can you elaborate? What do you mean "update"? Sorry for being slow I just feel like I missed something
you mean using the UPDATE command?
@serbinsh If you go to http://bety-dev:3000/schemas?partial=citations_table, you can see all the tables that have a citation_id column. (Check Show One Degree and uncheck Show Two Degree to get a simplified view.) No row in any of these tables should have a value in the citation_id column corresponding to the id of a row you deleted.
@gsrohde I can't access that site.....I get a server not found error
Sorry, I meant https://www.betydb.org/schemas?partial=citations_table.
Actually, though, it's only the traits table you have to worry about. We have foreign-key constaints in place for all of the other tables, so you wouldn't be able to delete the citation if any of those other tables refer to it.
Yeah, I could not find anything on that schemas page. I think we are OK.
@serbinsh you should have seen this:

Slight correction to my earlier remark: Not only do you have to worry about traits referring to the deleted citations. It is also the case that if the citations_sites or citations_treatments table refers to a citation you are deleting, the referring rows will be silently deleted as well. managements, methods, priors, and yields will all complain if you try to delete a referred-to citation, either via SQL or the Web interface.