[x] Optional but recommended--see below. Normalize white space in title and author columns
[x] Consolidate duplicate citation rows
[ ] Add citation uniqueness constraint

~~Add not-NULL constraints - READY TO BE DONE~~ ~~Deferred to issue #230~~ DONE

[x] fix records with year = NULL

Tabled for now ~~Optional--see below. Add a trigger function to ensure title and author columns are whitespace-normalized upon insertion or update~~

The minimal constraint we wish to add is

ALTER TABLE citations ADD CONSTRAINT unique_author_year_title UNIQUE (author, year, title);

To see violators, run

SELECT ARRAY_AGG(id), author, year, title FROM citations GROUP BY author, year, title HAVING count(*) > 1;

This currently produces 8 pairs of duplicates on ebi_production.

THE WHITESPACE NORMALIZATION OF AUTHORS AND TITLES MENTIONED BELOW HAS BEEN DONE. WE WILL NOT ATTEMPT TO NORMALIZE CAPITALIZATION.

Ideally, however, we would like to go beyond this minimal constraint and prevent the occurrence of two rows where the authors and/or titles don't exactly match but differ only minimally--for example, they differ only in whitespace or capitalization. To see the problem, run this variant of the above query:

SELECT
    ARRAY_AGG(id),
    REGEXP_REPLACE(TRIM(LOWER(author)), ' +', ' '),
    year,
    REGEXP_REPLACE(TRIM(LOWER(title)), ' +', ' ')
FROM
    citations
GROUP BY
    REGEXP_REPLACE(TRIM(LOWER(author)), ' +', ' '),
    year,
    REGEXP_REPLACE(TRIM(LOWER(title)), ' +', ' ')
HAVING
    COUNT (*) > 1;

This produces 11 pairs of duplicates--three more than when we require exact matching.

We can go even further in this direction by using the PostgreSQL trigram extension:

CREATE EXTENSION pg_trgm; -- provides the 'similar' function

SELECT
    c1.id,
    c1.author,
    c1.year,
    c1.title,
    c2.id,
    c2.author,
    c2.year,
    c2.title
FROM
    citations c1
JOIN citations c2 ON (
    c1.id < c2.id
    AND SIMILARITY(c1.author, c2.author) > 0.9
    AND c1.year = c2.year
    AND SIMILARITY(c1.title, c2.title) > 0.5
);

This finds a few more potential duplicates (compare citations with ids 12 and 13 and with ids 675 and 676 on ebi_forecast).

Options for preventing near matches.

Here are two possible sets of constraints to prevent near matches:

ALTER TABLE citations ADD CONSTRAINT normalized_titles CHECK (title = REGEXP_REPLACE(TRIM(title), ' +', ' '));
ALTER TABLE citations ADD CONSTRAINT normalized_authors CHECK (author = REGEXP_REPLACE(TRIM(author), ' +', ' '));
CREATE UNIQUE INDEX citations_natural_key ON citations (LOWER(author), year, LOWER(title));

Alternatively:

CREATE UNIQUE INDEX citations_natural_key ON citations (REGEXP_REPLACE(TRIM(LOWER(author)), ' +', ' '), year, REGEXP_REPLACE(TRIM(LOWER(title)), ' +', ' '));

The first set has the advantage of ensuring the title and author fields are always whitespace-normalized. This may be a useful thing for client programs to be able to rely upon for display, searching, matching, etc.

Note that we have to switch to using unique indices where we want our uniqueness constraint to be anything more complicated than a simple column list.

The second set has the advantage of requiring minimal cleanup. But this is a small advantage, since the normalization is easily done:

UPDATE citations SET title = REGEXP_REPLACE(TRIM(title), ' +', ' ');
UPDATE citations SET author = REGEXP_REPLACE(TRIM(author), ' +', ' ');

In either case, a trigger function could also be added to ensure that the author and title columns are automatically normalized when an insertion or update occurs.

Finally, in order that (author, year, title) truly constitute a natural key, we should impose non-null constraints:

ALTER TABLE citations ALTER COLUMN author SET NOT NULL;
ALTER TABLE citations ALTER COLUMN year SET NOT NULL;
ALTER TABLE citations ALTER COLUMN title SET NOT NULL;

Currently, in ebi_production, only year is ever null. If this is meant to mean "PUBLICATION DATE UNKNOWN" or "UNPUBLISHED", special integer values should be designated to represent these meanings. Alternatively, year should be changed to some user-defined type that comprises both numbers, representing year of publication, and special non-numerical values used for other meanings.

Jan 23 '15 20:01 gsrohde

Fixed citations with null year

added manually through interface
removed one with no information (citation_id = 731)

delete from citations_sites where site_id = 1160 and citation_id = 731;
delete from citations where id = 731;

removed citation only used on orphaned management

delete from citations_treatments where citation_id = 733;
delete from managements where citation_id = 733;
delete from citations where id = 733;

Jan 26 '15 17:01 dlebauer

Constraint

CREATE UNIQUE INDEX author_year_title ON citations (LOWER(author), year, LOWER(title));

has been added to db/migrate/PROPOSED_CONSTRAINTS.

All necessary cleanup for this has been done.

Sep 18 '15 20:09 gsrohde

@gsrohde is this ready to be deployed and the issue closed?

Jul 12 '16 19:07 dlebauer

@dlebauer

There still two pairs of duplicates because of new citations created in April:

SELECT ARRAY_AGG(id), author, year, title FROM citations GROUP BY author, year, title HAVING count(*) > 1;
        array_agg        |     author      | year |                                                              title                 

-------------------------+-----------------+------+------------------------------------------------------------------------------------
---------------------------------------------
 {2000000010,2000000005} | Serbin, Shawn   | 2015 | Remotely estimating photosynthetic capacity, and its response to temperature, in ve
getation canopies using imaging spectroscopy
 {2000000003,842}        | Shawn P. Serbin | 2014 | Spectroscopic determination of leaf morphological and biochemical traits for northe
rn temperate and boreal tree species
(2 rows)

Once this is fixed, I can add the uniqueness constraint.

Aug 03 '16 22:08 gsrohde

@serbinsh can you fix these duplicate records on your machine?

Aug 03 '16 22:08 dlebauer

Do I just need to remove one of the records for each?

Sep 28 '16 19:09 serbinsh

Done

Sep 28 '16 19:09 serbinsh

@serbinsh did you make sure to update all referents?

here is a function for removing duplicates: https://github.com/PecanProject/bety/issues/185 it doesn't work across sites, but this could be solved if db managers for all of the sites run it.

Sep 28 '16 19:09 dlebauer

@dlebauer What do you mean?

I removed both dups:

bety=> SELECT ARRAY_AGG(id), author, year, title FROM citations GROUP BY author, year, title HAVING count(*) > 1;
 array_agg | author | year | title
-----------+--------+------+-------
(0 rows)

I then dumped my DB

Sep 28 '16 19:09 serbinsh

@serbinsh did you update all of the tables with a citation_id field?

Sep 28 '16 19:09 dlebauer

Probably not, can you elaborate? What do you mean "update"? Sorry for being slow I just feel like I missed something

you mean using the UPDATE command?

Sep 28 '16 19:09 serbinsh

@serbinsh If you go to http://bety-dev:3000/schemas?partial=citations_table, you can see all the tables that have a citation_id column. (Check Show One Degree and uncheck Show Two Degree to get a simplified view.) No row in any of these tables should have a value in the citation_id column corresponding to the id of a row you deleted.

Sep 28 '16 19:09 gsrohde

@gsrohde I can't access that site.....I get a server not found error

Sep 28 '16 19:09 serbinsh

Sorry, I meant https://www.betydb.org/schemas?partial=citations_table.

Sep 28 '16 19:09 gsrohde

Actually, though, it's only the traits table you have to worry about. We have foreign-key constaints in place for all of the other tables, so you wouldn't be able to delete the citation if any of those other tables refer to it.

Sep 28 '16 19:09 gsrohde

Yeah, I could not find anything on that schemas page. I think we are OK.

Sep 28 '16 19:09 serbinsh

@serbinsh you should have seen this:

Sep 28 '16 20:09 dlebauer

Slight correction to my earlier remark: Not only do you have to worry about traits referring to the deleted citations. It is also the case that if the citations_sites or citations_treatments table refers to a citation you are deleting, the referring rows will be silently deleted as well. managements, methods, priors, and yields will all complain if you try to delete a referred-to citation, either via SQL or the Web interface.

Sep 28 '16 20:09 gsrohde

citations: add uniqueness constraint

Options for preventing near matches.

Fixed citations with null year