bety icon indicating copy to clipboard operation
bety copied to clipboard

orphaned treatments?

Open dlebauer opened this issue 10 years ago • 7 comments

This search shows that there treatment_ids 1-10 exist in management_treatments, but 2-10 do not have referents in traits or yields.

select distinct treatment_id from managements_treatments order by treatment_id limit 10
select distinct treatment_id from yields order by treatment_id limit 10
select distinct treatment_id from managements_treatments order by treatment_id limit 10
select distinct treatment_id from traits_and_yields_view_private order by treatment_id limit 10

But if I try to identify treatment_ids that exist in managments_treatments but not in other tables,

 select treatment_id from managements_treatments where treatment_id not in (select distinct treatment_id from traits_and_yields_view_private)
select treatment_id from managements_treatments where treatment_id not in (select distinct treatment_id from traits_and_yields_view_private)
select treatment_id from managements_treatments where treatment_id not in (select distinct id from treatments)

I tried disabling constraints but that did not help.

alter table managements disable trigger all
alter table treatments disable trigger all
alter table managements_treatments disable trigger all
alter table yields disable trigger all
alter table traits disable trigger all

dlebauer avatar Aug 08 '15 01:08 dlebauer

I'm not sure exactly why the subselects aren't working the way one would expect (though I'm sure constraints have nothing to do with it). For now, you can use a join instead, like this:

select distinct mt.treatment_id from managements_treatments mt 
        left join traits_and_yields_view_private tyvp on mt.treatment_id = tyvp.treatment_id 
                where tyvp.id is null order by mt.treatment_id;

gsrohde avatar Aug 08 '15 04:08 gsrohde

It's probably a bit more intuitive to do things this way:

SELECT DISTINCT treatment_id FROM managements_treatments AS mt 
        WHERE NOT EXISTS(
                SELECT 1 FROM traits_and_yields_view_private AS tyvp 
                        WHERE tyvp.treatment_id = mt.treatment_id
        ) ORDER BY treatment_id;

(The "AS" keyword here is optional--you can leave them out and get the same effect. And in this case at least, you don't really need to alias the table names at all except to save some typing.)

gsrohde avatar Aug 08 '15 05:08 gsrohde

This is even simpler (and perhaps faster):

SELECT DISTINCT treatment_id FROM managements_treatments 
        EXCEPT 
    SELECT treatment_id FROM traits_and_yields_view_private 
        ORDER BY treatment_id;

gsrohde avatar Aug 08 '15 05:08 gsrohde

Mystery Solved!

It appears that your original query didn't work because of SQL's perverse 3-valued logic.

SELECT treatment_id FROM managements_treatments 
        WHERE treatment_id NOT IN 
    (SELECT DISTINCT treatment_id FROM traits_and_yields_view_private);

would have worked as expected if the traits_and_yields_view_private table didn't have any NULLs in the treatment_id column. But it does, so the WHERE clause evaluates either to true [correction: false; see following note] (if the given treatment_id is found) or NULL (if it isn't). If you think of NULL as standing for an unknown value, then the WHERE clause never returns false [correction: never returns true; see following note] when the table has a NULL because we can't be sure that that NULL isn't a stand-in for the value we're looking for. So unless and until we constrain the treatment_id column of the traits_and_yields_view_private table to be non-NULL, we have to use this query instead:

SELECT treatment_id FROM managements_treatments 
        WHERE treatment_id NOT IN 
    (SELECT DISTINCT treatment_id FROM traits_and_yields_view_private
             WHERE treatment_id IS NOT NULL);

(By the way, you probably want the DISTINCT next to the first occurrance of treatment_id, not in the subselect).

gsrohde avatar Aug 08 '15 06:08 gsrohde

Sorry--I got it backwards: x IN list is never false if list contains a NULL, so x NOT IN list is never true if list contains a NULL. This means the overall SELECT statement will return zero rows, as we have seen, since the WHERE clause never evaluates to true.

To summarize:

x IN list IS TRUE if x occurs in list, regardless of whether or not NULL occurs in list
             IS NULL if x does not occur in list but NULL does
             IS FALSE if neither x nor NULL occur in list

x NOT IN list IS FALSE if x occurs in list, regardless of whether or not NULL occurs in list
                  IS NULL if x does not occur in list but NULL does
                  IS TRUE if neither x nor NULL occur in list

gsrohde avatar Aug 08 '15 12:08 gsrohde

I'm not sure why you closed this. This query shows 575 treatments are not referred to by any trait or yield:

SELECT DISTINCT treatment_id FROM managements_treatments 
        WHERE treatment_id NOT IN 
    (SELECT treatment_id FROM traits_and_yields_view_private
             WHERE treatment_id IS NOT NULL) ORDER BY treatment_id;

If you don't care about this or if this issue is duplicated in some other issue, then I guess you can close it again.

gsrohde avatar Sep 24 '15 19:09 gsrohde

These records have been orphaned and can be deleted; please delete them.

I think this would be

delete from treatments where id in (SELECT DISTINCT treatment_id FROM managements_treatments 
        WHERE treatment_id NOT IN 
    (SELECT treatment_id FROM traits_and_yields_view_private
             WHERE treatment_id IS NOT NULL));

Then delete the orphaned managements

delete from managements_treatments where treatment_id NOT IN 
   (SELECT treatment_id FROM traits_and_yields_view_private
             WHERE treatment_id IS NOT NULL);

delete from managements where id NOT IN 
   (SELECT management_id FROM managements_treatments);

dlebauer avatar Jul 12 '16 19:07 dlebauer