
Remove (and enforce?) duplicate priors

Open · ashiklom opened this issue 5 years ago · 1 comment

There must be exactly one prior per variable per PFT. However, in the BETY dump used by default on Docker, there are at least 4 PFT-variable combinations with multiple priors:

# `con` is an open DBI connection to the BETY PostgreSQL database
library(dplyr)

# Join priors to their PFT associations
priors_pfts <- tbl(con, "priors") %>%
  inner_join(tbl(con, "pfts_priors"),
             by = c("id" = "prior_id"),
             suffix = c(".priors", ".pfts_priors")) %>%
  collect()

# Flag PFT-variable combinations with more than one prior
priors_pfts %>%
  group_by(pft_id, variable_id) %>%
  count() %>%
  filter(n > 1)

# # A tibble: 4 x 3
# # Groups:   pft_id, variable_id [4]
#   pft_id          variable_id         n
#   <S3: integer64> <S3: integer64> <int>
# 1         16               7          2
# 2 1000000122              41          2
# 3 1000000130      1000000012          2
# 4 1000000130      1000000015          2

First, and easiest, we need to remove the problematic records from the current BETY dump (@robkooper?). Here are the duplicate records, grouped by PFT and variable, and sorted within group by age (oldest updated first); a possible cleanup sketch follows the table.

# A tibble: 8 x 7
# Groups:   variable_id, pft_id [4]
      pft_id variable_id  id.priors id.pfts_priors distn   parama paramb
1         16           7        189            597 unif     0     100
2         16           7        197     1000001394 gamma    1.5     0.4
3 1000000122          41        139     1000001711 unif     0      10
4 1000000122          41 1000000321     1000001698 unif     0.001   1
5 1000000130  1000000012 1000000123     1000002030 norm  6600      10
6 1000000130  1000000012 1000000167     1000002020 norm  3800      10
7 1000000130  1000000015 1000000310     1000002044 norm     0.86    0.025
8 1000000130  1000000015 1000000311     1000002024 norm     1.1     0.025
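
A hedged sketch of one possible cleanup: delete the pfts_priors rows (the associations, not the priors themselves) for whichever prior in each pair is unwanted. The ids below are the id.pfts_priors values for the oldest-updated record in each group above; keeping the newer prior is an assumption, not a recommendation, so each pair should be reviewed first.

# Hedged sketch: drop the older PFT-prior association in each duplicate
# pair. Ids are id.pfts_priors values from rows 1, 3, 5, and 7 above.
# This deletes only the association rows in pfts_priors; the prior
# records themselves are untouched. Review each pair before running.
library(DBI)
dbExecute(con, paste(
  "DELETE FROM pfts_priors",
  "WHERE id IN (597, 1000001711, 1000002030, 1000002044)"
))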

More generally, we should figure out a way to keep these records from creeping into the database again. Unfortunately, it looks like there is no way to impose a PostgreSQL constraint across tables, which is what would be required here: the uniqueness of (pft_id, variable_id) depends on a join between pfts_priors and priors. Perhaps a short-term solution is to periodically run some external validation of BETY (possibly based on my dplyr code above) that makes sure duplicates don't exist in this table; a sketch of such a check follows.
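
A minimal sketch of what that periodic check could look like, wrapping the query above in a function that fails loudly if duplicates appear. The function name check_duplicate_priors is hypothetical; it assumes `con` is an open DBI connection to BETY.

# Minimal sketch of an external validation step (hypothetical name).
library(dplyr)

check_duplicate_priors <- function(con) {
  # Same join as above, pushed to the database, keeping only
  # PFT-variable combinations with more than one prior
  dups <- tbl(con, "priors") %>%
    inner_join(tbl(con, "pfts_priors"),
               by = c("id" = "prior_id"),
               suffix = c(".priors", ".pfts_priors")) %>%
    count(pft_id, variable_id) %>%
    filter(n > 1) %>%
    collect()
  if (nrow(dups) > 0) {
    stop("Found ", nrow(dups),
         " PFT-variable combinations with multiple priors")
  }
  invisible(dups)
}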

Related to #185.

ashiklom · Feb 22 '19 19:02