wqbc icon indicating copy to clipboard operation
wqbc copied to clipboard

`summarize_wqdata()` function throws with some datasets

Open aylapear opened this issue 3 years ago • 9 comments

Two examples using the same EMS_ID/Station but different parameters/variables and in one case the function summarize_wqdata() works and provides a summary table while in the other case the table throws an error

Example where it works properly
data_works <- tibble::tibble(
            EMS_ID = c("0200016", "0200016", "0200016"),
           Station = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93"),
          Variable = c("Nitrogen Total", "Nitrogen Total", "Nitrogen Total"),
              Code = c("0114", "0114", "0114"),
             Value = c(0.844, 0.949, 0.754),
             Units = c("mg/L", "mg/L", "mg/L"),
    DetectionLimit = c(0.03, 0.03, 0.03),
      ResultLetter = c(NA, NA, NA),
              Date = c("2021-11-07", "2021-11-21", "2021-12-05"),
           Outlier = c(FALSE, FALSE, FALSE),
      Site_Renamed = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93"),
       UPPER_DEPTH = as.factor(c(NA, NA, NA)),
          Detected = as.factor(c("TRUE", "TRUE", "TRUE")),
         Timeframe = as.factor(c("2021", "2021", "2021"))
)


wqbc::summarise_wqdata(
  data_works,
  by = c("EMS_ID"),
  censored = TRUE,
  na.rm = TRUE
)

Output

# A tibble: 1 × 14
  Variable       EMS_ID      n  ncen   min   max  mean median lowerQ upperQ     sd     se lowerCL upperCL
  <chr>          <chr>   <int> <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1 Nitrogen Total 0200016     3     0 0.754 0.949 0.849  0.845  0.793  0.901 0.0799 0.0461   0.763   0.944
Example where it fails
data_fails <- tibble::tibble(
            EMS_ID = c("0200016", "0200016", "0200016", "0200016", "0200016"),
           Station = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93"),
          Variable = c("Aluminum Total",
                       "Aluminum Total","Aluminum Total","Aluminum Total","Aluminum Total"),
              Code = c("AL-T", "AL-T", "AL-T", "AL-T", "AL-T"),
             Value = c(0.031, 0.0192, 0.397, 0.0183, 0.1),
             Units = c("mg/L", "mg/L", "mg/L", "mg/L", "mg/L"),
    DetectionLimit = c(0.5, 0.5, 0.5, 0.5, 0.5),
      ResultLetter = c(NA, NA, NA, NA, NA),
              Date = c("2020-01-05","2020-01-27",
                       "2020-02-02","2020-02-17","2020-03-01"),
           Outlier = c(FALSE, FALSE, FALSE, FALSE, FALSE),
      Site_Renamed = c("ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93",
                       "ELK RIVER ABOVE HIGHWAY 93"),
       UPPER_DEPTH = as.factor(c(NA, NA, NA, NA, NA)),
          Detected = as.factor(c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE")),
         Timeframe = as.factor(c("2020", "2020", "2020", "2020", "2020"))
)

wqbc::summarise_wqdata(
  data_fails,
  by = c("EMS_ID"),
  censored = TRUE,
  na.rm = TRUE
)

Output

Error in names(ret) <- c("mean", "se", LCL(x), UCL(x)) : 
  'names' attribute [4] must be the same length as the vector [0]
In addition: Warning message:
In survreg.fit(X, Y, weights, offset, init = init, controlvals = control,  :
  Ran out of iterations and did not converge
Backtrace:
     ▆
  1. └─wqbc::summarise_wqdata(...)
  2.   └─plyr::ddply(...)
  3.     └─plyr::ldply(...)
  4.       └─plyr::llply(...)
  5.         ├─plyr:::loop_apply(n, do.ply)
  6.         └─plyr `<fn>`(1L)
  7.           └─wqbc .fun(piece, ...)
  8.             ├─base::mean(ml)
  9.             └─NADA::mean(ml)
 10.               └─NADA .local(x, ...)

aylapear avatar Jan 25 '22 02:01 aylapear

@joethorley

aylapear avatar Jan 25 '22 02:01 aylapear

thanks @aylapear - I'll look into

joethorley avatar Jan 25 '22 04:01 joethorley

@joethorley do you remember if this is still ongoing? if so, perhaps something to examine what the dependencies are and what's worth updating in it's current format.

HeatherGranger avatar Nov 28 '22 22:11 HeatherGranger

The cause of this is error when all the data points are censored, specifically when the Censored = TRUE for every row.

# real data output
EMS_ID                    Station       Variable Code Value Units DetectionLimit ResultLetter       Date
1 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-01-05
2 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-01-27
3 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-02-02
4 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-02-17
5 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-03-01
  Outlier               Site_Renamed UPPER_DEPTH Detected Timeframe Censored
1   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
2   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
3   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
4   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
5   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE

Example to show which code is throwing error

# this throws error
df <- tibble(
  Value = c(0.844, 0.949, 0.754),
  Censored = c(TRUE, TRUE, TRUE)
)

ml <- with(
  df,
  cenmle(
    Value,
    Censored,
    dist = "lognormal",
    conf.int = 0.95
  )
)
ml

est <- mean(ml)
# As long as one censored value is false it will run
df <- tibble(
  Value = c(0.844, 0.949, 0.754),
  Censored = c(FALSE, TRUE, TRUE)
)

ml <- with(
  df,
  cenmle(
    Value,
    Censored,
    dist = "lognormal",
    conf.int = 0.95
  )
)
ml

est <- mean(ml)

aylapear avatar Dec 18 '22 23:12 aylapear

This error is caused by the internal function summarise_wqdata_by() in the summaries-wqdata.R file

aylapear avatar Dec 18 '22 23:12 aylapear

It also fails on a single value

> df <- tibble(
+   Value = c(0.011),
+   Censored = c(TRUE)
+ )
> ml <- with(
+   df,
+   cenmle(
+     Value,
+     Censored,
+     dist = "lognormal",
+     conf.int = 0.95
+   )
+ )
> ml
Error in exp(x@survreg$coef[1]) : 
  non-numeric argument to mathematical function
# fails
df <- tibble(
  Value = c(0.011),
  Censored = c(FALSE)
)
# passes when two values even if one is censored 
df <- tibble(
  Value = c(0.011, 0.0001),
  Censored = c(FALSE, TRUE)
)

aylapear avatar Dec 19 '22 00:12 aylapear

In these edge cases when either all values are censored or only a single value is given the function should return NA' s instead of an error.

aylapear avatar Dec 19 '22 17:12 aylapear

This would then generate this table when the site that has no data display NA's instead of throwing an error.

# A tibble: 2 × 14
  Variable       EMS_ID      n  ncen   min   max  mean median lowerQ upperQ     sd     se lowerCL upperCL
  <chr>          <chr>   <int> <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1 Nitrogen Total 0200016     3     0 0.754 0.949 0.849  0.845  0.793  0.901 0.0799 0.0461   0.763   0.944
1 Nitrogen Total 0478416     NA          NA      NA      NA       NA       NA      NA        NA       NA        NA        NA

aylapear avatar Dec 19 '22 18:12 aylapear

I agree - and when no data it should return a table with the same columns and classes and no rows.

joethorley avatar Dec 19 '22 18:12 joethorley