aqp segment needs tests

This functions should have tests for expected output given:

[x] data.frame input
[x] SoilProfileCollection input
[ ] missing or illogical horizon depths
[x] same results as slab()

Jan 14 '21 19:01 dylanbeaudette

OK, I took a look at number 3 in your list and found 3 more things that need addressing.

[ ] cannot segment a single profile if the interval contains missing data (gaps)
[ ] output ID order does not match input order or input profiles
[ ] rebuild SoilProfileCollection using depths<- and JOIN methods -- not replaceHorizons<-

Here is a reprex, which is also stored in misc/testing_suite https://github.com/ncss-tech/aqp/blob/5ea7b827cbaf8794cceba28f9a8bb400c395158f/misc/test_suite/segment-reprex.R

Once the correct behavior is ironed out, they can easily become new tests.

library(aqp, warn.conflicts = FALSE)
#> This is aqp 1.27
library(testthat)
#> Warning: package 'testthat' was built under R version 4.0.3

data(jacobs2000)

# use glom to create some fake missing data (REMOVE 50-100cm zone)
input.spc <- glom(jacobs2000[4,], 50, 100, 
                 truncate = TRUE, invert = TRUE)

##################################
### .:. # 5 cannot segment a single profile if the interval contains missing data (gaps)
##################################
# Error: the 50:100 zone is missing; and only one profile
(output.spc <- segment(input.spc, 50:100))
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 0, 1

# Error: even with data above and below 50/150 -- single profile
(output.spc <- segment(input.spc, 25:150))
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 0, 1

# Error: just 1 cm of missing data in single profile case breaks it
(output.spc <- segment(input.spc, 0:51))
#> Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 0, 1

# Works
segment(input.spc, 0:50)
#> SoilProfileCollection with 1 profiles and 50 horizons
#> profile ID: id  |  horizon ID: hzID 
#> Depth range: 50 - 50 cm
#> 
#> ----- Horizons (6 / 50 rows  |  10 / 19 columns) -----
#>    id hzID top bottom name sand clay matrix_color time_saturated
#>  92-4    1   0      1   Ap   88    2    #83796FFF              0
#>  92-4    2   1      2   Ap   88    2    #83796FFF              0
#>  92-4    3   2      3   Ap   88    2    #83796FFF              0
#>  92-4    4   3      4   Ap   88    2    #83796FFF              0
#>  92-4    5   4      5   Ap   88    2    #83796FFF              0
#>  92-4    6   5      6   Ap   88    2    #83796FFF              0
#>  matrix_color_munsell
#>              10YR 5/1
#>              10YR 5/1
#>              10YR 5/1
#>              10YR 5/1
#>              10YR 5/1
#>              10YR 5/1
#> [... more horizons ...]
#> 
#> ----- Sites (1 / 1 rows  |  1 / 1 columns) -----
#>    id
#>  92-4
#> 
#> Spatial Data:
#>      [,1]
#> [1,]   NA
#> [1] NA

# Add another profile
input.spc <- combine(input.spc, jacobs2000[7,])

# Works
output.spc <- segment(input.spc, 50:100)


##################################
### .:. #6 output ID order does not match input order or input profiles
##################################

# _NEEDS_ FIX: not only does it have same IDs... it appears to change the order of inputs
(all(profile_id(input.spc) == profile_id(output.spc))) # FALSE
#> [1] FALSE

##################################
### .:. #7 does not provide new IDs for re-sampled profiles
##################################

# Error: Cannot combine because the transformed SPC has the same ID 
#  (this is a feature not a bug, resampled horizons != original data)
(show.spc <- combine(input.spc, output.spc))
#> Error in pbindlist(objects): non-unique profile IDs detected

# Inspect the default segment ID output -- this is wrong
par(mfrow=c(1,2),mar=c(0,0,2,1))
plot(input.spc, max.depth=200, color="matrix_color", plot.depth.axis=FALSE)
plot(output.spc, max.depth=200, color="matrix_color", name=NA)

# # segment should assign new profile IDs automatically -- like slice, permute_profile, etc
# # the correct IDs are something like thisthis:
output.spc2 <- output.spc

fix.idx <- match(profile_id(output.spc), profile_id(input.spc))
profile_id(output.spc2) <- paste0(profile_id(input.spc)[fix.idx], "segmented")

par(mfrow=c(1,2),mar=c(0,0,2,1))
plot(input.spc, max.depth=200, color="matrix_color", plot.depth.axis=FALSE)
plot(output.spc2, max.depth=200, color="matrix_color", name=NA)

Note: that I am deliberately accounting for reversed order w/ match / fix.idx

This is not a solution, but shows why ID, and SoilProfileCollection integrity is important.
- Why important? Numeric (site and horizon/slice) indices or positions are used to subset all the time
  - if they are shifting around order in between calls to functions because of missing data that is not good
  - to unexpected problems -- mostly outside the SoilProfileCollection and in the user's plots and code

Suggestions:

Under normal circumstances, it is impossible to make an SPC like the segment output from the base data

This is because of the forced ordering that depths<- does when starting with the horizon data

Temporary solutions: - rebuilding the SPC "fixes" the apparent issue in the algorithm - not an actual fix -- the re-ordering of profiles should be addressed in algorithm

output.spc3 <- rebuildSPC(output.spc)
#> Warning: Horizon top depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()
#> Warning: Horizon bottom depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()
#> using `hzID` as a unique horizon ID

par(mfrow=c(1,2),mar=c(0,0,2,1))
plot(input.spc, max.depth=200, color="matrix_color")
plot(output.spc3, max.depth=200, color="matrix_color", divide.hz=FALSE, name=NA)

Long term solution:
- In long term we should not calling rebuildSPC inside a function unless there is some intractable error. - we do provide one instance of this in pbindlist merging non-conformal (different source/columns) SPCs

- `segment` should build a SPC using conventional means (`depths<-`)
 - by building the SPC with depths, you avoid the problem by writing your algorithm on its own
 - let the SoilProfileCollection object handle doing the proper ordering.
   - This is why `replaceHorizons(object) <- h` is dangerous.

Opinion: - I know that I advised of this replaceHorizons issue some time ago. The old "replace" behavior of horizons() should have never existed, at least not without the proper validity checks in place, which was not the case until very recently (early/mid summer 2020). I have second thoughts about exporting replaceHorizons at all because of problems exactly like this.

Jan 20 '21 21:01 brownag

I wanted to use segment for an example I was working on -- and I came up with this example which is a more concise demo of the phenomenon I showed above, and shows why rebuildSPC() is currently "needed"

library(aqp, warn.conflicts=FALSE)
#> This is aqp 1.27

data(sp4)

depths(sp4) <- id ~ top + bottom

# does not work with unit-length segment; needs informative error
segment(sp4, 20)
#> Error in `$<-.data.frame`(`*tmp*`, "id", value = "-"): replacement has 1 row, data has 0

# this appears to work correctly, and rebuildSPC warns about the fact that NAPA and others have no data / NA depths
segment(sp4, 20:21)
#> Warning: Horizon top depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()
#> Warning: Horizon bottom depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()
#> SoilProfileCollection with 10 profiles and 10 horizons
#> profile ID: id  |  horizon ID: hzID 
#> Depth range: 21 - 21 cm
#> 
#> ----- Horizons (6 / 10 rows  |  10 / 15 columns) -----
#>         id hzID top bottom name   K   Mg  Ca CEC_7 ex_Ca_to_Mg
#>     colusa    1  20     21  Bt1 0.1 23.2 1.9  23.7        0.08
#>      glenn    2  20     21   Bt 0.3 18.9 4.5  27.5        0.20
#>      kings    3  20     21  Bt2 0.8 17.7 4.4  20.0        0.25
#>   mariposa    4  20     21  Bt2 0.3 44.3 6.2  34.1        0.14
#>  mendocino    5  20     21  Bt2 0.2 30.5 3.7  22.9        0.12
#>       napa    8  NA     NA <NA>  NA   NA  NA    NA          NA
#> [... more horizons ...]
#> 
#> ----- Sites (6 / 10 rows  |  1 / 1 columns) -----
#>         id
#>     colusa
#>      glenn
#>      kings
#>   mariposa
#>  mendocino
#>       napa
#> [... more sites ...]
#> 
#> Spatial Data: [EMPTY]

par(mar=c(0,0,2,1))
plot(sp4, color="clay")
abline(h = 20, lty=2, lwd=2)


checkHzDepthLogic(sp4)
#>                id valid depthLogic sameDepth missingDepth overlapOrGap
#> 1          colusa  TRUE      FALSE     FALSE        FALSE        FALSE
#> 2           glenn  TRUE      FALSE     FALSE        FALSE        FALSE
#> 3           kings  TRUE      FALSE     FALSE        FALSE        FALSE
#> 4        mariposa  TRUE      FALSE     FALSE        FALSE        FALSE
#> 5       mendocino  TRUE      FALSE     FALSE        FALSE        FALSE
#> 6            napa  TRUE      FALSE     FALSE        FALSE        FALSE
#> 7      san benito  TRUE      FALSE     FALSE        FALSE        FALSE
#> 8          shasta  TRUE      FALSE     FALSE        FALSE        FALSE
#> 9  shasta-trinity  TRUE      FALSE     FALSE        FALSE        FALSE
#> 10         tehama  TRUE      FALSE     FALSE        FALSE        FALSE

napa, san benito and tehama do not have data in 20 - 21cm segment

checkHzDepthLogic(segment(sp4, 20:21))
#> Warning: Horizon top depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()
#> Warning: Horizon bottom depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()
#>                id valid depthLogic sameDepth missingDepth overlapOrGap
#> 1          colusa  TRUE      FALSE     FALSE        FALSE        FALSE
#> 2           glenn  TRUE      FALSE     FALSE        FALSE        FALSE
#> 3           kings  TRUE      FALSE     FALSE        FALSE        FALSE
#> 4        mariposa  TRUE      FALSE     FALSE        FALSE        FALSE
#> 5       mendocino  TRUE      FALSE     FALSE        FALSE        FALSE
#> 6            napa FALSE      FALSE     FALSE         TRUE        FALSE
#> 7      san benito FALSE      FALSE     FALSE         TRUE        FALSE
#> 8          shasta  TRUE      FALSE     FALSE        FALSE        FALSE
#> 9  shasta-trinity  TRUE      FALSE     FALSE        FALSE        FALSE
#> 10         tehama FALSE      FALSE     FALSE         TRUE        FALSE

Jan 31 '21 19:01 brownag

Updated example, was curious if the various horizon subsetting approaches give the same results. This is not an extensive test and includes none of the common "problem" soil profiles.

The issue name wasn't well specified--there are a lot of questions, updates, fixes, and unresolved issues such as "what should be returned when asking for a depth range without data?".

I'm not sure if glom(..., fill = TRUE) is working as expected (dropped profiles?).

Anyway, I'm going to start a vignette that explores the various ways one might subset and aggregate soil profiles. Perhaps by outlining several use cases and examples, we can better help ourselves and others understand the various approaches.

library(aqp)


# example SPC from data.frame
data(sp4)
depths(sp4) <- id ~ top + bottom
hzdesgnname(sp4) <- 'name'


# test effect of potential sorting alpha vs. numeric
# all seem to work


# profile_id(sp4) <- as.character(1:length(sp4))

profile_id(sp4) <- sprintf("%0.2d", 1:length(sp4))

# profile_id(sp4) <- letters[1:length(sp4)]


testIt <- function(spc, top, bottom) {
  
  # keep old ID
  spc$.oldID <- profile_id(spc)
  
  # various approaches
  
  # dice() fills missing depth intervals with NA
  # default when given a formula
  .fm <- as.formula(sprintf("%s:%s ~ .", top, bottom - 1))
  d <- dice(spc, fm = .fm)
  
  # force filling missing depth intervals with NA
  g <- glom(spc, z1 = top, z2 = bottom, fill = TRUE)
  gt <- glom(spc, z1 = top, z2 = bottom, truncate = TRUE, fill = TRUE)
  
  # single NA horizon, with NA depths
  st <- hz_segment(spc, intervals = c(top, bottom))
  s <- hz_segment(spc, intervals = c(top, bottom), trim = FALSE)
  
  
  # normalize profile IDs
  # so that all can be combined / viewed together in a single SPC
  profile_id(d) <- sprintf("%s\nD", profile_id(d))
  profile_id(g) <- sprintf("%s\nG", profile_id(g))
  profile_id(gt) <- sprintf("%s\nGT", profile_id(gt))
  profile_id(s) <- sprintf("%s\nS", profile_id(s))
  profile_id(st) <- sprintf("%s\nST", profile_id(st))
  profile_id(spc) <- sprintf("%s\n", profile_id(spc))
  
  x <- combine(spc, d, g, gt, s, st)
  
  par(mar = c(0, 0, 3, 0))
  
  plotSPC(x, color = 'CEC_7', name.style = 'center-center', width = 0.4, id.style = 'top', col.palette = hcl.colors(25, palette = 'viridis'), depth.axis = list(line = -5))
  
  segments(x0 = 0, x1 = length(x) + 1, y0 = c(top, bottom), y1 = c(top, bottom), lwd = 2, lty = 3, col = 'red')
  
  invisible(x)
  
}

testIt(sp4, top = 0, bottom = 25)
# check for all-NA horizons, and equal number of sites / original ID
table(a$.oldID)

a <- testIt(sp4, top = 15, bottom = 35)
# check for all-NA horizons, and equal number of sites / original ID
table(a$.oldID)

a <- testIt(sp4, top = 20, bottom = 21)
# check for all-NA horizons, and equal number of sites / original ID
table(a$.oldID)

a <- testIt(sp4, top = 50, bottom = 60)
# check for all-NA horizons, and equal number of sites / original ID
table(a$.oldID)

May 31 '24 23:05 dylanbeaudette

I'm not sure if glom(..., fill = TRUE) is working as expected (dropped profiles?).

You are right, that is a bug in the case where no profiles have horizons in the specified interval, should be fixed in f25246d5 and added test in 3684b12

Anyway, I'm going to start a vignette that explores the various ways one might subset and aggregate soil profiles. Perhaps by outlining several use cases and examples, we can better help ourselves and others understand the various approaches.

Sounds good!

May 31 '24 23:05 brownag

Cool thanks. Do we have a sample dataset with each of the typical hz logic errors? That might be really handy for tests.

Jun 04 '24 15:06 dylanbeaudette

aqp aqp copied to clipboard

segment needs tests

aqp
aqp copied to clipboard