soilDB
soilDB copied to clipboard
problem with uncode()
The following line doesn't work when the ChoiceName != ChoiceLabel (e.g. nhiforsoigrp column in the mapunit table), which is the case for ~25% of the metadata lookup table. In these instances, line 118 converts the values to NA because they don't match the levels in the other factor. This is complicated by the fact that SOME but not ALL of factor levels maybe different, thus the preceding if statements logic overwrites SOME but not ALL of them to NA. I haven't fully determined what the best fix would be, but wanted to log this before I logged off for the day. Values from SDA and GDB (need to check LIMS) appear to return values coded as ChoiceLabel, thus it may not be necessary to use an if statement on line 116.
Cheers
Can you provide a sample dataset / query / function call where this is an issue?
The current version of uncode()
should not require ChoiceName == ChoiceLabel, in general. That is why changes were made to allow for both name and label values to be used in coding/uncoding. I thought it was working but if there is an issue with something getting uncoded incorrectly then we can definitely sort that out.
I think this mostly works... but I see what you mean about if you code()
a ChoiceLabel
and then uncode()
those coded values, you get ChoiceName
.
library(soilDB)
x <- SDA_query("SELECT DISTINCT nhiforsoigrp FROM mapunit")
#> single result set, returning a data.frame
x
#> nhiforsoigrp
#> 1 Group IB
#> 2 Group IIB
#> 3 Group IA
#> 4 <NA>
#> 5 Group IIA
#> 6 Group IC
#> 7 NC
x2 <- code(x)
x2
#> nhiforsoigrp
#> 1 2
#> 2 5
#> 3 1
#> 4 NA
#> 5 4
#> 6 3
#> 7 6
x3 <- uncode(x2)
x3
#> nhiforsoigrp
#> 1 IB
#> 2 IIB
#> 3 IA
#> 4 <NA>
#> 5 IIA
#> 6 IC
#> 7 NC
test <- SDA_query("SELECT TOP 10000 * FROM mapunit WHERE nhiforsoigrp IS NOT NULL") test2 <- uncode(test) table(test$nhiforsoigrp) test2 <- uncode(test[-19]) uncode(test["nhiforsoigrp"])
I think you are right that the logic should be simplified at least for the invert=FALSE
case... and as you show there is an issue when combining two different sets of factor levels that partially overlap. It creates NA except in the simplest cases (depends on the input data)
I took a crack at it and it seems to work for the following case (a mix of mapunits from NH and MA).
Before change I get:
library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit
INNER JOIN legend ON legend.lkey = mapunit.lkey
WHERE areasymbol LIKE 'NH%' OR areasymbol LIKE 'MA%'")
#> single result set, returning a data.frame
sum(is.na(test$nhiforsoigrp))
#> [1] 1582
test2 <- uncode(test)
#> Warning in `[<-.factor`(`*tmp*`, is.na(nc), value = structure(c(NA, NA, :
#> invalid factor level, NA generated
table(test2$nhiforsoigrp, useNA = "ifany")
#>
#> NC <NA>
#> 181 3273
After a change (c6f0f566) I get:
library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit
INNER JOIN legend ON legend.lkey = mapunit.lkey
WHERE areasymbol LIKE 'NH%' OR areasymbol LIKE 'MA%'")
#> single result set, returning a data.frame
sum(is.na(test$nhiforsoigrp))
#> [1] 1582
test2 <- uncode(test)
table(test2$nhiforsoigrp, useNA = "ifany")
#>
#> Group IA Group IB Group IC Group IIA Group IIB NC <NA>
#> 857 268 210 192 164 181 1582
I'll leave this issue open, please make/suggest any additional changes you see fit as long as existing tests of uncode/code pass. If you have certain things you want to become the normal expectation for these functions, add to tests/test-uncode.R.
I think the main thing is we wanted out of the recent changes (source of this problem) was for code()
to work for heterogeneous inputs (a mixture of ChoiceLabel and ChoiceName) ... but as you point out here the same behavior does not make sense/is not needed for uncode()
and can cause problems when the logic in that if statement fails (for example with overlap in names and labels)
This issue also alerted me to a problem with some of the new functions to help with explicitly setting factor levels. Again the overlap between ChoiceName and ChoiceLabel causes issues, thanks for bringing this up.
Before change:
library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit WHERE nhiforsoigrp IS NOT NULL")
#> single result set, returning a data.frame
x1 <- factor(test$nhiforsoigrp, levels = get_NASIS_column_metadata("nhiforsoigrp")$ChoiceLabel)
x2 <- NASISChoiceList(list(nhiforsoigrp=test$nhiforsoigrp), choice = "ChoiceLabel")
#> Error in na.omit(as.numeric(apply(do.call("cbind", lapply(y[c("ChoiceValue", : 'list' object cannot be coerced to type 'double'
all.equal(x1, x2)
#> Error in all.equal.factor(x1, x2): object 'x2' not found
After change (2985b374):
library(soilDB)
test <- SDA_query("SELECT TOP 10000 * FROM mapunit WHERE nhiforsoigrp IS NOT NULL")
#> single result set, returning a data.frame
x1 <- factor(test$nhiforsoigrp, levels = get_NASIS_column_metadata("nhiforsoigrp")$ChoiceLabel)
x2 <- NASISChoiceList(list(nhiforsoigrp=test$nhiforsoigrp), choice = "ChoiceLabel")
all.equal(x1, x2)
#> [1] TRUE
I am going to close this issue since I think the problems that were found have been fixed for some time. Create a new issue if more bugs or unexpected results are found.