`check_outliers`: Incorrect zscore values when using vector instead of dataframe
TL;DR: check_outliers currently provides incorrect zscore values when using a vector instead of a dataframe.
Full reprex below:
```r
library(performance)
packageVersion("performance")
#> [1] '0.9.2.2'
```
Let’s get the current zscore distance:
```r
x <- as.data.frame(mtcars$mpg)
z <- check_outliers(x, method = "zscore", threshold = 1)
z.att <- attributes(z)
d1 <- z.att$data$Distance_Zscore
```
Compare this to the underlying calculation in the function, which provides continuous scores:

```r
d2 <- abs(as.data.frame(sapply(x, function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE))))
```
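Incidentally, the manual standardization can be written more compactly with base R's `scale()` (an equivalent sketch, not the package's actual code; `scale()` centers each column on its mean and divides by its standard deviation, handling `NA`s per column):

```r
x <- as.data.frame(mtcars$mpg)  # same single-column data frame as in the reprex

# scale() returns a matrix of (x - mean) / sd per column, so taking abs()
# gives the same absolute z-scores as the manual sapply() version above
d2_alt <- abs(as.data.frame(scale(x)))
```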
```r
# Comparison
cbind(d1, d2, d1 == d2)
#> d1 mtcars$mpg mtcars$mpg
#> 1 1.000000 0.15088482 FALSE
#> 2 1.000000 0.15088482 FALSE
#> 3 1.000000 0.44954345 FALSE
#> 4 1.000000 0.21725341 FALSE
#> 5 1.000000 0.23073453 FALSE
#> 6 1.000000 0.33028740 FALSE
#> 7 1.000000 0.96078893 FALSE
#> 8 1.000000 0.71501778 FALSE
#> 9 1.000000 0.44954345 FALSE
#> 10 1.000000 0.14777380 FALSE
#> 11 1.000000 0.38006384 FALSE
#> 12 1.000000 0.61235388 FALSE
#> 13 1.000000 0.46302456 FALSE
#> 14 1.000000 0.81145962 FALSE
#> 15 1.607883 1.60788262 TRUE
#> 16 1.607883 1.60788262 TRUE
#> 17 1.000000 0.89442035 FALSE
#> 18 2.042389 2.04238943 TRUE
#> 19 1.710547 1.71054652 TRUE
#> 20 2.291272 2.29127162 TRUE
#> 21 1.000000 0.23384555 FALSE
#> 22 1.000000 0.76168319 FALSE
#> 23 1.000000 0.81145962 FALSE
#> 24 1.126710 1.12671039 TRUE
#> 25 1.000000 0.14777380 FALSE
#> 26 1.196190 1.19619000 TRUE
#> 27 1.000000 0.98049211 FALSE
#> 28 1.710547 1.71054652 TRUE
#> 29 1.000000 0.71190675 FALSE
#> 30 1.000000 0.06481307 FALSE
#> 31 1.000000 0.84464392 FALSE
#> 32 1.000000 0.21725341 FALSE
```
Values under 1 are somehow floored to 1. How is this possible?

The culprit is the column aggregation step that computes the overall distance as the row-wise maximum across columns: it passes `na.omit = TRUE` to `max()`, but `max()` has no `na.omit` argument, so the value is absorbed by `...`, coerced to 1, and included in the comparison. Every row maximum is therefore floored at 1. This is the current code:
```r
Distance_Zscore <- sapply(as.data.frame(t(d2)), max, na.omit = TRUE, na.rm = TRUE)
Distance_Zscore
#> V1 V2 V3 V4 V5 V6 V7 V8
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
#> V9 V10 V11 V12 V13 V14 V15 V16
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.607883 1.607883
#> V17 V18 V19 V20 V21 V22 V23 V24
#> 1.000000 2.042389 1.710547 2.291272 1.000000 1.000000 1.000000 1.126710
#> V25 V26 V27 V28 V29 V30 V31 V32
#> 1.000000 1.196190 1.000000 1.710547 1.000000 1.000000 1.000000 1.000000
```
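The flooring at 1 can be reproduced in isolation: `max()`'s signature is `max(..., na.rm = FALSE)`, so the unrecognized `na.omit = TRUE` in the call above falls into `...`, where `TRUE` is coerced to 1 and competes as an ordinary value:

```r
# na.omit is not an argument of max(); TRUE is treated as one of the
# values to maximize, so it wins whenever the true maximum is below 1
max(0.5, na.omit = TRUE, na.rm = TRUE)  # 1, wrong
max(2.5, na.omit = TRUE, na.rm = TRUE)  # 2.5, coincidentally unaffected
```

This also explains why only values below 1 are affected.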
Instead, we can use the following strategy, already used elsewhere in `check_outliers`:
```r
Distance_Zscore <- sapply(as.data.frame(t(d2)), function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
Distance_Zscore
#> V1 V2 V3 V4 V5 V6 V7
#> 0.15088482 0.15088482 0.44954345 0.21725341 0.23073453 0.33028740 0.96078893
#> V8 V9 V10 V11 V12 V13 V14
#> 0.71501778 0.44954345 0.14777380 0.38006384 0.61235388 0.46302456 0.81145962
#> V15 V16 V17 V18 V19 V20 V21
#> 1.60788262 1.60788262 0.89442035 2.04238943 1.71054652 2.29127162 0.23384555
#> V22 V23 V24 V25 V26 V27 V28
#> 0.76168319 0.81145962 1.12671039 0.14777380 1.19619000 0.98049211 1.71054652
#> V29 V30 V31 V32
#> 0.71190675 0.06481307 0.84464392 0.21725341
```
Or, if we want to avoid sapply and transposition:
```r
Distance_Zscore <- apply(d2, 1, function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
Distance_Zscore
#> [1] 0.15088482 0.15088482 0.44954345 0.21725341 0.23073453 0.33028740
#> [7] 0.96078893 0.71501778 0.44954345 0.14777380 0.38006384 0.61235388
#> [13] 0.46302456 0.81145962 1.60788262 1.60788262 0.89442035 2.04238943
#> [19] 1.71054652 2.29127162 0.23384555 0.76168319 0.81145962 1.12671039
#> [25] 0.14777380 1.19619000 0.98049211 1.71054652 0.71190675 0.06481307
#> [31] 0.84464392 0.21725341
```
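A third option, sketched here only as a possible alternative (not existing package code), is base R's `pmax()`, which computes parallel (row-wise) maxima directly from the columns, with no transposition or row iteration:

```r
# rebuild the absolute z-scores so this snippet is self-contained
x <- as.data.frame(mtcars$mpg)
d2 <- abs(as.data.frame(sapply(x, function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE))))

# a data frame is a list of columns, so c(d2, na.rm = TRUE) builds the
# argument list pmax(col1, col2, ..., na.rm = TRUE)
Distance_Zscore <- do.call(pmax, c(d2, na.rm = TRUE))
```

Note that `pmax(..., na.rm = TRUE)` should still return `NA` for rows where every column is `NA`, matching the `ifelse(all(is.na(x)), NA, ...)` behavior above.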
The fix also works with a full data frame:
```r
x <- as.data.frame(mtcars)
z <- check_outliers(x, method = "zscore", threshold = 1)
z.att <- attributes(z)
d1 <- z.att$data$Distance_Zscore
d <- abs(as.data.frame(sapply(x, function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE))))
d2 <- apply(d, 1, function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
d3 <- sapply(as.data.frame(t(d)), max, na.omit = TRUE, na.rm = TRUE)
# Comparison
cbind(d1, d2, d3, d1 == d2)
#> d1 d2 d3
#> V1 1.189901 1.189901 1.189901 1
#> V2 1.189901 1.189901 1.189901 1
#> V3 1.224858 1.224858 1.224858 1
#> V4 1.122152 1.122152 1.122152 1
#> V5 1.043081 1.043081 1.043081 1
#> V6 1.564608 1.564608 1.564608 1
#> V7 1.433903 1.433903 1.433903 1
#> V8 1.235180 1.235180 1.235180 1
#> V9 2.826755 2.826755 2.826755 1
#> V10 1.116036 1.116036 1.116036 1
#> V11 1.116036 1.116036 1.116036 1
#> V12 1.014882 1.014882 1.014882 1
#> V13 1.014882 1.014882 1.014882 1
#> V14 1.014882 1.014882 1.014882 1
#> V15 2.077505 2.077505 2.077505 1
#> V16 2.255336 2.255336 2.255336 1
#> V17 2.174596 2.174596 2.174596 1
#> V18 2.042389 2.042389 2.042389 1
#> V19 2.493904 2.493904 2.493904 1
#> V20 2.291272 2.291272 2.291272 1
#> V21 1.224858 1.224858 1.224858 1
#> V22 1.564608 1.564608 1.564608 1
#> V23 1.014882 1.014882 1.014882 1
#> V24 1.433903 1.433903 1.433903 1
#> V25 1.365821 1.365821 1.365821 1
#> V26 1.310481 1.310481 1.310481 1
#> V27 1.778928 1.778928 1.778928 1
#> V28 1.778928 1.778928 1.778928 1
#> V29 1.874010 1.874010 1.874010 1
#> V30 1.973440 1.973440 1.973440 1
#> V31 3.211677 3.211677 3.211677 1
#> V32 1.224858 1.224858 1.224858 1
```
For a full data frame, then, the old and new methods match, but only coincidentally: every row maximum here already exceeds 1, so flooring at 1 has no effect. Recalculating the distance with the proposed method gives the same result.
I propose to submit a PR correcting this as soon as #474 is merged.
Created on 2022-09-13 by the reprex package (v2.0.1)