
`check_outliers`: Incorrect zscore values when using vector instead of dataframe

rempsyc opened this issue on Sep 13, 2022 • 0 comments

TL;DR: check_outliers currently provides incorrect zscore values when using a vector instead of a dataframe.

Optional reprex below:


library(performance)
packageVersion("performance")
#> [1] '0.9.2.2'

Let’s get the current zscore distance:

x <- as.data.frame(mtcars$mpg)
z <- check_outliers(x, method = "zscore", threshold = 1)
z.att <- attributes(z)
d1 <- z.att$data$Distance_Zscore

Compare this to the underlying calculation in the function, which provides continuous scores:

d2 <- abs(as.data.frame(sapply(x, function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE))))

# Comparison
cbind(d1, d2, d1 == d2)
#>          d1 mtcars$mpg mtcars$mpg
#> 1  1.000000 0.15088482      FALSE
#> 2  1.000000 0.15088482      FALSE
#> 3  1.000000 0.44954345      FALSE
#> 4  1.000000 0.21725341      FALSE
#> 5  1.000000 0.23073453      FALSE
#> 6  1.000000 0.33028740      FALSE
#> 7  1.000000 0.96078893      FALSE
#> 8  1.000000 0.71501778      FALSE
#> 9  1.000000 0.44954345      FALSE
#> 10 1.000000 0.14777380      FALSE
#> 11 1.000000 0.38006384      FALSE
#> 12 1.000000 0.61235388      FALSE
#> 13 1.000000 0.46302456      FALSE
#> 14 1.000000 0.81145962      FALSE
#> 15 1.607883 1.60788262       TRUE
#> 16 1.607883 1.60788262       TRUE
#> 17 1.000000 0.89442035      FALSE
#> 18 2.042389 2.04238943       TRUE
#> 19 1.710547 1.71054652       TRUE
#> 20 2.291272 2.29127162       TRUE
#> 21 1.000000 0.23384555      FALSE
#> 22 1.000000 0.76168319      FALSE
#> 23 1.000000 0.81145962      FALSE
#> 24 1.126710 1.12671039       TRUE
#> 25 1.000000 0.14777380      FALSE
#> 26 1.196190 1.19619000       TRUE
#> 27 1.000000 0.98049211      FALSE
#> 28 1.710547 1.71054652       TRUE
#> 29 1.000000 0.71190675      FALSE
#> 30 1.000000 0.06481307      FALSE
#> 31 1.000000 0.84464392      FALSE
#> 32 1.000000 0.21725341      FALSE
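As a sanity check, the manual standardization behind d2 matches base R's scale() (with default centering and scaling), so the discrepancy is not in d2 itself:

```r
# The manual (x - mean) / sd standardization used for d2 is equivalent
# to base R's scale() with defaults (center = TRUE, scale = TRUE).
manual <- abs((mtcars$mpg - mean(mtcars$mpg)) / stats::sd(mtcars$mpg))
via_scale <- as.vector(abs(scale(mtcars$mpg)))  # drop scale()'s attributes
all.equal(manual, via_scale)
#> [1] TRUE
```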

Values under 1 are somehow converted to exactly 1. How is this possible?

It seems to be an artifact of the column-aggregation step that derives an overall distance score as the max across columns. When a single column is provided, it does not produce the expected result (this is the current code):

Distance_Zscore <- sapply(as.data.frame(t(d2)), max, na.omit = TRUE, na.rm = TRUE)
Distance_Zscore
#>       V1       V2       V3       V4       V5       V6       V7       V8 
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 
#>       V9      V10      V11      V12      V13      V14      V15      V16 
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.607883 1.607883 
#>      V17      V18      V19      V20      V21      V22      V23      V24 
#> 1.000000 2.042389 1.710547 2.291272 1.000000 1.000000 1.000000 1.126710 
#>      V25      V26      V27      V28      V29      V30      V31      V32 
#> 1.000000 1.196190 1.000000 1.710547 1.000000 1.000000 1.000000 1.000000
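The specific mechanism appears to be the misspelled `na.omit = TRUE` argument: `max()` only has formals `(..., na.rm)`, and arguments after `...` must be matched exactly, so `na.omit = TRUE` is swallowed by `...`. The logical `TRUE` is then coerced to 1 inside the comparison, flooring every distance below 1 at exactly 1. A minimal sketch:

```r
# max() has formals (..., na.rm) only; the misspelled na.omit = TRUE
# lands in ... and TRUE is coerced to 1 inside the comparison.
max(0.151, na.omit = TRUE, na.rm = TRUE)
#> [1] 1
max(2.291, na.omit = TRUE, na.rm = TRUE)
#> [1] 2.291
```

This matches the pattern above: only rows whose true distance exceeds 1 survive unchanged.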

We can instead use the following strategy, already used elsewhere in check_outliers:

Distance_Zscore <- sapply(as.data.frame(t(d2)), function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
Distance_Zscore
#>         V1         V2         V3         V4         V5         V6         V7 
#> 0.15088482 0.15088482 0.44954345 0.21725341 0.23073453 0.33028740 0.96078893 
#>         V8         V9        V10        V11        V12        V13        V14 
#> 0.71501778 0.44954345 0.14777380 0.38006384 0.61235388 0.46302456 0.81145962 
#>        V15        V16        V17        V18        V19        V20        V21 
#> 1.60788262 1.60788262 0.89442035 2.04238943 1.71054652 2.29127162 0.23384555 
#>        V22        V23        V24        V25        V26        V27        V28 
#> 0.76168319 0.81145962 1.12671039 0.14777380 1.19619000 0.98049211 1.71054652 
#>        V29        V30        V31        V32 
#> 0.71190675 0.06481307 0.84464392 0.21725341

Or, if we want to avoid sapply and transposition:

Distance_Zscore <- apply(d2, 1, function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
Distance_Zscore
#>  [1] 0.15088482 0.15088482 0.44954345 0.21725341 0.23073453 0.33028740
#>  [7] 0.96078893 0.71501778 0.44954345 0.14777380 0.38006384 0.61235388
#> [13] 0.46302456 0.81145962 1.60788262 1.60788262 0.89442035 2.04238943
#> [19] 1.71054652 2.29127162 0.23384555 0.76168319 0.81145962 1.12671039
#> [25] 0.14777380 1.19619000 0.98049211 1.71054652 0.71190675 0.06481307
#> [31] 0.84464392 0.21725341
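The `all(is.na(x))` guard in this strategy matters because `max(na.rm = TRUE)` on an all-NA row returns `-Inf` (with a warning) rather than `NA`. A small sketch of the row helper:

```r
# Row-wise max that returns NA (not -Inf) for all-NA rows.
row_max <- function(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
row_max(c(0.5, NA, 2.1))
#> [1] 2.1
row_max(c(NA, NA))
#> [1] NA
suppressWarnings(max(c(NA_real_, NA_real_), na.rm = TRUE))
#> [1] -Inf
```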

This also works with a full data frame:

x <- as.data.frame(mtcars)
z <- check_outliers(x, method = "zscore", threshold = 1)
z.att <- attributes(z)
d1 <- z.att$data$Distance_Zscore

d <- abs(as.data.frame(sapply(x, function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE))))

d2 <- apply(d, 1, function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})

d3 <- sapply(as.data.frame(t(d)), max, na.omit = TRUE, na.rm = TRUE)

# Comparison
cbind(d1, d2, d3, d1 == d2)
#>           d1       d2       d3  
#> V1  1.189901 1.189901 1.189901 1
#> V2  1.189901 1.189901 1.189901 1
#> V3  1.224858 1.224858 1.224858 1
#> V4  1.122152 1.122152 1.122152 1
#> V5  1.043081 1.043081 1.043081 1
#> V6  1.564608 1.564608 1.564608 1
#> V7  1.433903 1.433903 1.433903 1
#> V8  1.235180 1.235180 1.235180 1
#> V9  2.826755 2.826755 2.826755 1
#> V10 1.116036 1.116036 1.116036 1
#> V11 1.116036 1.116036 1.116036 1
#> V12 1.014882 1.014882 1.014882 1
#> V13 1.014882 1.014882 1.014882 1
#> V14 1.014882 1.014882 1.014882 1
#> V15 2.077505 2.077505 2.077505 1
#> V16 2.255336 2.255336 2.255336 1
#> V17 2.174596 2.174596 2.174596 1
#> V18 2.042389 2.042389 2.042389 1
#> V19 2.493904 2.493904 2.493904 1
#> V20 2.291272 2.291272 2.291272 1
#> V21 1.224858 1.224858 1.224858 1
#> V22 1.564608 1.564608 1.564608 1
#> V23 1.014882 1.014882 1.014882 1
#> V24 1.433903 1.433903 1.433903 1
#> V25 1.365821 1.365821 1.365821 1
#> V26 1.310481 1.310481 1.310481 1
#> V27 1.778928 1.778928 1.778928 1
#> V28 1.778928 1.778928 1.778928 1
#> V29 1.874010 1.874010 1.874010 1
#> V30 1.973440 1.973440 1.973440 1
#> V31 3.211677 3.211677 3.211677 1
#> V32 1.224858 1.224858 1.224858 1

For a full data frame, then, the old and new methods match, even when we recalculate the distance with the new method I am proposing.
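That agreement seems coincidental, though: every row maximum in the standardized mtcars already exceeds 1 (the smallest is 1.014882 above), so the spurious 1 injected by the misspelled na.omit = TRUE argument never wins the max. A quick check:

```r
# Every row maximum in standardized mtcars exceeds 1, so
# max(x, TRUE, na.rm = TRUE) == max(x, na.rm = TRUE) for every row here.
d <- abs(as.data.frame(sapply(mtcars, function(x) (x - mean(x)) / stats::sd(x))))
min(apply(d, 1, max))
#> [1] 1.014882
```

A data frame whose standardized row maxima fell below 1 would expose the bug again, so the multi-column case is not actually safe.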


I propose submitting a PR to correct this as soon as #474 is merged.

Created on 2022-09-13 by the reprex package (v2.0.1)
