
`check_outliers`: Clarification about thresholds and outlier scores

Open rempsyc opened this issue 1 year ago • 18 comments

This is low priority, but just putting it out there to free up some space in PR #443.


There are essentially four questions relating to thresholds in `check_outliers` and outlier scores:

  • [x] 1. I think the default threshold for the zscore methods (threshold = 1.96) is conservative. I wonder whether we should follow the more conventional 3 deviations instead (conventional according to standards in psychology, that is). I don’t know about the appropriateness of the thresholds of the other methods, however. Perhaps it is intentional that all methods have rather conservative thresholds, to strongly encourage users to mindfully choose their own? If so, clarification here would be appreciated, and that threshold justification might be worth a note in the documentation (maybe not, but it could be helpful for novice users).
  • [x] 2. There are discrepancies between the method thresholds in different places in the code. For example, the iqr threshold is set to 1.96 in the general thresholds definition, but to 1.5 by default in .check_outliers_iqr. Maybe this is not a big deal, but would it be OK to harmonize the thresholds with the primary definition (e.g., in the documentation, but also in .check_outliers_thresholds) so they are all consistent? I feel like that would be less confusing.
  • [x] 3. What is also strange is that the iqr values don’t seem to get bigger than 1 anyway (same for the confidence interval methods): they are either 0 or 1. Is that normal? I would appreciate clarification on the use of thresholds > 1 if their values indeed cannot be greater than 1. To be clear, I am not referring to the average of all columns for those methods, but to when the method is applied to each column (reprex below).
  • [x] 4. For the IQR and CI methods, an average (of all variable outlier scores) is then calculated for the distance (e.g., Distance_IQR), whereas the max is used for the zscore. I wonder if that is considered inconsistent, or if it is intentional because the IQR scores seem to be only 0 or 1 (so is the average more informative than a plain 1?). In any case, I find it a bit strange, since people might expect the same logic to be applied across the different methods, and using the average will be affected by the number of columns (see the small illustration after this list).
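To illustrate that last point with made-up flags (not the package code): averaging binary per-column scores dilutes a single flagged column as the number of columns grows, whereas the max does not.

# Hypothetical 0/1 outlier flags for one observation across 4 vs. 8 columns
flags_4cols <- c(1, 0, 0, 0)
flags_8cols <- c(1, 0, 0, 0, 0, 0, 0, 0)
mean(flags_4cols) # 0.25
mean(flags_8cols) # 0.125 -- the same single flag yields a smaller composite
max(flags_4cols)  # 1
max(flags_8cols)  # 1 -- unaffected by the number of columns

And here is the reprex for question 3, showing that the per-column IQR and CI distance scores only take the values 0 or 1: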
devtools::load_all()
#> ℹ Loading performance

# IQR
fun.iqr <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_iqr(y)
    unique(df$data_iqr$Distance_IQR)
  }) |> unlist()
}

fun.iqr(mtcars)
#>  mpg1  mpg2   cyl  disp   hp1   hp2  drat   wt1   wt2 qsec1 qsec2    vs    am 
#>     0     1     0     0     0     1     0     0     1     0     1     0     0 
#>  gear carb1 carb2 
#>     0     0     1
fun.iqr(iris[-5])
#> Sepal.Length Sepal.Width1 Sepal.Width2 Petal.Length  Petal.Width 
#>            0            0            1            0            0
fun.iqr(airquality)
#>   Ozone1   Ozone2   Ozone3 Solar.R1 Solar.R2    Wind1    Wind2     Temp 
#>        0       NA        1        0       NA        0        1        0 
#>    Month      Day 
#>        0        0

# CI
fun.ci <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_ci(y, method = "ci")
    unique(df$data_ci$Distance_ci)
  }) |> unlist()
}

fun.ci(mtcars)
#>  mpg1  mpg2   cyl disp1 disp2   hp1   hp2 drat1 drat2   wt1   wt2 qsec1 qsec2 
#>     0     1     0     0     1     0     1     0     1     0     1     0     1 
#>    vs    am  gear carb1 carb2 
#>     0     0     0     0     1
fun.ci(iris[-5])
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           0            0           0
#> [2,]            1           1            1           1
fun.ci(airquality)
#>   Ozone1   Ozone2   Ozone3 Solar.R1 Solar.R2 Solar.R3    Wind1    Wind2 
#>        0      NaN        1        0      NaN        1        0        1 
#>    Temp1    Temp2    Month     Day1     Day2 
#>        0        1        0        0        1

# zscore distances can take values other than 0 and 1
fun.zscore <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_zscore(y)
    unique(df$data_zscore$Distance_Zscore)
  }) |> unlist()
}

fun.zscore(mtcars) |> head()
#>     mpg1     mpg2     mpg3     mpg4     mpg5     mpg6 
#> 1.000000 1.626170 2.439254 2.069670 2.716442 1.090273
fun.zscore(iris[-5]) |> head()
#> Sepal.Length1 Sepal.Length2 Sepal.Length3 Sepal.Length4 Sepal.Length5 
#>      1.000000      1.059914      1.156270      1.348982      1.445337 
#> Sepal.Length6 
#>      1.252626
fun.zscore(airquality) |> head()
#>   Ozone1   Ozone2   Ozone3   Ozone4   Ozone5   Ozone6 
#> 1.000000 1.175541 1.059914 3.218284 1.522422 3.989131

Created on 2022-08-16 by the reprex package (v2.0.1)

rempsyc avatar Aug 16 '22 23:08 rempsyc

@bwiernik I'm planning on submitting #443 before this is resolved, but just pinging you in case you have some opinion on this that I could integrate before the merge.

rempsyc avatar Aug 24 '22 14:08 rempsyc

Which do you want me to look at?

bwiernik avatar Aug 24 '22 15:08 bwiernik

All four questions above. But we can start small, with no. 1.

(You can skip the reprex, it was only a demonstration of the point)

rempsyc avatar Aug 24 '22 15:08 rempsyc

Edit: I misunderstood how IQR works. I have fixed the comparison below.

  1. 3 SDs seems appropriate to me. 1.96 will flag way too many. The Tukey 1.5 * IQR threshold used for boxplots is equivalent to 2.7 SDs, so it will flag fewer cases than a 1.96 cut-off for the z score and confidence interval methods, but more than a 3 SD cut-off (see the quick check after the list below). I think ensuring similarity in thresholds across methods is important. I suggest hewing closer to 3 SDs. So, that's 3 for zscore, 1.7 * IQR for IQR, .001 for the p-value-like ones, .999 for the CI-like ones. For Cook's D, it is the median of the implied F distribution, which is a good heuristic that doesn't flag too many influential cases. Pareto is similar. The OPTICS method I don't really know anything about.

So that's:


list(
  zscore = stats::qnorm(p = 1 - 0.001),
  iqr = 1.7,
  ci = 0.999,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  ics = 0.001,
  optics = 2 * ncol(x),
  iforest = 0.001,
  lof = 0.001
)
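
As a quick check on those equivalences (illustrative R only, assuming normally distributed data):

# Position of the upper Tukey fence q75 + k * IQR, in SD units,
# under a standard normal distribution
k_to_sd <- function(k) stats::qnorm(0.75) + k * (stats::qnorm(0.75) - stats::qnorm(0.25))
k_to_sd(1.5) # ~2.70 SDs (the boxplot default)
k_to_sd(1.7) # ~2.97 SDs, i.e., close to 3

# Expected share of normal observations flagged (two-sided) at each cut-off
2 * stats::pnorm(-1.96) # ~5%
2 * stats::pnorm(-3)    # ~0.27%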

What do folks think?

  2. Yes please

3 and 4. That seems like a bug. I think it might be returning a logical for whether the points are outside the threshold, and those are coerced to numbers?

bwiernik avatar Aug 24 '22 15:08 bwiernik

Thanks. For reference, here are the other thresholds in case you'd like me to change any other one:

list(
  zscore = stats::qnorm(p = 1 - 0.025),
  zscore_robust = stats::qnorm(p = 1 - 0.025),
  iqr = 1.5,
  ci = 0.95,
  eti = 0.95,
  hdi = 0.95,
  bci = 0.95,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  mahalanobis_robust = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  ics = 0.025,
  optics = 2 * ncol(x),
  iforest = 0.025,
  lof = 0.025
)

rempsyc avatar Aug 24 '22 15:08 rempsyc

OK, I have resolved no. 3. It is because the threshold is not applied to the distance score as in the other methods, but used in an earlier step to calculate the distance score (which ends up 0 or 1). Therefore, it is working as intended and is not a problem.

That only leaves no. 4.

rempsyc avatar Aug 24 '22 16:08 rempsyc

About no. 4: I think it is not a bug. I will make a reprex shortly to show that it is calculated like this on purpose. However, it does raise the question of whether this is desirable behaviour and what could be changed.

rempsyc avatar Aug 24 '22 16:08 rempsyc

I edited my comment above at the same time you were responding, so I've made some suggested threshold changes there

bwiernik avatar Aug 24 '22 16:08 bwiernik

I'm not really understanding what computation is happening in 4. Can you walk me through it?

bwiernik avatar Aug 24 '22 16:08 bwiernik

Here is the way IQR distances are calculated. Is this a bug or intentional, and should we change it?

x <- mtcars
threshold <- 1.5
method <- "tukey"
d <- data.frame(Row = seq_len(nrow(as.data.frame(x))))

# This is the baddie function!
for (col in seq_len(ncol(as.data.frame(x)))) {
  v <- x[, col]
  if (method == "tukey") {
    iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE)
  } else {
    iqr <- stats::IQR(v, na.rm = TRUE)
  }
  lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold)
  upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold)
  d[names(as.data.frame(x))[col]] <- ifelse(v > upper, 1,
                                            ifelse(v < lower, 1, 0)
  )
}

head(d)
#>   Row mpg cyl disp hp drat wt qsec vs am gear carb
#> 1   1   0   0    0  0    0  0    0  0  0    0    0
#> 2   2   0   0    0  0    0  0    0  0  0    0    0
#> 3   3   0   0    0  0    0  0    0  0  0    0    0
#> 4   4   0   0    0  0    0  0    0  0  0    0    0
#> 5   5   0   0    0  0    0  0    0  0  0    0    0
#> 6   6   0   0    0  0    0  0    0  0  0    0    0

So if we decompose this to understand where 0 and 1 are coming from:

col
#> [1] 11
(v <- x[, col])
#>  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75% 
#>   2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25% 
#>  -1
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75% 
#>   7

# This is the culprit: an ifelse with 1 or 0 as outcome!
ifelse(v > upper, 1, ifelse(v < lower, 1, 0))
#>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

Created on 2022-08-24 by the reprex package (v2.0.1)

The same ifelse logic applies to the CI methods.

rempsyc avatar Aug 24 '22 16:08 rempsyc

Oh, I was misunderstanding the computation/threshold for the IQR. The existing threshold of 1.5 is equivalent to 2.7 SDs. 1.7 would be equivalent to 3 SDs. I edited the post above.

bwiernik avatar Aug 24 '22 17:08 bwiernik

Would it make sense to report the actual percentile as the IQR distance score, regardless of whether method = "tukey" or not, since, in my understanding, only the cut-off values would change? And would dividing the rank by the total number of observations be the right formula? E.g.,

(v <- mtcars$carb)
#>  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
threshold <- 0.2 # note: much lower than the 1.5 default, so this demo flags several cases

Distance_IQR <- rank(v, na.last = "keep") / length(v) # percentile rank (ties get average ranks)

And then we can still identify outliers with the upper and lower boundaries as calculated before:

(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75% 
#>   2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25% 
#> 1.6
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75% 
#> 4.4

Outlier_IQR <- (ifelse(v > upper, 1, ifelse(v < lower, 1, 0)))

out <- data.frame(Distance_IQR, Outlier_IQR)
out
#>    Distance_IQR Outlier_IQR
#> 1      0.796875           0
#> 2      0.796875           0
#> 3      0.125000           1
#> 4      0.125000           1
#> 5      0.390625           0
#> 6      0.125000           1
#> 7      0.796875           0
#> 8      0.390625           0
#> 9      0.390625           0
#> 10     0.796875           0
#> 11     0.796875           0
#> 12     0.593750           0
#> 13     0.593750           0
#> 14     0.593750           0
#> 15     0.796875           0
#> 16     0.796875           0
#> 17     0.796875           0
#> 18     0.125000           1
#> 19     0.390625           0
#> 20     0.125000           1
#> 21     0.125000           1
#> 22     0.390625           0
#> 23     0.390625           0
#> 24     0.796875           0
#> 25     0.390625           0
#> 26     0.125000           1
#> 27     0.390625           0
#> 28     0.390625           0
#> 29     0.796875           0
#> 30     0.968750           1
#> 31     1.000000           1
#> 32     0.390625           0

Then, for the composite score of all columns, we could take the max of all columns (like zscore, etc.) rather than the average (a minimal sketch follows below).

Created on 2022-08-25 by the reprex package (v2.0.1)
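
For illustration, here is a minimal sketch of that composite (hypothetical code, not the package implementation), combining the percentile-rank distances with a row-wise max:

# Per-column percentile-rank distances (ties get average ranks);
# ecdf(v)(v) would be an alternative that handles ties differently
distances <- sapply(mtcars, function(v) rank(v, na.last = "keep") / length(v))

# Row-wise max across columns, mirroring the zscore composite logic
composite <- apply(distances, 1, max, na.rm = TRUE)
head(composite)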

rempsyc avatar Aug 25 '22 18:08 rempsyc

@rempsyc Could you please add a short text to the NEWS file about the changes? I'm not sure I can appropriately summarise your whole work.

strengejacke avatar Aug 25 '22 19:08 strengejacke

I would prefer to use non-thresholded stats for everything, and only use the thresholds when drawing threshold lines on plots or when flagging cases.

bwiernik avatar Aug 27 '22 01:08 bwiernik

> or when flagging cases

Yup, that's the purpose of check_outliers: to identify possible outliers.

DominiqueMakowski avatar Aug 27 '22 01:08 DominiqueMakowski

@bwiernik Would it make statistical sense to report the actual percentile as the IQR distance score? What about the CI distance score?

rempsyc avatar Sep 03 '22 16:09 rempsyc

Yes for both of those

bwiernik avatar Sep 03 '22 16:09 bwiernik

OK, so if everyone is alright with that, I'm going to submit a PR with all our agreed changes, including changing the distance scores to percentiles for the IQR and all CI methods.

rempsyc avatar Sep 03 '22 16:09 rempsyc