`check_outliers`: Clarification about thresholds and outlier scores
This is low priority, but I'm just putting it out there to free up some space in PR #443.
There are essentially four questions relating to thresholds in `check_outliers` and outlier scores:

- [x] 1. I think the default thresholds for `zscore` methods (`threshold = 1.96`) are conservative. I wonder whether we should follow the more conventional 3 deviations instead (conventional according to standards in psychology, that is). I don't know about the appropriateness of the thresholds of the other methods, however. Perhaps it is intentional for all methods to have rather conservative thresholds, to strongly encourage users to mindfully choose their own? Is that it? Clarification here would be appreciated. And if that's the case, I wonder whether that threshold justification would be worth a note in the documentation (maybe not, but it could be helpful for novice users).
- [x] 2. There are discrepancies among the method thresholds in different places within the code. For example, the `iqr` threshold is set to 1.96 in the general thresholds definition, but to 1.5 by default in `.check_outliers_iqr`. Maybe this is not a big deal, but would it be OK to harmonize the thresholds to the primary definition (e.g., in the documentation, but also in `.check_outliers_thresholds`) so they are all consistent? I feel that would be less confusing.
- [x] 3. What is also strange is that the `iqr` values don't seem to get bigger than 1 anyway (same for the confidence interval methods)? They are either 0 or 1. Is that normal? I would appreciate clarification on the use of thresholds > 1 if their values indeed cannot be greater than 1. To be clear, I am not referring to the average of all columns for those methods, but to when it is applied to each column (reprex below).
- [x] 4. For the IQR and CI methods, an average (of all variable outlier scores) is then calculated for the distance (e.g., `Distance_IQR`), whereas the max is used for `zscore`. I wonder if that is considered inconsistent, or if it is intentional because the IQR scores seem to be only 0 or 1 (so is it more informative to have the average than just 1?). In any case, I find it a bit strange, since people might expect the same logic to be applied across the different methods. And the average will be affected by the number of columns (a tiny sketch of this dilution follows this list).
```r
devtools::load_all()
#> ℹ Loading performance

# IQR
fun.iqr <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_iqr(y)
    unique(df$data_iqr$Distance_IQR)
  }) |> unlist()
}

fun.iqr(mtcars)
#>  mpg1  mpg2   cyl  disp   hp1   hp2  drat   wt1   wt2 qsec1 qsec2    vs    am
#>     0     1     0     0     0     1     0     0     1     0     1     0     0
#>  gear carb1 carb2
#>     0     0     1
fun.iqr(iris[-5])
#> Sepal.Length Sepal.Width1 Sepal.Width2 Petal.Length  Petal.Width
#>            0            0            1            0            0
fun.iqr(airquality)
#>   Ozone1   Ozone2   Ozone3 Solar.R1 Solar.R2    Wind1    Wind2     Temp
#>        0       NA        1        0       NA        0        1        0
#>    Month      Day
#>        0        0

# CI
fun.ci <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_ci(y, method = "ci")
    unique(df$data_ci$Distance_ci)
  }) |> unlist()
}

fun.ci(mtcars)
#>  mpg1  mpg2   cyl disp1 disp2   hp1   hp2 drat1 drat2   wt1   wt2 qsec1 qsec2
#>     0     1     0     0     1     0     1     0     1     0     1     0     1
#>    vs    am  gear carb1 carb2
#>     0     0     0     0     1
fun.ci(iris[-5])
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           0            0           0
#> [2,]            1           1            1           1
fun.ci(airquality)
#>   Ozone1   Ozone2   Ozone3 Solar.R1 Solar.R2 Solar.R3    Wind1    Wind2
#>        0      NaN        1        0      NaN        1        0        1
#>    Temp1    Temp2    Month     Day1     Day2
#>        0        1        0        0        1

# zscore can differ from 0 and 1
fun.zscore <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_zscore(y)
    unique(df$data_zscore$Distance_Zscore)
  }) |> unlist()
}

fun.zscore(mtcars) |> head()
#>     mpg1     mpg2     mpg3     mpg4     mpg5     mpg6
#> 1.000000 1.626170 2.439254 2.069670 2.716442 1.090273
fun.zscore(iris[-5]) |> head()
#> Sepal.Length1 Sepal.Length2 Sepal.Length3 Sepal.Length4 Sepal.Length5
#>      1.000000      1.059914      1.156270      1.348982      1.445337
#> Sepal.Length6
#>      1.252626
fun.zscore(airquality) |> head()
#>   Ozone1   Ozone2   Ozone3   Ozone4   Ozone5   Ozone6
#> 1.000000 1.175541 1.059914 3.218284 1.522422 3.989131
```
Created on 2022-08-16 by the reprex package (v2.0.1)
@bwiernik I'm planning on submitting #443 before this is resolved, but just pinging you in case you have some opinion on this that I could integrate before the merge.
Which do you want me to look at?
All four questions above. But we can start small, with no. 1.
(You can skip the reprex, it was only a demonstration of the point)
Edit: I misunderstood how IQR works; I fixed the comparison below.
- 3 SDs seems appropriate to me; 1.96 will flag way too many. The Tukey 1.5 * IQR threshold used for boxplots is equivalent to about 2.7 SDs, so it flags fewer cases than 1.96 SDs does for the z-score and confidence-interval methods, but more than 3 SDs does. I think ensuring similarity in thresholds across methods is important, and I suggest hewing closer to 3 SDs. So, that's 3 for zscore, 1.7 * IQR for IQR, .001 for the p-value-like ones, and .999 for the CI-like ones (a quick sanity check of the IQR equivalence is sketched after the list below). For Cook's D, it's the median of the implied F distribution, which is a good heuristic that doesn't flag too many influential cases; Pareto is similar. The OPTICS method I don't really know anything about.
So that's:
```r
list(
  zscore = stats::qnorm(p = 1 - 0.001),
  iqr = 1.7,
  ci = 0.999,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  ics = 0.001,
  optics = 2 * ncol(x),
  iforest = 0.001,
  lof = 0.001
)
```
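For reference, the quick sanity check mentioned above: under a standard normal assumption, the Tukey fences can be expressed in SD units (`fence_in_sds` is just an illustrative helper, not package code):

```r
# Upper Tukey fence in SD units, assuming N(0, 1) data
fence_in_sds <- function(k) {
  q3 <- stats::qnorm(0.75)       # third quartile of N(0, 1), ~0.674
  iqr <- q3 - stats::qnorm(0.25) # IQR of N(0, 1), ~1.349
  q3 + k * iqr
}
fence_in_sds(1.5) # ~2.70 SDs (the classic boxplot fence)
fence_in_sds(1.7) # ~2.97 SDs (close to the 3-SD convention)
```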
What do folks think?
- Yes please
3 and 4. That seems like a bug. I think it might be returning a logical of whether the points are outside the threshold, and those are being coerced to numbers?
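That kind of coercion would produce exactly 0/1 values; a toy illustration (not the package code):

```r
v <- c(1, 2, 50)
v > 10             # logical: FALSE FALSE TRUE
as.numeric(v > 10) # coerced to numbers: 0 0 1
```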
Thanks. For reference, here are the other thresholds in case you'd like me to change any other one:
```r
list(
  zscore = stats::qnorm(p = 1 - 0.025),
  zscore_robust = stats::qnorm(p = 1 - 0.025),
  iqr = 1.5,
  ci = 0.95,
  eti = 0.95,
  hdi = 0.95,
  bci = 0.95,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  mahalanobis_robust = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  ics = 0.025,
  optics = 2 * ncol(x),
  iforest = 0.025,
  lof = 0.025
)
```
Ok, I have resolved no. 3. It is because the threshold is not applied to the distance score as in other methods, but in an earlier step, to calculate the distance score itself (which therefore ends up as 0 or 1). So it is working as intended and is not a problem.
That only leaves no. 4.
About no. 4: I think it is not a bug. I will make a reprex shortly to show that it is calculated like this on purpose. However, it does raise the question of whether it is desirable behaviour and what could be changed.
I edited my comment above at the same time you were responding, so I've made some suggested threshold changes there
I'm not really understanding what computation is happening in 4. Can you walk me through it?
Here is the way IQR distances are calculated. Is this a bug or intentional, and should we change it?
```r
x <- mtcars
threshold <- 1.5
method <- "tukey"

d <- data.frame(Row = seq_len(nrow(as.data.frame(x))))

# This is the baddie function!
for (col in seq_len(ncol(as.data.frame(x)))) {
  v <- x[, col]
  if (method == "tukey") {
    iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE)
  } else {
    iqr <- stats::IQR(v, na.rm = TRUE)
  }
  lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold)
  upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold)
  d[names(as.data.frame(x))[col]] <- ifelse(v > upper, 1,
    ifelse(v < lower, 1, 0)
  )
}

head(d)
#>   Row mpg cyl disp hp drat wt qsec vs am gear carb
#> 1   1   0   0    0  0    0  0    0  0  0    0    0
#> 2   2   0   0    0  0    0  0    0  0  0    0    0
#> 3   3   0   0    0  0    0  0    0  0  0    0    0
#> 4   4   0   0    0  0    0  0    0  0  0    0    0
#> 5   5   0   0    0  0    0  0    0  0  0    0    0
#> 6   6   0   0    0  0    0  0    0  0  0    0    0
```
So if we decompose this to understand where 0 and 1 are coming from:
```r
col
#> [1] 11
(v <- x[, col])
#>  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75%
#>   2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25%
#>  -1
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75%
#>   7

# This is the culprit: an ifelse with 1 or 0 as outcome!
ifelse(v > upper, 1, ifelse(v < lower, 1, 0))
#>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
```
Created on 2022-08-24 by the reprex package (v2.0.1)
The same `ifelse` logic applies to the CI methods.
Oh, I was misunderstanding the computation/threshold for the IQR. The existing threshold of 1.5 is equivalent to 2.7 SDs. 1.7 would be equivalent to 3 SDs. I edited the post above.
Would it make sense to report the actual percentile as the IQR distance score, regardless of whether `method = "tukey"` or not, since (in my understanding) only the cut-off values would change? And would dividing the rank by the total number of observations be the right formula? E.g.,
```r
(v <- mtcars$carb)
#>  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
threshold <- 0.2
Distance_IQR <- rank(v, na.last = "keep") / length(v)
```
And then we can still identify outliers with the upper and lower boundaries as calculated before:
```r
(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75%
#>   2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25%
#> 1.6
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75%
#> 4.4
Outlier_IQR <- ifelse(v > upper, 1, ifelse(v < lower, 1, 0))

out <- data.frame(Distance_IQR, Outlier_IQR)
out
#>    Distance_IQR Outlier_IQR
#> 1      0.796875           0
#> 2      0.796875           0
#> 3      0.125000           1
#> 4      0.125000           1
#> 5      0.390625           0
#> 6      0.125000           1
#> 7      0.796875           0
#> 8      0.390625           0
#> 9      0.390625           0
#> 10     0.796875           0
#> 11     0.796875           0
#> 12     0.593750           0
#> 13     0.593750           0
#> 14     0.593750           0
#> 15     0.796875           0
#> 16     0.796875           0
#> 17     0.796875           0
#> 18     0.125000           1
#> 19     0.390625           0
#> 20     0.125000           1
#> 21     0.125000           1
#> 22     0.390625           0
#> 23     0.390625           0
#> 24     0.796875           0
#> 25     0.390625           0
#> 26     0.125000           1
#> 27     0.390625           0
#> 28     0.390625           0
#> 29     0.796875           0
#> 30     0.968750           1
#> 31     1.000000           1
#> 32     0.390625           0
```
Then for the composite score of all columns, we could take the max of all columns (like zscore, etc.), rather than the average.
Created on 2022-08-25 by the reprex package (v2.0.1)
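As a side note on that composite, here is how the two options would compare on hypothetical percentile distances for two columns (made-up numbers):

```r
# Hypothetical per-column percentile distances for three observations
d <- data.frame(col1 = c(0.40, 0.97, 0.50), col2 = c(0.35, 0.45, 0.99))
do.call(pmax, d) # max across columns:     0.40 0.97 0.99
rowMeans(d)      # average across columns: 0.375 0.710 0.745
```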
@rempsyc Could you please add a short text to the NEWS file about the changes? I'm not sure I can appropriately summarise your whole work.
I would prefer to use non-thresholded stats for everything and only use the thresholds when drawing threshold lines on plots or when flagging cases
> or when flagging cases

Yup, that's the purpose of `check_outliers`: to identify possible outliers.
@bwiernik Would it make statistical sense to report the actual percentile as the IQR distance score? What about the CI distance score?
Yes for both of those
Ok so if everyone is alright with that, I'm going to submit a PR with all our agreed changes, including changing the distance scores to percentiles for IQR and all CI methods.