performance
`check_outliers` improvement (easystats/datawizard#177)
Context
This is a pull request aiming to improve the printing method of check_outliers, based on easystats/datawizard#177.
Specifically, it aims to accomplish the following in the print output: (a) state the methods used; (b) state the thresholds used; and (c) state the variables tested. It also aims to (d) report outliers per variable (for univariate methods), (e) report whether any observation comes up as an outlier for several variables (when that is the case), and (f) include an optional ID variable along with the row information. The changes were inspired by rempsyc::find_mad.
This is a prototype/proof of concept. (a) to (c) were implemented for all methods, but (d) to (f) were only implemented for the "zscore" method for now. Before working on this further, I would like to get feedback on whether it is worth implementing for the other methods, and whether modifications are needed before proceeding (as I would need to adapt the code to each method individually).
Reprex
Reprex demo of the changes below:
# Setup data
data <- datawizard::rownames_as_column(mtcars, var = "car")
# Basic test
performance::check_outliers(data, method = c("mahalanobis", "mcd", "zscore"))
#> 4 outliers detected: cases 9, 19, 30, 31.
#> - Based on the following methods: mahalanobis, mcd, zscore.
#> - Using the following thresholds: 21.92, 21.92, 1.96.
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> ------------------------------------------------------------------------
#> The following observations were considered outliers for more than one variable by the univariate methods:
#>
#> Row n_Zscore
#> 9 31 2
#>
#> ------------------------------------------------------------------------
#> Outliers per variable (univariate methods):
#>
#> $mpg
#> Row Distance_Zscore
#> 1 18 2.042389
#> 2 20 2.291272
#>
#> $hp
#> Row Distance_Zscore
#> 1 31 2.746567
#>
#> $drat
#> Row Distance_Zscore
#> 1 19 2.493904
#>
#> $wt
#> Row Distance_Zscore
#> 1 15 2.077505
#> 2 16 2.255336
#> 3 17 2.174596
#>
#> $qsec
#> Row Distance_Zscore
#> 1 9 2.826755
#>
#> $carb
#> Row Distance_Zscore
#> 1 30 1.973440
#> 2 31 3.211677
# Add ID information
outliers_list <- performance::check_outliers(
data, method = c("mahalanobis", "mcd", "zscore"), ID = "car")
outliers_list
#> 4 outliers detected: cases 9, 19, 30, 31.
#> - Based on the following methods: mahalanobis, mcd, zscore.
#> - Using the following thresholds: 21.92, 21.92, 1.96.
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> ------------------------------------------------------------------------
#> The following observations were considered outliers for more than one variable by the univariate methods:
#>
#> Row car n_Zscore
#> 9 31 Maserati Bora 2
#>
#> ------------------------------------------------------------------------
#> Outliers per variable (univariate methods):
#>
#> $mpg
#> Row car Distance_Zscore
#> 1 18 Fiat 128 2.042389
#> 2 20 Toyota Corolla 2.291272
#>
#> $hp
#> Row car Distance_Zscore
#> 1 31 Maserati Bora 2.746567
#>
#> $drat
#> Row car Distance_Zscore
#> 1 19 Honda Civic 2.493904
#>
#> $wt
#> Row car Distance_Zscore
#> 1 15 Cadillac Fleetwood 2.077505
#> 2 16 Lincoln Continental 2.255336
#> 3 17 Chrysler Imperial 2.174596
#>
#> $qsec
#> Row car Distance_Zscore
#> 1 9 Merc 230 2.826755
#>
#> $carb
#> Row car Distance_Zscore
#> 1 30 Ferrari Dino 1.973440
#> 2 31 Maserati Bora 3.211677
# Since only the printing method is modified, old features still work:
# The object is a binary vector...
filtered_data <- data[!outliers_list, ] # And can be used to filter a dataframe
nrow(filtered_data) # New size, 28 (4 outliers removed)
#> [1] 28
# Using `as.data.frame()`, we can access more details!
outliers_info <- as.data.frame(outliers_list)
head(outliers_info)
#> Distance_Zscore Outlier_Zscore Distance_Mahalanobis Outlier_Mahalanobis
#> 1 1.189901 0 8.946673 0
#> 2 1.189901 0 8.287933 0
#> 3 1.224858 0 8.937150 0
#> 4 1.122152 0 6.096726 0
#> 5 1.043081 0 5.429061 0
#> 6 1.564608 0 8.877558 0
#> Distance_MCD Outlier_MCD Outlier
#> 1 11.508353 0 0
#> 2 8.618865 0 0
#> 3 12.265382 0 0
#> 4 14.351997 0 0
#> 5 8.639128 0 0
#> 6 12.003840 0 0
outliers_info$Outlier # Including the probability of being an outlier
#> [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3333333
#> [8] 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.3333333 0.3333333 0.3333333 0.3333333 0.6666667 0.3333333 0.3333333
#> [22] 0.0000000 0.0000000 0.3333333 0.0000000 0.0000000 0.3333333 0.3333333
#> [29] 0.3333333 0.6666667 0.6666667 0.0000000
# For statistical models ---------------------------------------------
model <- lm(disp ~ mpg + hp, data = mtcars)
mod.outliers <- check_outliers(model)
mod.outliers
#> 1 outliers detected: cases 31.
#> - Based on the following methods: cook, pareto.
#> - Using the following thresholds: 0.81, 0.7.
#> - For variables: (Whole model)
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
# Check plots
plot(mod.outliers)
check_model(model)
# However, there seems to be a presentation issue when using a
# vector instead of a dataframe because then it is not possible to
# obtain the column name (since it has none), so it appears as x instead.
# Find all observations beyond +/- 2 SD
check_outliers(data$mpg, method = "zscore", threshold = 2)
#> 2 outliers detected: cases 18, 20.
#> - Based on the following methods: zscore.
#> - Using the following thresholds: 2.
#> - For variables: x
Created on 2022-07-01 by the reprex package (v2.0.1)
Observations
- I got used to programming with dplyr, so it was a nice challenge attempting to convert everything to base R and datawizard. Feel free to make suggestions to improve the code.
- There was a method called robust in the thresholds referring to mahalanobis_robust, and I have renamed it as such to avoid confusion with e.g., zscore_robust, and to stay consistent with the method names so it can be referred to later. There was also a threshold called zscore but none called zscore_robust, as the first one was used in both cases. Again, for clarity and compatibility with later code, I have given zscore_robust its own threshold.
- Personally, I don't like the output printing in red; it's difficult to read (I'm using a dark theme, so the contrast isn't good). The green is OK though. I tried changing the red to a smoother red, but it seems only a few colours are allowed with insight::print_color (a limitation of cat()?; see below). In any case, I think it shouldn't be red anyway (red would be more appropriate for errors), so I picked yellow for now, as I feel it is functionally close and much more readable.
  In .colour(colour = color, x = text) : `color` #FF4040 not yet supported.
- I got rid of the warning bit at the beginning of the output. It seems overkill since detecting outliers is the goal of the function, so it is almost confusing ("is there something wrong with the outlier detection process?", one might wonder), and it adds text without adding information.
- Currently, the information about which variables an outlier belongs to is not easily accessible. Thus, I had to apply e.g., .check_outliers_zscore again on individual columns with lapply.
- As seen in the reprex, there seems to be a problem when using a vector instead of a data frame, because then it is not possible to obtain the column name (since it has none), so it appears as x instead. Perhaps there would be a way to make printing the variables line contingent on providing a data frame.
- I also corrected some minor typos.
- The output formatting can be modified if you have a particular formatting convention at the easyverse that I have missed. Open to suggestions for improvement.
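On the vector-name point above, one possible way to recover the expression the user passed, so the output could show e.g. data$mpg instead of the placeholder x, is deparse(substitute()). This is an illustrative sketch with a hypothetical helper name, not the PR's implementation:

```r
# Illustrative sketch only: recover the expression used for a bare vector,
# so a print method could display "mtcars$mpg" instead of "x".
# `get_input_name` is a hypothetical helper, not part of performance.
get_input_name <- function(x) {
  deparse(substitute(x))
}

get_input_name(mtcars$mpg)
#> [1] "mtcars$mpg"
```

The name is only recoverable from the call, not from the vector itself, which is why the print method would need this kind of lookup.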
What's next
- [x] Implement per-variable output for each of the other univariate methods:
  - [x] "zscore"
  - [x] "zscore_robust"
  - [x] "iqr"
  - [x] "ci"
  - [x] "eti"
  - [x] "hdi"
  - [x] "bci"
- [x] Integrate all univariate methods in the outlier frequency table.
- [x] Also include multivariate detections in the outlier frequency table (maybe?) since column names don't need to be specified, so that should make them compatible. That would mean adding support for all multivariate/model-specific methods:
  - [x] "cook"
  - [x] "pareto"
  - [x] "mahalanobis"
  - [x] "mahalanobis_robust"
  - [x] "mcd"
  - [x] "ics"
  - [x] "optics"
  - [ ] "iforest"
  - [x] "lof"
- [x] Add support for grouped data frames
- [x] Add support for check_outliers.BFBayesFactor
- [x] Add support for check_outliers.gls
Questions
- Right now, the thresholds are displayed on a separate line. I was wondering if it would make sense to save one line by doing it instead like this:
  - Based on the following methods and thresholds: mahalanobis (21.92), iqr (1.5), zscore (1.96).
- At first, I was tempted to add (s) to all places where words could be either plural or singular, like so:
  9 outlier(s) detected: ...
  - Based on the following method(s): ...
  - Using the following threshold(s): ...
  - For variable(s): ...
  But I felt it impacted readability because cases were already given in parentheses on the first line (so I switched the parentheses for a colon). Other possibilities would be to report the cases on their own line (and use (s)), or to adapt the function to print a different message depending on the number of cases/methods/variables. I might be overthinking it, though; this may not be necessary at all.
- Since using multiple methods aims to reach a consensus (composite scores > 0.5), the number of outliers reported at the top can differ from the number of outliers per variable reported at the bottom (for the univariate methods).
- I think the detailed output might be a bit less interesting when using several methods, since there is already a decision protocol in place. Would it make more sense to only print that part when a single, univariate method is selected?
- Right now, outliers per variable are computed separately, but we could add the row and ID columns in the utilities section (.check_outliers_zscore, etc.) so that this info is also part of the outlier info data frame (outliers_info in the examples). Only if useful, though.
- One of the challenges of adapting rempsyc::find_mad to check_outliers is that the former only uses one method (zscore_robust), whereas the latter needs to support multiple methods, which complicates the output formatting, especially for the per-variable section. For example, it makes sense to have a by-column output for univariate methods but, by definition, not for multivariate ones.
- Another downside of the current approach is when using method = "all", because then the output will be very long. Perhaps we could only print (and compute) the second, per-variable part with an optional argument, detailed = TRUE (or the like), passed to check_outliers.
- Another possibility would be to print the long output only if a single method is selected, as suggested in point 3.
Looking forward to your comments and feedback.
Codecov Report
Merging #443 (2f04137) into main (23d81d0) will decrease coverage by 1.09%. The diff coverage is 6.55%.
@@ Coverage Diff @@
## main #443 +/- ##
==========================================
- Coverage 32.67% 31.58% -1.10%
==========================================
Files 80 80
Lines 4682 5047 +365
==========================================
+ Hits 1530 1594 +64
- Misses 3152 3453 +301
Impacted Files | Coverage Δ
---|---
R/item_intercor.R | 83.33% <ø> (ø)
R/check_outliers.R | 9.30% <6.55%> (+9.30%) :arrow_up:
Thanks a lot, impressive PR! Indeed, it takes some time to read your explanation and look at the changes, but I'll try to do this in the next few days.
One point: please check if check_model() resp. plot(check_outliers()) still work as expected.
Thanks so much. And no rush. We got time. And yes, I forgot to add the plot demo! Thank you for pointing that out. I have updated my reprex accordingly. 👍
Is there a reason you closed and deleted? @rempsyc
I'm sorry I realized I named my branch check_model
instead of check_outliers
, so I renamed it on my end thinking it would update here without breaking anything. Somehow it closed the PR! I'm not sure how to fix this... I just tried restoring the branch but I'm not sure it worked correctly.
The correct branch is here: https://github.com/rempsyc/performance/tree/check_outliers
However, I don't see how I can merge them back to this PR or whether I should open a new PR with the correct name. I'm afraid opening a new PR will lose the existing discussion here.
Maybe it's not a big deal to keep the wrong branch name after all. I'm sorry for this unexpected extra trouble!
Maybe use orange as the color, rather than red? Or let's pick a better red.
@strengejacke where are the insight colors specified?
?insight::print_color
provides the following information:
colour
Character vector, indicating the colour for printing. May be one of "red", "yellow", "green",
"blue", "violet", "cyan" or "grey".
So there is no orange (yet). It would indeed be nice to know how to add more colours. insight::print_color is basically only defined as cat(.colour(colour = color, x = text)), but I can find no information with ?.colour. Personally, though, I would probably go with orange; for a better red, I would choose a lighter one, almost pinkish.
I like the default R plotting palette col = 2
. That's the red I picked out when R updated its colors in 4.0
Ok colours are defined here: https://github.com/easystats/insight/blob/18b5aaee8735ff97022b045cd4d81ffe7e207ff2/R/colour_tools.R
E.g. .colour
is:
.colour <- function(colour = "red", x) {
switch(colour,
red = .red(x),
yellow = .yellow(x),
green = .green(x),
[...]
)
}
And individual colours each have their own function, e.g.:
.red <- function(x) {
if (.supports_color()) {
x[!is.na(x)] <- paste0("\033[31m", x[!is.na(x)], "\033[39m")
}
x
}
So definitely seems possible to add more choices. Should I attempt a PR to insight
to add colour col = 2
?
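For context, adding another colour would presumably just mean another small wrapper in insight's colour_tools.R. Here is a sketch of what an orange helper could look like, mirroring .red above; the ANSI code used for orange is my assumption (not insight's actual code), and the .supports_color() guard is omitted to keep the sketch self-contained:

```r
# Hypothetical .orange helper mirroring insight's .red (sketch only).
# "\033[38;5;208m" selects an orange from the ANSI 256-colour palette;
# "\033[39m" resets the foreground colour afterwards.
.orange <- function(x) {
  x[!is.na(x)] <- paste0("\033[38;5;208m", x[!is.na(x)], "\033[39m")
  x
}

cat(.orange("Warning-style text in orange\n"))
```

Whether this renders depends on the terminal; insight's real helpers check colour support first.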
Maybe let's just swap the current color palette (which I think is the default color palette used by {cli}) for one of the other palettes that is more accessible? Maybe cli::ansi_palette_show("vscode")?
![image](https://user-images.githubusercontent.com/4773225/181657970-e391fef7-a78d-471b-808c-b2f01fe1c170.png)
See cli::ansi_palettes["vscode",]
I think the bright colors there look pretty good on both light and dark backgrounds.
Another option that might be even better would be to support {cli}
's option getOption("cli.palette")
. If that is specified, we could call the relevant {cli}
function and use the user-specified palette rather than the default.
In any event, the color printing discussion can move to {insight}
. Let's get this one merged.
Another great enhancement in the future would be for this function to pull loo results for Stan-based models.
Should I convert this PR to a draft to avoid an accidental merge? Given that there are still questions/issues to be resolved. Perhaps it would be helpful if I were to rephrase my earlier questions as simplified suggestions instead? Here’s an attempt:
- [x] 1. Yes (include thresholds on same line as methods)
- [x] 2. OK to leave like this (always use plural)
- [x] 3. Yes (don’t print detailed output when > 1 method selected)
- [x] 4. Yes (add row and ID to outlier info data frame)
- [x] 5. Yes (don’t print detailed output when > 1 method selected)
Also note that lists/dataframes can’t be printed with insight::print_color
so they print white instead of yellow/red like the non-detailed output.
Still, I would like to receive the “green light” before moving forward with the rest of the changes in case this is not the outcome you desire. Thoughts?
Also note that lists/dataframes can’t be printed with insight::print_color so they print white instead of yellow/red like the non-detailed output.
That's good. A whole data frame would be too much in color I think
Sorry I didn't see the questions at the bottom of the first post.
- Same line like you suggest is fine
- Fine to leave plural for now. We should make an insight function for pluralizing words that we can use here and elsewhere @strengejacke
- Sure I think omitting detailed output is fine then. What all would show and not show? If we do that, do we want to default to a single method like Cook's D/LOO?
- I'm not sure what this question means.
- Agreed let's reduce output per (3)
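On the pluralization point, a minimal sketch of such a helper (hypothetical name; no such function exists in insight yet):

```r
# Hypothetical pluralizing helper (sketch): pick the singular or plural
# form of a word based on a count, defaulting to appending "s".
format_plural <- function(n, singular, plural = paste0(singular, "s")) {
  if (n == 1) singular else plural
}

format_plural(1, "outlier")
#> [1] "outlier"
format_plural(4, "outlier")
#> [1] "outliers"
```

An irregular plural can be supplied explicitly, e.g. format_plural(2, "index", "indices").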
Great! I'll start working on that. For 3, sorry that it wasn't clear. I'm referring to the demo from the example in the help file:
library(performance)
# Setup data
data <- datawizard::rownames_as_column(mtcars, var = "car")
# Add ID information
outliers_list <- performance::check_outliers(
data, method = c("mahalanobis", "mcd", "zscore"), ID = "car")
# Using `as.data.frame()`, we can access more details!
outliers_info <- as.data.frame(outliers_list)
head(outliers_info)
#> Distance_Zscore Outlier_Zscore Distance_Mahalanobis Outlier_Mahalanobis
#> 1 1.189901 0 8.946673 0
#> 2 1.189901 0 8.287933 0
#> 3 1.224858 0 8.937150 0
#> 4 1.122152 0 6.096726 0
#> 5 1.043081 0 5.429061 0
#> 6 1.564608 0 8.877558 0
#> Distance_MCD Outlier_MCD Outlier
#> 1 11.508353 0 0
#> 2 8.618865 0 0
#> 3 12.265382 0 0
#> 4 14.351997 0 0
#> 5 8.639128 0 0
#> 6 12.003840 0 0
Created on 2022-07-28 by the reprex package (v2.0.1)
So here we see that the data frame resulting from as.data.frame(outliers_list)
does not contain the ID information, although it was requested. Should I add it there as well, i.e., as a unique column at the very beginning? Furthermore, the row number is contained as the row names, but sometimes it is useful to have it as a column as well.
Ultimately, that data frame is similar to my detailed list output, except that it prints the information for all observations, not only outliers. Thus why I was thinking of adding ID/row there as well for consistency.
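For concreteness, adding the ID and row information to that data frame could look something like this (illustrative helper and names, not the PR's actual code):

```r
# Hypothetical sketch: prepend Row and the requested ID column to the
# details data frame returned by as.data.frame(outliers_list).
# `info` is that data frame, `data` the original data, `ID` the column name.
add_id_columns <- function(info, data, ID) {
  cbind(Row = seq_len(nrow(info)), data[ID], info)
}

info <- data.frame(Outlier = c(0, 1, 0))
dat <- data.frame(car = c("Mazda RX4", "Merc 230", "Volvo 142E"))
add_id_columns(info, dat, "car")
#>   Row        car Outlier
#> 1   1  Mazda RX4       0
#> 2   2   Merc 230       1
#> 3   3 Volvo 142E       0
```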
Sure I think omitting detailed output is fine then. What all would show and not show? If we do that, do we want to default to a single method like Cook's D/LOO?
If we are to print the detailed output for method = all
, then the Outliers per variable (univariate methods)
section would repeat again as many times as there are univariate methods (and would be renamed accordingly for each method, e.g., Outliers per variable (z-score)
). However, if we are to only print detailed output when a single method is selected, then the detailed section will not be visible.
I don't think we need to change the default methods. Sure, most people using multiple methods might never see the changes and detailed output, but I think that's ok since it would mostly be useful to those using single methods anyway. Plus, default method for class numeric is already a single univariate method (zscore_robust).
I think reduced output with "all" is good. And let's add the id as a column
@rempsyc @bwiernik please check if my latest changes (https://github.com/easystats/performance/pull/443/commits/c14396c55ad21843687099fe60feb489a7c095bc) are ok.
I think that's correct but @rempsyc would know better
Maybe use orange as the color, rather than red? Or let's pick a better red.
@strengejacke where are the insight colors specified?
Colours are defined in https://github.com/easystats/insight/blob/main/R/colour_tools.R
We also have a function called color_theme()
, which returns the currently used theme (https://github.com/easystats/insight/blob/main/R/print_color.R), however this doesn't work out RStudio, I think.
I'm going to revert that last commit (fix checkk issues, c14396c
) for now because I have changed the code substantially since my initial commit. The failed tests are normal because I had only implemented the new output for one method as a sort of proof of concept to get approval for larger changes. This will be fixed shortly.
I am still working on it (it's huge). But I'm almost done.
Edit: I cannot find how to do a revert commit from RStudio or from the GitHub website. Hum.
I think the commits should be reverted now.
Kind of long reprex, but here we go!
Reprex
# Load package
devtools::load_all()
#> ℹ Loading performance
# Setup data
data <- datawizard::rownames_as_column(mtcars, var = "car")
- Singular (vs. plural) is supported for “outlier”, “case”, “method”, “threshold”, and “variable”.
- Thresholds are now included in parenthesis next to the method.
- When a single variable or object is passed, the variable or object name for numeric vectors is now also retrieved through sys.call.
performance::check_outliers(data$mpg, method = "zscore", threshold = 2.2)
#> 1 outlier detected: case 20.
#> - Based on the following method and threshold: zscore (2.2).
#> - For variable: data$mpg.
#>
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore):
#>
#> $`data$mpg`
#> Row Distance_Zscore
#> 20 20 2.291272
- Repeated outliers are also flagged in a special count/frequency table, if any, with a count of how many variables they were flagged as outlier for.
performance::check_outliers(data, method = "zscore", threshold = 2.7)
#> 2 outliers detected: cases 9, 31.
#> - Based on the following method and threshold: zscore (2.7).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> Row n_Zscore
#> 1 31 2
#>
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore):
#>
#> $hp
#> Row Distance_Zscore
#> 31 31 2.746567
#>
#> $qsec
#> Row Distance_Zscore
#> 9 9 2.826755
#>
#> $carb
#> Row Distance_Zscore
#> 31 31 3.211677
- The count/frequency table also supports multiple methods.
- Outliers per variable are not printed when more than one method is selected to avoid excessively long outputs.
x <- performance::check_outliers(data, method = c("zscore", "iqr"))
x
#> 6 outliers detected: cases 9, 15, 16, 17, 20, 31.
#> - Based on the following methods and thresholds: zscore (1.96), iqr (1.5).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> Row n_Zscore n_IQR
#> 1 31 2 2
- However, they are accessible through attributes if needed.
attributes(x)$outlier_var$zscore
#> $mpg
#> Row Distance_Zscore
#> 18 18 2.042389
#> 20 20 2.291272
#>
#> $hp
#> Row Distance_Zscore
#> 31 31 2.746567
#>
#> $drat
#> Row Distance_Zscore
#> 19 19 2.493904
#>
#> $wt
#> Row Distance_Zscore
#> 15 15 2.077505
#> 16 16 2.255336
#> 17 17 2.174596
#>
#> $qsec
#> Row Distance_Zscore
#> 9 9 2.826755
#>
#> $carb
#> Row Distance_Zscore
#> 30 30 1.973440
#> 31 31 3.211677
attributes(x)$outlier_var$iqr
#> $mpg
#> Row Distance_IQR
#> 20 20 1
#>
#> $hp
#> Row Distance_IQR
#> 31 31 1
#>
#> $wt
#> Row Distance_IQR
#> 15 15 1
#> 16 16 1
#> 17 17 1
#>
#> $qsec
#> Row Distance_IQR
#> 9 9 1
#>
#> $carb
#> Row Distance_IQR
#> 31 31 1
- The count/frequency table is not printed when a single, multivariate method is selected (since it would be redundant).
performance::check_outliers(data, method = "mahalanobis")
#> 1 outlier detected: case 9.
#> - Based on the following method and threshold: mahalanobis (21.92).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
- Multivariate methods are also integrated in the count/frequency table, along with the univariate methods. It does look a bit monstrous when using all the methods at once, but in most situations it should work as expected.
performance::check_outliers(data, method = c(
"zscore", "zscore_robust", "iqr", "ci", "eti", "hdi", "bci", "mahalanobis",
"mahalanobis_robust", "mcd", "ics", "optics", "lof"))
#> 3 outliers detected: cases 9, 15, 31.
#> - Based on the following methods and thresholds: zscore (1.96), zscore_robust (1.96), iqr (1.5), ci (0.95), eti (0.95), hdi (0.95), bci (0.95), mahalanobis (21.92), mahalanobis_robust (21.92), mcd (21.92), ics (0.03), optics (22), lof (0.03).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> Row n_Zscore n_Zscore_robust n_IQR n_ci n_eti n_bci n_Mahalanobis
#> 1 31 2 4 2 2 2 3 0
#> 2 9 1 2 1 1 1 1 (Multivariate)
#> 3 15 1 2 1 1 1 2 0
#> 4 16 1 1 1 1 1 2 0
#> 5 18 1 3 0 0 0 0 0
#> 6 19 1 4 0 2 2 3 0
#> 7 20 1 3 1 2 2 2 0
#> 8 30 1 2 0 0 0 0 0
#> 9 28 0 4 0 1 1 1 0
#> 10 3 0 2 0 0 0 0 0
#> 11 26 0 2 0 0 0 0 0
#> 12 29 0 2 0 1 1 1 0
#> 13 32 0 2 0 0 0 0 0
#> 14 4 0 1 0 0 0 0 0
#> 15 8 0 1 0 0 0 0 0
#> 16 21 0 1 0 0 0 0 0
#> 17 27 0 1 0 0 0 1 0
#> 18 7 0 0 0 0 0 0 0
#> 19 24 0 0 0 0 0 0 0
#> n_Mahalanobis_robust n_MCD n_ICS n_LOF
#> 1 (Multivariate) (Multivariate) 0 (Multivariate)
#> 2 (Multivariate) (Multivariate) (Multivariate) 0
#> 3 0 0 0 (Multivariate)
#> 4 0 0 0 0
#> 5 0 0 0 0
#> 6 0 (Multivariate) 0 0
#> 7 0 0 0 0
#> 8 0 (Multivariate) 0 (Multivariate)
#> 9 (Multivariate) (Multivariate) 0 0
#> 10 0 0 0 0
#> 11 0 0 0 0
#> 12 (Multivariate) 0 (Multivariate) 0
#> 13 0 0 0 0
#> 14 0 0 0 (Multivariate)
#> 15 (Multivariate) (Multivariate) 0 0
#> 16 (Multivariate) (Multivariate) 0 0
#> 17 (Multivariate) (Multivariate) 0 0
#> 18 (Multivariate) (Multivariate) 0 0
#> 19 (Multivariate) (Multivariate) 0 0
- ID is supported for all those methods as well.
performance::check_outliers(data, method = c(
"zscore", "zscore_robust", "iqr", "ci", "eti", "hdi", "bci", "mahalanobis",
"mahalanobis_robust", "mcd", "ics", "optics", "lof"), ID = "car")
#> 3 outliers detected: cases 9, 15, 31.
#> - Based on the following methods and thresholds: zscore (1.96), zscore_robust (1.96), iqr (1.5), ci (0.95), eti (0.95), hdi (0.95), bci (0.95), mahalanobis (21.92), mahalanobis_robust (21.92), mcd (21.92), ics (0.03), optics (22), lof (0.03).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> Row car n_Zscore n_Zscore_robust n_IQR n_ci n_eti n_bci
#> 1 31 Maserati Bora 2 4 2 2 2 3
#> 2 9 Merc 230 1 2 1 1 1 1
#> 3 15 Cadillac Fleetwood 1 2 1 1 1 2
#> 4 16 Lincoln Continental 1 1 1 1 1 2
#> 5 18 Fiat 128 1 3 0 0 0 0
#> 6 19 Honda Civic 1 4 0 2 2 3
#> 7 20 Toyota Corolla 1 3 1 2 2 2
#> 8 30 Ferrari Dino 1 2 0 0 0 0
#> 9 28 Lotus Europa 0 4 0 1 1 1
#> 10 3 Datsun 710 0 2 0 0 0 0
#> 11 26 Fiat X1-9 0 2 0 0 0 0
#> 12 29 Ford Pantera L 0 2 0 1 1 1
#> 13 32 Volvo 142E 0 2 0 0 0 0
#> 14 4 Hornet 4 Drive 0 1 0 0 0 0
#> 15 8 Merc 240D 0 1 0 0 0 0
#> 16 21 Toyota Corona 0 1 0 0 0 0
#> 17 27 Porsche 914-2 0 1 0 0 0 1
#> 18 7 Duster 360 0 0 0 0 0 0
#> 19 24 Camaro Z28 0 0 0 0 0 0
#> n_Mahalanobis n_Mahalanobis_robust n_MCD n_ICS
#> 1 0 (Multivariate) (Multivariate) 0
#> 2 (Multivariate) (Multivariate) (Multivariate) (Multivariate)
#> 3 0 0 0 0
#> 4 0 0 0 0
#> 5 0 0 0 0
#> 6 0 0 (Multivariate) 0
#> 7 0 0 0 0
#> 8 0 0 (Multivariate) 0
#> 9 0 (Multivariate) (Multivariate) 0
#> 10 0 0 0 0
#> 11 0 0 0 0
#> 12 0 (Multivariate) 0 (Multivariate)
#> 13 0 0 0 0
#> 14 0 0 0 0
#> 15 0 (Multivariate) (Multivariate) 0
#> 16 0 (Multivariate) (Multivariate) 0
#> 17 0 (Multivariate) (Multivariate) 0
#> 18 0 (Multivariate) (Multivariate) 0
#> 19 0 (Multivariate) (Multivariate) 0
#> n_LOF
#> 1 (Multivariate)
#> 2 0
#> 3 (Multivariate)
#> 4 0
#> 5 0
#> 6 0
#> 7 0
#> 8 (Multivariate)
#> 9 0
#> 10 0
#> 11 0
#> 12 0
#> 13 0
#> 14 (Multivariate)
#> 15 0
#> 16 0
#> 17 0
#> 18 0
#> 19 0
- It still supports models
model <- lm(disp ~ mpg + hp, data = data)
check_outliers(model)
#> 1 outlier detected: case 31.
#> - Based on the following method and threshold: cook (0.81).
#> - For variable: (Whole model).
- And multiple methods including model methods:
check_outliers(model, method = c("cook", "optics", "lof"))
#> 1 outlier detected: case 31.
#> - Based on the following methods and thresholds: cook (0.81), optics (6), lof (0.03).
#> - For variable: (Whole model).
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> Row n_OPTICS n_LOF n_Cook
#> 1 31 (Multivariate) 0 (Multivariate)
#> 2 6 0 (Multivariate) 0
#> 3 10 0 (Multivariate) 0
#> 4 11 0 (Multivariate) 0
#> 5 22 0 (Multivariate) 0
#> 6 23 0 (Multivariate) 0
#> 7 28 0 (Multivariate) 0
- Bayesian models…
suppressMessages(library(rstanarm))
invisible(capture.output(model <- stan_glm(mpg ~ qsec + wt, data = data)))
check_outliers(model, method = "pareto", threshold = list("pareto" = 0.4))
#> 3 outliers detected: cases 9, 18, 20.
#> - Based on the following method and threshold: pareto (0.4).
#> - For variable: (Whole model).
- Multiple methods including Bayesian models:
check_outliers(model, method = c("pareto", "optics", "lof"),
threshold = list("pareto" = 0.4))
#> 1 outlier detected: case 9.
#> - Based on the following methods and thresholds: pareto (0.4), optics (6), lof (0.03).
#> - For variable: (Whole model).
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> Row n_OPTICS n_LOF n_Pareto
#> 1 9 (Multivariate) (Multivariate) (Multivariate)
#> 2 18 0 0 (Multivariate)
#> 3 20 0 0 (Multivariate)
- When using grouped data frames, the attributes are stored differently (by group). The outlier info is printed for each group. Because check_outliers is applied individually to each group, the default was that row numbers reset for each group (e.g., group 1 = 1:50, group 2 = 1:50, etc.). However, I realized that this could be confusing (if one wants to use that for decision making), so I have added an a posteriori correction to row numbers so that they reflect the original data set instead.
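The correction described above can be sketched roughly as follows (illustrative code, not the actual implementation): split() records each group's positions in the original data, so within-group row numbers can be mapped back.

```r
# Sketch of the a-posteriori row correction: map within-group row numbers
# back to positions in the original data frame. Names are illustrative.
groups <- split(seq_len(nrow(iris)), iris$Species)

# Suppose these rows were flagged within each group (1:50 numbering):
within_group_rows <- list(setosa = c(14, 16), virginica = c(18, 32))

# Translate them to original row numbers:
lapply(names(within_group_rows), function(g) groups[[g]][within_group_rows[[g]]])
#> [[1]]
#> [1] 14 16
#>
#> [[2]]
#> [1] 118 132
```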
suppressMessages(library("poorman"))
data.group <- iris %>%
group_by(Species)
z <- check_outliers(data.group, method = c("zscore", "iqr"))
z
#> 13 outliers detected: cases 14, 16, 23, 24, 25, 42, 44, 45, 99, 107, 118, 120, 132.
#> - Based on the following methods and thresholds: zscore (1.96), iqr (1.5).
#> - For variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> $setosa
#> Row n_Zscore n_IQR
#> 1 14 2 1
#> 2 16 2 1
#>
#> $versicolor
#> Row n_Zscore n_IQR
#> 1 58 2 0
#>
#> $virginica
#> Row n_Zscore n_IQR
#> 1 118 2 1
#> 2 132 2 1
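The a posteriori row correction described above can be sketched like this (a minimal illustration of the idea, not the actual implementation):

```r
# Map within-group row numbers back to positions in the original data,
# so that reported rows refer to the full data frame rather than 1:n
# within each group.
original_rows <- split(seq_len(nrow(iris)), iris$Species)

# e.g., rows 8 and 14 within the "versicolor" subset:
original_rows[["versicolor"]][c(8, 14)]
#> [1] 58 64
```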
- Per-variable info for a specific group and method may thus be obtained in the following way:
attributes(z)$outlier_var$versicolor$zscore
#> $Sepal.Length
#> Row Distance_Zscore
#> 1 51 2.061332
#> 8 58 2.007086
#>
#> $Sepal.Width
#> Row Distance_Zscore
#> 11 61 2.453805
#> 36 86 2.007659
#>
#> $Petal.Length
#> Row Distance_Zscore
#> 8 58 2.042940
#> 44 94 2.042940
#> 49 99 2.681359
#>
#> $Petal.Width
#> Row Distance_Zscore
#> 21 71 2.396933
- The outlier data now also includes the group, as this is more informative.
head(attributes(z)$data)
#> Row Species Distance_Zscore Outlier_Zscore Distance_IQR Outlier_IQR Outlier
#> 1 1 setosa 1.000000 0 0 0 0
#> 2 2 setosa 1.129096 0 0 0 0
#> 3 3 setosa 1.000000 0 0 0 0
#> 4 4 setosa 1.151807 0 0 0 0
#> 5 5 setosa 1.000000 0 0 0 0
#> 6 6 setosa 1.461300 0 0 0 0
- Single methods are of course also supported:
check_outliers(data.group, method = "zscore")
#> 25 outliers detected: cases 14, 15, 16, 19, 23, 24, 25, 34, 42, 44, 45, 51, 58, 61, 71, 86, 94, 99, 107, 118, 119, 120, 123, 132, 135.
#> - Based on the following method and threshold: zscore (1.96).
#> - For variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> $setosa
#> Row n_Zscore
#> 1 14 2
#> 2 16 2
#>
#> $versicolor
#> Row n_Zscore
#> 1 58 2
#>
#> $virginica
#> Row n_Zscore
#> 1 118 2
#> 2 132 2
#>
#>
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore):
#>
#> $setosa
#> $setosa$zscore
#> $setosa$zscore$Sepal.Length
#> Row Distance_Zscore
#> 14 14 2.002895
#> 15 15 2.252548
#> 16 16 1.968852
#> 19 19 1.968852
#>
#> $setosa$zscore$Sepal.Width
#> Row Distance_Zscore
#> 16 16 2.564208
#> 34 34 2.036593
#> 42 42 2.975748
#>
#> $setosa$zscore$Petal.Length
#> Row Distance_Zscore
#> 14 14 2.084485
#> 23 23 2.660310
#> 25 25 2.522112
#> 45 45 2.522112
#>
#> $setosa$zscore$Petal.Width
#> Row Distance_Zscore
#> 24 24 2.410197
#> 44 44 3.359093
#>
#>
#>
#> $versicolor
#> $versicolor$zscore
#> $versicolor$zscore$Sepal.Length
#> Row Distance_Zscore
#> 1 51 2.061332
#> 8 58 2.007086
#>
#> $versicolor$zscore$Sepal.Width
#> Row Distance_Zscore
#> 11 61 2.453805
#> 36 86 2.007659
#>
#> $versicolor$zscore$Petal.Length
#> Row Distance_Zscore
#> 8 58 2.042940
#> 44 94 2.042940
#> 49 99 2.681359
#>
#> $versicolor$zscore$Petal.Width
#> Row Distance_Zscore
#> 21 71 2.396933
#>
#>
#>
#> $virginica
#> $virginica$zscore
#> $virginica$zscore$Sepal.Length
#> Row Distance_Zscore
#> 7 107 2.654591
#> 32 132 2.063284
#>
#> $virginica$zscore$Sepal.Width
#> Row Distance_Zscore
#> 18 118 2.561267
#> 20 120 2.400025
#> 32 132 2.561267
#>
#> $virginica$zscore$Petal.Length
#> Row Distance_Zscore
#> 18 118 2.080107
#> 19 119 2.442495
#> 23 123 2.080107
#>
#> $virginica$zscore$Petal.Width
#> Row Distance_Zscore
#> 35 135 2.279264
- BFBayesFactor support:
suppressMessages(library(BayesFactor))
#> Warning in .recacheSubclasses(def@className, def, env): undefined subclass
#> "numericVector" of class "Mnumeric"; definition not updated
output <- regressionBF(rating ~ ., data = attitude, progress=FALSE)
check_outliers(output, threshold = 15)
#> 1 outlier detected: case 18.
#> - Based on the following method and threshold: mahalanobis (15).
#> - For variables: complaints, privileges, learning, raises, critical, advance, rating.
- BFBayesFactor, multiple methods:
check_outliers(output, method = c("zscore", "iqr", "mcd"))
#> 5 outliers detected: cases 6, 14, 16, 24, 26.
#> - Based on the following methods and thresholds: zscore (1.96), iqr (1.5), mcd (16.01).
#> - For variables: complaints, privileges, learning, raises, critical, advance, rating.
#>
#> Note: Outliers were classified as such by at least half of the selected methods.
#>
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables
#> by at least one of the selected methods:
#>
#> Row n_Zscore n_IQR n_MCD
#> 1 21 2 0 0
#> 2 24 2 0 (Multivariate)
#> 3 26 2 1 0
#> 4 14 1 0 (Multivariate)
#> 5 16 1 0 (Multivariate)
#> 6 9 0 0 (Multivariate)
#> 7 18 0 0 (Multivariate)
- gls support:
library(nlme)
fm1 <- gls(follicles ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary,
correlation = corAR1(form = ~ 1 | Mare))
check_outliers(fm1, method = "zscore_robust", threshold = list(zscore_robust = 2.2))
#> 10 outliers detected: cases 15, 43, 97, 126, 155, 183, 212, 240, 267, 295.
#> - Based on the following method and threshold: zscore_robust (2.2).
#> - For variable: (Whole model).
#>
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore_robust):
#>
#> $`cos(2 * pi * Time)`
#> Row Distance_Zscore_robust
#> 15 15 2.201852
#> 43 43 2.201852
#> 97 97 2.201852
#> 126 126 2.201852
#> 155 155 2.201852
#> 183 183 2.201852
#> 212 212 2.201852
#> 240 240 2.201852
#> 267 267 2.201852
#> 295 295 2.201852
- gls models already did not support multiple methods (so this is not a regression):
check_outliers(fm1, method = c("zscore_robust", "iqr"))
#> Error in if (!method %in% valid_methods) {: the condition has length > 1
check_outliers(fm1, method = "all")
#> Error in if (!method %in% valid_methods) {: the condition has length > 1
- The pareto method does not work properly with gls (in my testing):
check_outliers(fm1, method = "pareto", threshold = list(pareto = 0))
#> Converting missing values (`NA`) into regular values currently not
#> possible for variables of class 'NULL'.
#> OK: No outliers detected.
#> - Based on the following method and threshold: ().
#> - For variable: (Whole model)
This is because pareto has a special condition to trigger:
if ("pareto" %in% method & insight::model_info(x)$is_bayesian) {
And if we check, we realize that the fm1 model is not Bayesian:
insight::model_info(fm1)$is_bayesian
#> [1] FALSE
Therefore, I would need some pointers to continue troubleshooting this, such as an example of a correct, fully compatible gls model. This behaviour is not new: the previous version also could not find outliers on this model, regardless of the pareto threshold (even 0).
Created on 2022-08-12 by the reprex package (v2.0.1)
Comments
- [x] ~~I initially wrapped the coloured messages in `insight::format_message()`, since they can get pretty long with many methods. However, it gave strange line-break behaviour, such as jumping to a random line when less than half of my console width was used for the printed string, or when my console spanned more than 3/4 of my screen. So I ended up removing it.~~ [MOVED TO easystats/insight#610]
- [x] ~~When programming with `datawizard::data_filter()`, it does not allow me to use character strings for the logical expression of the filter argument. I could not make it work with `deparse(substitute())`, double curly brackets `{{ }}`, or the bang-bang operator `!!`, so I had to resort to base R instead. From the documentation, I saw that `data_filter()` is based on `subset()`, which itself carries the following warning: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like `[`, and in particular the non-standard evaluation of argument `subset` can have unanticipated consequences."~~ [MOVED TO easystats/datawizard#216]
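For reference, a minimal sketch of the kind of base-R fallback described above (a hypothetical example, not the code from this PR), where the filter condition arrives as a character string:

```r
# Evaluate a filter expression supplied as a character string against
# the columns of a data frame, then subset with `[`.
cond <- "Species == 'versicolor'"
keep <- eval(parse(text = cond), envir = iris)
out <- iris[keep, , drop = FALSE]
nrow(out)
#> [1] 50
```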
- [x] I was getting a warning with `method = "lof"`:

#> In dbscan::lof(x, k = ncol(x) - 1) :
#>   lof: k is now deprecated. use minPts = 11 instead.

I have thus adapted the function to use the `minPts` argument instead (and checked that the results are the same; they are).
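A sketch of the change (assuming `minPts = k + 1`, as the deprecation warning suggests):

```r
library(dbscan)

x <- scale(mtcars)  # 11 numeric columns

# Old call (now warns): dbscan::lof(x, k = ncol(x) - 1)
# New call, equivalent per the warning (minPts = k + 1):
scores <- dbscan::lof(x, minPts = ncol(x))
```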
- [x] There was a single `ci` threshold for all four CI methods (`ci`, `hdi`, `eti`, and `bci`). This created some problems down the line, so I gave each method its own threshold.
- [x] I was not sure why the `thresholds` attribute inherits all possible thresholds rather than only those of the respective methods, so I changed it to include only the selected methods.
- [x] Similarly, the model attributes inherit from both the `cook` and `pareto` methods, but only one of them is used (based on whether or not the model is Bayesian). So I changed it so that the inherited model attribute is also the one actually selected.
- [x] In the attributes, all the `data_method` entries were named in lower case, whereas ICS was in capitals. To make it consistent with the rest of the methods, I changed it to lower case; I find it easier to program and debug when everything is consistent.
- [x] ~~For the IQR and CI methods, an average is calculated for the distance (e.g., `Distance_IQR`), whereas the maximum is used for zscore. I wonder whether that is considered inconsistent, or whether it is intentional because the IQR scores seem to be only 0 or 1 (so is the average more informative than just 1?). In any case, I find it a bit strange, since people might expect the same logic across the different methods, and the average is affected by the number of columns.~~ [MOVED TO #467]
- [x] ~~There are discrepancies between the method thresholds in different places in the code. For example, the iqr threshold is set to 1.96 in `thresholds`, but defaults to 1.5 in `.check_outliers_iqr()`. What is also strange is that the iqr values do not seem to get bigger than 1 anyway.~~ [MOVED TO #467]
- [ ] For `method = "ics"`, there is a `tryCatch()` call at some point. I don't know much about this, but I thought it was used interactively when debugging and was not usually a normal part of mature functions. Does that mean this bit of code is a vestige of unfinished debugging?
- [x] ~~I haven't touched `method = "iforest"` because it was commented out. I tried uncommenting it to see whether I could figure out how to solve the issue, but I wasn't easily able to, so I commented it out again.~~ [MOVED TO #470]
- [x] ~~For `method = "lof"`, there is the following note: `# TODO: use tukey_mc from bigutilsr package`.~~ [MOVED TO #469]
- [x] ~~For `method = "optics"`, there is the following note: `# TODO: find automatic way of setting 'xi'`.~~ [MOVED TO #468]
- [x] ~~With `method = "optics"`, I am not able to detect outliers on the mtcars dataset with the default threshold (`2 * ncol(x) = 22`), whereas increasing the threshold beyond the default results in a warning (this might be related to the previous point).~~ [MOVED TO #468]
- [x] ~~I think the default threshold for the zscore methods (`threshold = 1.96`) is conservative. I wonder whether we should follow the more conventional 3 deviations instead. I don't know about the appropriateness of the thresholds for the other methods, however.~~ [MOVED TO #467]
- [x] In the documentation, the ellipsis is described as below. However, `.check_outliers_mahalanobis()` also seems to make use of the ellipsis, so I have updated the documentation accordingly:

  > When `method = "ics"`, further arguments in `...` are passed down to `ICSOutlier::ics.outlier()`.
- [x] It is not currently possible to include ID names when feeding a model, because the data extracted from the model does not contain the full original data frame with that information. In fact, attempting to add the ID argument when feeding a model currently throws an error, and I'm not too sure how to solve it because it is passed in the ellipsis. My attempts at transforming the ellipsis into a list and feeding it back to the respective function have so far failed. However, objects of class "model" are currently not compatible with `method = "ics"` or `method = "mahalanobis"` (alongside either `cook` or `pareto`), so I have simply removed the ellipsis for objects of class "model":
#> check_outliers(model, method = c("ics", "cook"))
#> "check_outliers()" does not support models of class "data.frame".
#> Error in .check_outliers_ics(x, threshold = thresholds$ics) :
#> trying to get slot "ics.dist.cutoff" from an object of a basic class ("NULL") with no slots
#>
#> check_outliers(model, method = c("mahalanobis", "cook"))
#> Error in solve.default(cov, ...) :
#> Lapack routine dgesv: system is exactly singular: U[1,1] = 0
So I had it throw a warning if people attempt it anyway:
#> Warning message:
#> In check_outliers.default(model, method = 'cook', ID = ID.names) :
#> ID argument not supported with objects of class 'model'
- [x] There is a test failing (there is only one test), but I can't figure out why, since the actual and expected error messages seem to be the same. [fixed by strengejacke]
- [x] ~~Colleagues I have convinced to use `check_outliers` have wondered why mahalanobis never finds any outliers when there is missing data. It seems that in the presence of NA values something goes wrong, even with the strictest of thresholds. So it turns out that colleagues might have erroneously reported no outliers just because they had a single missing value, since there is no appropriate warning to that effect. This outcome is not new; the cause seems to lie within the base R mahalanobis function. How would you suggest addressing this? Should we throw a warning and use `na.omit()` or a variant thereof, or just throw an error and ask people to deal with it beforehand? The latter seems to be how the other multivariate methods "deal" with it, so that's what I did for now.~~ [MOVED TO #466]
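A minimal demonstration of the problem (assuming the issue lies in NA propagation through `stats::mahalanobis()`):

```r
# A single missing value yields an NA distance for that row, which can
# then silently translate into "no outliers detected".
x <- mtcars
x[1, "mpg"] <- NA
d <- stats::mahalanobis(
  x,
  center = colMeans(x, na.rm = TRUE),
  cov = stats::cov(x, use = "pairwise.complete.obs")
)
is.na(d[1])
#> [1] TRUE
```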
- [x] ~~When using `datawizard` select helpers within functions, it triggers R CMD check warnings, e.g., `check_outliers.data.frame: no visible global function definition for 'contains'`, because they are not functions per se (I don't actually know what they are). What's the proper way to fix this warning? I could not find a single instance of the select helpers in all of `performance`'s functions.~~ [MOVED TO easystats/datawizard#218]
There are other warnings, but I will wait for your feedback before we discuss them, as this is already getting pretty long.
Ok, I tried moving as many points as possible to their respective issues, but please let me know if you think some of the other points deserve their own issue, and I will transfer them. I have also added check-boxes to distinguish simple notes explaining what I have changed from points that may require external input. I hope this new organization makes it more digestible and easier to manage.
Thanks a lot, that's really impressive! I'll try to look at this PR asap.
This is a lot of detail and work! Wow! If anyone wants me to look at something, can you @ me with a pointer to the spot?
> There is a test failing (there is only one test) but I can't figure out why since they seem to be correctly outputting the same error message.
The message is formatted using `insight::format_message()`. In the test environment, you have a certain line width, so it's likely that the output being tested against spans multiple lines. I shortened the string; it should work now.
I think the warnings need to be addressed; in particular, the usage of `data_filter()` probably needs to be replaced by base R, since we don't want to define global variables.