
`check_outlier` improvement (easystats/datawizard#177)

Open rempsyc opened this issue 2 years ago • 25 comments

Context

This is a pull request aiming to improve the printing method of check_outliers, based on easystats/datawizard#177.

Specifically, it aims to accomplish the following in the print output: (a) state the methods used; (b) state the thresholds used; and (c) state the variables tested. Additionally, it also aims to (d) report outliers per variable (for univariate methods), (e) report whether any observation comes up as outlier for several variables (when that is the case), and (f) include an optional ID variable along the row information. The changes were inspired by rempsyc::find_mad.

This is a prototype/proof of concept. (a) to (c) were implemented for all methods, but (d) to (f) were only implemented for method "zscore" for now. Before working on this further, I would like to get feedback to know whether it is worth implementing for other methods, and if modifications are needed before proceeding (as I would need to adapt the code to each method individually).

Reprex

Reprex demo of the changes below:

# Setup data
data <- datawizard::rownames_as_column(mtcars, var = "car")

# Basic test
performance::check_outliers(data, method = c("mahalanobis", "mcd", "zscore"))
#> 4 outliers detected: cases 9, 19, 30, 31.
#> - Based on the following methods: mahalanobis, mcd, zscore.
#> - Using the following thresholds: 21.92, 21.92, 1.96.
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> ------------------------------------------------------------------------
#> The following observations were considered outliers for more than one variable by the univariate methods: 
#> 
#>   Row n_Zscore
#> 9  31        2
#> 
#> ------------------------------------------------------------------------
#> Outliers per variable (univariate methods): 
#> 
#> $mpg
#>   Row Distance_Zscore
#> 1  18        2.042389
#> 2  20        2.291272
#> 
#> $hp
#>   Row Distance_Zscore
#> 1  31        2.746567
#> 
#> $drat
#>   Row Distance_Zscore
#> 1  19        2.493904
#> 
#> $wt
#>   Row Distance_Zscore
#> 1  15        2.077505
#> 2  16        2.255336
#> 3  17        2.174596
#> 
#> $qsec
#>   Row Distance_Zscore
#> 1   9        2.826755
#> 
#> $carb
#>   Row Distance_Zscore
#> 1  30        1.973440
#> 2  31        3.211677

# Add ID information
outliers_list <- performance::check_outliers(
  data, method = c("mahalanobis", "mcd", "zscore"), ID = "car")
outliers_list
#> 4 outliers detected: cases 9, 19, 30, 31.
#> - Based on the following methods: mahalanobis, mcd, zscore.
#> - Using the following thresholds: 21.92, 21.92, 1.96.
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> ------------------------------------------------------------------------
#> The following observations were considered outliers for more than one variable by the univariate methods: 
#> 
#>   Row           car n_Zscore
#> 9  31 Maserati Bora        2
#> 
#> ------------------------------------------------------------------------
#> Outliers per variable (univariate methods): 
#> 
#> $mpg
#>   Row            car Distance_Zscore
#> 1  18       Fiat 128        2.042389
#> 2  20 Toyota Corolla        2.291272
#> 
#> $hp
#>   Row           car Distance_Zscore
#> 1  31 Maserati Bora        2.746567
#> 
#> $drat
#>   Row         car Distance_Zscore
#> 1  19 Honda Civic        2.493904
#> 
#> $wt
#>   Row                 car Distance_Zscore
#> 1  15  Cadillac Fleetwood        2.077505
#> 2  16 Lincoln Continental        2.255336
#> 3  17   Chrysler Imperial        2.174596
#> 
#> $qsec
#>   Row      car Distance_Zscore
#> 1   9 Merc 230        2.826755
#> 
#> $carb
#>   Row           car Distance_Zscore
#> 1  30  Ferrari Dino        1.973440
#> 2  31 Maserati Bora        3.211677

# Since only the printing method is modified, old features still work:

# The object is a binary vector...
filtered_data <- data[!outliers_list, ] # And can be used to filter a dataframe
nrow(filtered_data) # New size, 28 (4 outliers removed)
#> [1] 28

# Using `as.data.frame()`, we can access more details!
outliers_info <- as.data.frame(outliers_list)
head(outliers_info)
#>   Distance_Zscore Outlier_Zscore Distance_Mahalanobis Outlier_Mahalanobis
#> 1        1.189901              0             8.946673                   0
#> 2        1.189901              0             8.287933                   0
#> 3        1.224858              0             8.937150                   0
#> 4        1.122152              0             6.096726                   0
#> 5        1.043081              0             5.429061                   0
#> 6        1.564608              0             8.877558                   0
#>   Distance_MCD Outlier_MCD Outlier
#> 1    11.508353           0       0
#> 2     8.618865           0       0
#> 3    12.265382           0       0
#> 4    14.351997           0       0
#> 5     8.639128           0       0
#> 6    12.003840           0       0
outliers_info$Outlier # Including the probability of being an outlier
#>  [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3333333
#>  [8] 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.3333333 0.3333333 0.3333333 0.3333333 0.6666667 0.3333333 0.3333333
#> [22] 0.0000000 0.0000000 0.3333333 0.0000000 0.0000000 0.3333333 0.3333333
#> [29] 0.3333333 0.6666667 0.6666667 0.0000000
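The note "Outliers were classified as such by at least half of the selected methods" describes a consensus rule: the `Outlier` column above is the proportion of selected methods that flagged each observation. A minimal sketch of that idea, using three hypothetical 0/1 indicator vectors (an illustration, not the package's actual internals):

```r
# Hypothetical 0/1 flags for three observations across three methods,
# mimicking the Outlier_* columns shown above (illustration only)
flags <- data.frame(
  Outlier_Zscore      = c(0, 1, 1),
  Outlier_Mahalanobis = c(0, 0, 1),
  Outlier_MCD         = c(0, 1, 1)
)

# The composite score is the proportion of methods flagging each row
composite <- rowMeans(flags)  # 0, 2/3, 1

# "At least half of the selected methods" -> consensus threshold of 0.5
consensus_outliers <- composite >= 0.5
```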

# For statistical models ---------------------------------------------
model <- lm(disp ~ mpg + hp, data = mtcars)
mod.outliers <- check_outliers(model)
mod.outliers
#> 1 outliers detected: cases 31.
#> - Based on the following methods: cook, pareto.
#> - Using the following thresholds: 0.81, 0.7.
#> - For variables: (Whole model)
#> 
#> Note: Outliers were classified as such by at least half of the selected methods.

# Check plots
plot(mod.outliers)

check_model(model)

# However, there seems to be a presentation issue when using a
# vector instead of a dataframe because then it is not possible to
# obtain the column name (since it has none), so it appears as x instead.

# Find all observations beyond +/- 2 SD
check_outliers(data$mpg, method = "zscore", threshold = 2)
#> 2 outliers detected: cases 18, 20.
#> - Based on the following methods: zscore.
#> - Using the following thresholds: 2.
#> - For variables: x

Created on 2022-07-01 by the reprex package (v2.0.1)

Observations

  • I got used to programming with dplyr, so it was a nice challenge attempting to convert everything to base R and datawizard. Feel free to make suggestions to improve the code.
  • There was a method called robust in thresholds that referred to mahalanobis_robust; I renamed it to mahalanobis_robust to avoid confusion with, e.g., zscore_robust, and to stay consistent with the method names so it can be referred to later in the code. There was also a threshold called zscore but none called zscore_robust (the former was used for both); again, for clarity and compatibility with later code, I gave zscore_robust its own threshold.
  • Personally, I don't like the output printing in red; it's difficult to read (I'm using a dark theme, so the contrast isn't good). The green is OK though. I tried changing the red to a softer one, but it seems only a few colours are allowed with insight::print_color (a limitation of cat()?; see below). In any case, I think it shouldn't be red anyway, since red is more suited to errors, so I picked yellow for now, as it is functionally close and much more readable.
In .colour(colour = color, x = text) : `color` #FF4040 not yet supported.
  • I got rid of the warning bit at the beginning of the output. It seems overkill, since detecting outliers is the very goal of the function, so it is almost confusing ("is there something wrong with the outlier detection process?", one might wonder), and it adds text without providing real information.
  • Currently, information about which variables each outlier belongs to is not easily accessible. Thus, I had to apply, e.g., .check_outliers_zscore again on individual columns with lapply.
  • As seen in the reprex, there seems to be a problem when using a vector instead of a dataframe because then it is not possible to obtain the column name (since it has none), so it appears as x instead. Perhaps there would be a way to make printing the variables line contingent on providing a dataframe.
  • I also corrected some minor typos.
  • The output formatting can be adjusted if there is a particular formatting style in the easyverse that I have missed. Open to suggestions for improvement.

What's next

  • [x] Implement per-variable output for each of the other univariate methods:
    • [x] "zscore"
    • [x] "zscore_robust"
    • [x] "iqr"
    • [x] "ci"
    • [x] "eti"
    • [x] "hdi"
    • [x] "bci"
  • [x] Integrate all univariate methods in the outlier frequency table.
  • [x] Also include multivariate detections in the outlier frequency table (maybe?) since column names don't need to be specified so that should make them compatible. That would mean adding support for all multivariate/model-specific methods:
    • [x] "cook"
    • [x] "pareto"
    • [x] "mahalanobis"
    • [x] "mahalanobis_robust"
    • [x] "mcd"
    • [x] "ics"
    • [x] "optics"
    • [ ] "iforest"
    • [x] "lof"
  • [x] Add support for grouped data frames
  • [x] Add support for check_outliers.BFBayesFactor
  • [x] Add support for check_outliers.gls

Questions

  1. Right now, the thresholds are displayed on a separate line. I was wondering if it would make sense to save one line by doing it instead like this: - Based on the following methods and thresholds: mahalanobis (21.92), iqr (1.5), zscore (1.96).
  2. At first, I was tempted to add (s) to all places where words could be either plural or singular, like so:
9 outlier(s) detected: ...
- Based on the following method(s): ...
- Using the following threshold(s): ...
- For variable(s): ...
  • But I felt it impacted readability because cases were already given in parentheses on the first line (so I switched the parentheses for a colon). Other possibilities would be to report the cases on its own line (and use (s)) or yet again adapt the function to print a different message depending on the number of cases/methods/variables. Might be overthinking and not at all necessary though.
  3. Since using multiple methods aims to reach a consensus (composite scores > 0.5), the number of outliers reported at the top can differ from the number of outliers per variable as reported at the bottom (for the univariate methods).
    • I think it might be a bit less interesting to get the detailed output when using several methods since there is already a decision protocol in place. Would it make more sense to only print that part when a single, univariate method is selected?
  4. Right now, outliers per variable are computed separately, but we could add the row and ID columns in the utilities section (.check_outliers_zscore, etc.) so that this info is also part of the outlier info data frame (outliers_info in the examples). Only if useful though.
  5. One of the challenges of adapting rempsyc::find_mad to check_outliers is that the former only uses one method (zscore_robust), whereas the latter needs to support multiple methods, which complicates the output formatting, especially for the per-variable section.
    • For example, it makes sense to have a by-column output for univariate methods, but, by definition, not for multivariate ones.
    • Another downside to the current approach is when using method = "all", because then the output will be very long. Perhaps we could only print (and compute) the second per-variable part with an optional argument, detailed = TRUE (or the like) passed to check_outliers.
    • Another possibility would be to print the long output only if a single method is selected like suggested in point 3.

Looking forward to your comments and feedback.

rempsyc avatar Jun 30 '22 21:06 rempsyc

Codecov Report

Merging #443 (2f04137) into main (23d81d0) will decrease coverage by 1.09%. The diff coverage is 6.55%.

@@            Coverage Diff             @@
##             main     #443      +/-   ##
==========================================
- Coverage   32.67%   31.58%   -1.10%     
==========================================
  Files          80       80              
  Lines        4682     5047     +365     
==========================================
+ Hits         1530     1594      +64     
- Misses       3152     3453     +301     
Impacted Files Coverage Δ
R/item_intercor.R 83.33% <ø> (ø)
R/check_outliers.R 9.30% <6.55%> (+9.30%) :arrow_up:


codecov-commenter avatar Jul 01 '22 04:07 codecov-commenter

Thanks a lot, impressive PR! Indeed, it takes some time to read your explanation and look at the changes, but I'll try to do this in the next few days.

strengejacke avatar Jul 01 '22 05:07 strengejacke

One point: please check whether check_model() and plot(check_outliers()) still work as expected.

strengejacke avatar Jul 01 '22 05:07 strengejacke

Thanks so much. And no rush. We got time. And yes, I forgot to add the plot demo! Thank you for pointing that out. I have updated my reprex accordingly. 👍

rempsyc avatar Jul 01 '22 13:07 rempsyc

Is there a reason you closed and deleted? @rempsyc

bwiernik avatar Jul 15 '22 22:07 bwiernik

I'm sorry I realized I named my branch check_model instead of check_outliers, so I renamed it on my end thinking it would update here without breaking anything. Somehow it closed the PR! I'm not sure how to fix this... I just tried restoring the branch but I'm not sure it worked correctly.

rempsyc avatar Jul 15 '22 22:07 rempsyc

The correct branch is here: https://github.com/rempsyc/performance/tree/check_outliers

However, I don't see how I can merge them back to this PR or whether I should open a new PR with the correct name. I'm afraid opening a new PR will lose the existing discussion here.

Maybe it's not a big deal to keep the wrong branch name after all. I'm sorry for this unexpected extra trouble!

rempsyc avatar Jul 15 '22 22:07 rempsyc

Maybe use orange as the color, rather than red? Or let's pick a better red.

@strengejacke where are the insight colors specified?

bwiernik avatar Jul 24 '22 10:07 bwiernik

?insight::print_color provides the following information:

colour	
Character vector, indicating the colour for printing. May be one of "red", "yellow", "green",
"blue", "violet", "cyan" or "grey".

So there is no orange (yet). It would indeed be nice to know how to add more colours. insight::print_color is basically just defined as cat(.colour(colour = color, x = text)), but I can find no information with ?.colour. Personally though, I would probably go with orange; for a better red, I would choose a lighter, almost pinkish one.

rempsyc avatar Jul 25 '22 00:07 rempsyc

I like the default R plotting palette col = 2. That's the red I picked out when R updated its colors in 4.0

bwiernik avatar Jul 25 '22 01:07 bwiernik

Ok colours are defined here: https://github.com/easystats/insight/blob/18b5aaee8735ff97022b045cd4d81ffe7e207ff2/R/colour_tools.R

E.g. .colour is:

.colour <- function(colour = "red", x) {
  switch(colour,
    red = .red(x),
    yellow = .yellow(x),
    green = .green(x),
    [...]
     )
}

And individual colours each have their own function, e.g.:

.red <- function(x) {
  if (.supports_color()) {
    x[!is.na(x)] <- paste0("\033[31m", x[!is.na(x)], "\033[39m")
  }
  x
}

So definitely seems possible to add more choices. Should I attempt a PR to insight to add colour col = 2?
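For illustration, a new colour could follow the same pattern as .red() above. The helper below is hypothetical (not part of {insight}), and 208 is assumed here as the 256-colour ANSI code commonly used for orange:

```r
# Hypothetical .orange() following the .red() pattern shown above.
# The .supports_color() guard from {insight} is omitted for brevity;
# "38;5;208" selects colour 208 (orange) from the 256-colour ANSI palette.
.orange <- function(x) {
  x[!is.na(x)] <- paste0("\033[38;5;208m", x[!is.na(x)], "\033[39m")
  x
}

cat(.orange("This would print in orange on 256-colour terminals"), "\n")
```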

rempsyc avatar Jul 25 '22 01:07 rempsyc

Maybe let's just swap the current color palette (which I think is the default color palette used by {cli}) for one of the other palettes that is more accessible? Maybe cli::ansi_palette_show("vscode")?


See cli::ansi_palettes["vscode",]

I think the bright colors there look pretty good on both light and dark backgrounds.

Another option that might be even better would be to support {cli}'s option getOption("cli.palette"). If that is specified, we could call the relevant {cli} function and use the user-specified palette rather than the default.

bwiernik avatar Jul 29 '22 00:07 bwiernik

In any event, the color printing discussion can move to {insight}. Let's get this one merged.

Another great enhancement in the future would be for this function to pull loo results for Stan based models

bwiernik avatar Jul 29 '22 00:07 bwiernik

Should I convert this PR to a draft to avoid an accidental merge, given that there are still questions/issues to be resolved? Perhaps it would be helpful if I rephrased my earlier questions as simplified suggestions instead? Here's an attempt:

  • [x] 1. Yes (include thresholds on same line as methods)
  • [x] 2. OK to leave like this (always use plural)
  • [x] 3. Yes (don’t print detailed output when > 1 method selected)
  • [x] 4. Yes (add row and ID to outlier info data frame)
  • [x] 5. Yes (don’t print detailed output when > 1 method selected)

Also note that lists/dataframes can’t be printed with insight::print_color so they print white instead of yellow/red like the non-detailed output.

Still, I would like to receive the “green light” before moving forward with the rest of the changes in case this is not the outcome you desire. Thoughts?

rempsyc avatar Jul 29 '22 00:07 rempsyc

Also note that lists/dataframes can’t be printed with insight::print_color so they print white instead of yellow/red like the non-detailed output.

That's good. A whole data frame would be too much in color I think

bwiernik avatar Jul 29 '22 00:07 bwiernik

Sorry I didn't see the questions at the bottom of the first post.

  1. Same line like you suggest is fine
  2. Fine to leave plural for now. We should make an insight function for pluralizing words that we can use here and elsewhere @strengejacke
  3. Sure I think omitting detailed output is fine then. What all would show and not show? If we do that, do we want to default to a single method like Cook's D/LOO?
  4. I'm not sure what this question means.
  5. Agreed let's reduce output per (3)

bwiernik avatar Jul 29 '22 01:07 bwiernik

Great! I'll start working on that. For 3, sorry that it wasn't clear. I'm referring to the demo from the example in the help file:

library(performance)

# Setup data
data <- datawizard::rownames_as_column(mtcars, var = "car")

# Add ID information
outliers_list <- performance::check_outliers(
  data, method = c("mahalanobis", "mcd", "zscore"), ID = "car")

# Using `as.data.frame()`, we can access more details!
outliers_info <- as.data.frame(outliers_list)
head(outliers_info)
#>   Distance_Zscore Outlier_Zscore Distance_Mahalanobis Outlier_Mahalanobis
#> 1        1.189901              0             8.946673                   0
#> 2        1.189901              0             8.287933                   0
#> 3        1.224858              0             8.937150                   0
#> 4        1.122152              0             6.096726                   0
#> 5        1.043081              0             5.429061                   0
#> 6        1.564608              0             8.877558                   0
#>   Distance_MCD Outlier_MCD Outlier
#> 1    11.508353           0       0
#> 2     8.618865           0       0
#> 3    12.265382           0       0
#> 4    14.351997           0       0
#> 5     8.639128           0       0
#> 6    12.003840           0       0

Created on 2022-07-28 by the reprex package (v2.0.1)

So here we see that the data frame resulting from as.data.frame(outliers_list) does not contain the ID information, although it was requested. Should I add it there as well, i.e., as a unique column at the very beginning? Furthermore, the row number is contained as the row names, but sometimes it is useful to have it as a column as well.

Ultimately, that data frame is similar to my detailed list output, except that it prints the information for all observations, not only outliers. That is why I was thinking of adding ID/row there as well, for consistency.

Sure I think omitting detailed output is fine then. What all would show and not show? If we do that, do we want to default to a single method like Cook's D/LOO?

If we are to print the detailed output for method = all, then the Outliers per variable (univariate methods) section would repeat again as many times as there are univariate methods (and would be renamed accordingly for each method, e.g., Outliers per variable (z-score)). However, if we are to only print detailed output when a single method is selected, then the detailed section will not be visible.

I don't think we need to change the default methods. Sure, most people using multiple methods might never see the changes and detailed output, but I think that's ok since it would mostly be useful to those using single methods anyway. Plus, default method for class numeric is already a single univariate method (zscore_robust).

rempsyc avatar Jul 29 '22 01:07 rempsyc

I think reduced output with "all" is good. And let's add the id as a column

bwiernik avatar Jul 29 '22 01:07 bwiernik

@rempsyc @bwiernik please check if my latest changes (https://github.com/easystats/performance/pull/443/commits/c14396c55ad21843687099fe60feb489a7c095bc) are ok.

strengejacke avatar Aug 10 '22 07:08 strengejacke

I think that's correct but @rempsyc would know better

bwiernik avatar Aug 10 '22 08:08 bwiernik

Maybe use orange as the color, rather than red? Or let's pick a better red.

@strengejacke where are the insight colors specified?

Colours are defined in https://github.com/easystats/insight/blob/main/R/colour_tools.R. We also have a function called color_theme(), which returns the currently used theme (https://github.com/easystats/insight/blob/main/R/print_color.R); however, this doesn't work in RStudio, I think.

strengejacke avatar Aug 10 '22 08:08 strengejacke

I'm going to revert that last commit (fix checkk issues, c14396c) for now because I have changed the code substantially since my initial commit. The failed tests are normal because I had only implemented the new output for one method as a sort of proof of concept to get approval for larger changes. This will be fixed shortly.

I am still working on it (it's huge). But I'm almost done.

Edit: I cannot find how to do a revert commit from RStudio or from the GitHub website. Hum.

rempsyc avatar Aug 10 '22 15:08 rempsyc

I think the commits should be reverted now.

strengejacke avatar Aug 10 '22 15:08 strengejacke

Kind of long reprex, but here we go!

Reprex

# Load package
devtools::load_all()
#> ℹ Loading performance
# Setup data
data <- datawizard::rownames_as_column(mtcars, var = "car")
  1. Singular (vs. plural) is supported for “outlier”, “case”, “method”, “threshold”, and “variable”.
  2. Thresholds are now included in parenthesis next to the method.
  3. When a single variable or object is passed, the variable or object name (for numeric vectors) is now also obtained through sys.call().
performance::check_outliers(data$mpg, method = "zscore", threshold = 2.2)
#> 1 outlier detected: case 20.
#> - Based on the following method and threshold: zscore (2.2).
#> - For variable: data$mpg.
#> 
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore): 
#> 
#> $`data$mpg`
#>    Row Distance_Zscore
#> 20  20        2.291272
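The name capture mentioned in point 3 can be sketched roughly as follows; this is a simplified illustration of the idea, not the PR's actual code, and name_of_arg() is a hypothetical helper:

```r
# Hypothetical helper: recover the expression a caller passed, so a bare
# numeric vector like data$mpg can be labelled by name in the output.
name_of_arg <- function(x) {
  cl <- sys.call()   # the unevaluated call, e.g. name_of_arg(mtcars$mpg)
  deparse(cl[[2]])   # deparse the first argument of that call
}

name_of_arg(mtcars$mpg)  # "mtcars$mpg"
```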
  4. Repeated outliers, if any, are also flagged in a special count/frequency table, with a count of how many variables each was flagged as an outlier for.
performance::check_outliers(data, method = "zscore", threshold = 2.7)
#> 2 outliers detected: cases 9, 31.
#> - Based on the following method and threshold: zscore (2.7).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#>   Row n_Zscore
#> 1  31        2
#> 
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore): 
#> 
#> $hp
#>    Row Distance_Zscore
#> 31  31        2.746567
#> 
#> $qsec
#>   Row Distance_Zscore
#> 9   9        2.826755
#> 
#> $carb
#>    Row Distance_Zscore
#> 31  31        3.211677
  5. The count/frequency table also supports multiple methods.
  6. Outliers per variable are not printed when more than one method is selected, to avoid excessively long outputs.
x <- performance::check_outliers(data, method = c("zscore", "iqr"))
x
#> 6 outliers detected: cases 9, 15, 16, 17, 20, 31.
#> - Based on the following methods and thresholds: zscore (1.96), iqr (1.5).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#>   Row n_Zscore n_IQR
#> 1  31        2     2
  7. However, they are accessible through attributes if needed.
attributes(x)$outlier_var$zscore
#> $mpg
#>    Row Distance_Zscore
#> 18  18        2.042389
#> 20  20        2.291272
#> 
#> $hp
#>    Row Distance_Zscore
#> 31  31        2.746567
#> 
#> $drat
#>    Row Distance_Zscore
#> 19  19        2.493904
#> 
#> $wt
#>    Row Distance_Zscore
#> 15  15        2.077505
#> 16  16        2.255336
#> 17  17        2.174596
#> 
#> $qsec
#>   Row Distance_Zscore
#> 9   9        2.826755
#> 
#> $carb
#>    Row Distance_Zscore
#> 30  30        1.973440
#> 31  31        3.211677
attributes(x)$outlier_var$iqr
#> $mpg
#>    Row Distance_IQR
#> 20  20            1
#> 
#> $hp
#>    Row Distance_IQR
#> 31  31            1
#> 
#> $wt
#>    Row Distance_IQR
#> 15  15            1
#> 16  16            1
#> 17  17            1
#> 
#> $qsec
#>   Row Distance_IQR
#> 9   9            1
#> 
#> $carb
#>    Row Distance_IQR
#> 31  31            1
  8. The count/frequency table is not printed when a single, multivariate method is selected (since it would be redundant).
performance::check_outliers(data, method = "mahalanobis")
#> 1 outlier detected: case 9.
#> - Based on the following method and threshold: mahalanobis (21.92).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
  9. Multivariate methods are also integrated in the count/frequency table, along with the univariate methods. It does look a bit monstrous when using all the methods at once, but in most situations it should work as expected.
performance::check_outliers(data, method = c(
  "zscore", "zscore_robust", "iqr", "ci", "eti", "hdi", "bci", "mahalanobis",
  "mahalanobis_robust", "mcd", "ics", "optics", "lof"))
#> 3 outliers detected: cases 9, 15, 31.
#> - Based on the following methods and thresholds: zscore (1.96), zscore_robust (1.96), iqr (1.5), ci (0.95), eti (0.95), hdi (0.95), bci (0.95), mahalanobis (21.92), mahalanobis_robust (21.92), mcd (21.92), ics (0.03), optics (22), lof (0.03).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#>    Row n_Zscore n_Zscore_robust n_IQR n_ci n_eti n_bci  n_Mahalanobis
#> 1   31        2               4     2    2     2     3              0
#> 2    9        1               2     1    1     1     1 (Multivariate)
#> 3   15        1               2     1    1     1     2              0
#> 4   16        1               1     1    1     1     2              0
#> 5   18        1               3     0    0     0     0              0
#> 6   19        1               4     0    2     2     3              0
#> 7   20        1               3     1    2     2     2              0
#> 8   30        1               2     0    0     0     0              0
#> 9   28        0               4     0    1     1     1              0
#> 10   3        0               2     0    0     0     0              0
#> 11  26        0               2     0    0     0     0              0
#> 12  29        0               2     0    1     1     1              0
#> 13  32        0               2     0    0     0     0              0
#> 14   4        0               1     0    0     0     0              0
#> 15   8        0               1     0    0     0     0              0
#> 16  21        0               1     0    0     0     0              0
#> 17  27        0               1     0    0     0     1              0
#> 18   7        0               0     0    0     0     0              0
#> 19  24        0               0     0    0     0     0              0
#>    n_Mahalanobis_robust          n_MCD          n_ICS          n_LOF
#> 1        (Multivariate) (Multivariate)              0 (Multivariate)
#> 2        (Multivariate) (Multivariate) (Multivariate)              0
#> 3                     0              0              0 (Multivariate)
#> 4                     0              0              0              0
#> 5                     0              0              0              0
#> 6                     0 (Multivariate)              0              0
#> 7                     0              0              0              0
#> 8                     0 (Multivariate)              0 (Multivariate)
#> 9        (Multivariate) (Multivariate)              0              0
#> 10                    0              0              0              0
#> 11                    0              0              0              0
#> 12       (Multivariate)              0 (Multivariate)              0
#> 13                    0              0              0              0
#> 14                    0              0              0 (Multivariate)
#> 15       (Multivariate) (Multivariate)              0              0
#> 16       (Multivariate) (Multivariate)              0              0
#> 17       (Multivariate) (Multivariate)              0              0
#> 18       (Multivariate) (Multivariate)              0              0
#> 19       (Multivariate) (Multivariate)              0              0
  10. ID is supported for all those methods as well.
performance::check_outliers(data, method = c(
  "zscore", "zscore_robust", "iqr", "ci", "eti", "hdi", "bci", "mahalanobis",
  "mahalanobis_robust", "mcd", "ics", "optics", "lof"), ID = "car")
#> 3 outliers detected: cases 9, 15, 31.
#> - Based on the following methods and thresholds: zscore (1.96), zscore_robust (1.96), iqr (1.5), ci (0.95), eti (0.95), hdi (0.95), bci (0.95), mahalanobis (21.92), mahalanobis_robust (21.92), mcd (21.92), ics (0.03), optics (22), lof (0.03).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#>    Row                 car n_Zscore n_Zscore_robust n_IQR n_ci n_eti n_bci
#> 1   31       Maserati Bora        2               4     2    2     2     3
#> 2    9            Merc 230        1               2     1    1     1     1
#> 3   15  Cadillac Fleetwood        1               2     1    1     1     2
#> 4   16 Lincoln Continental        1               1     1    1     1     2
#> 5   18            Fiat 128        1               3     0    0     0     0
#> 6   19         Honda Civic        1               4     0    2     2     3
#> 7   20      Toyota Corolla        1               3     1    2     2     2
#> 8   30        Ferrari Dino        1               2     0    0     0     0
#> 9   28        Lotus Europa        0               4     0    1     1     1
#> 10   3          Datsun 710        0               2     0    0     0     0
#> 11  26           Fiat X1-9        0               2     0    0     0     0
#> 12  29      Ford Pantera L        0               2     0    1     1     1
#> 13  32          Volvo 142E        0               2     0    0     0     0
#> 14   4      Hornet 4 Drive        0               1     0    0     0     0
#> 15   8           Merc 240D        0               1     0    0     0     0
#> 16  21       Toyota Corona        0               1     0    0     0     0
#> 17  27       Porsche 914-2        0               1     0    0     0     1
#> 18   7          Duster 360        0               0     0    0     0     0
#> 19  24          Camaro Z28        0               0     0    0     0     0
#>     n_Mahalanobis n_Mahalanobis_robust          n_MCD          n_ICS
#> 1               0       (Multivariate) (Multivariate)              0
#> 2  (Multivariate)       (Multivariate) (Multivariate) (Multivariate)
#> 3               0                    0              0              0
#> 4               0                    0              0              0
#> 5               0                    0              0              0
#> 6               0                    0 (Multivariate)              0
#> 7               0                    0              0              0
#> 8               0                    0 (Multivariate)              0
#> 9               0       (Multivariate) (Multivariate)              0
#> 10              0                    0              0              0
#> 11              0                    0              0              0
#> 12              0       (Multivariate)              0 (Multivariate)
#> 13              0                    0              0              0
#> 14              0                    0              0              0
#> 15              0       (Multivariate) (Multivariate)              0
#> 16              0       (Multivariate) (Multivariate)              0
#> 17              0       (Multivariate) (Multivariate)              0
#> 18              0       (Multivariate) (Multivariate)              0
#> 19              0       (Multivariate) (Multivariate)              0
#>             n_LOF
#> 1  (Multivariate)
#> 2               0
#> 3  (Multivariate)
#> 4               0
#> 5               0
#> 6               0
#> 7               0
#> 8  (Multivariate)
#> 9               0
#> 10              0
#> 11              0
#> 12              0
#> 13              0
#> 14 (Multivariate)
#> 15              0
#> 16              0
#> 17              0
#> 18              0
#> 19              0
  1. It still supports models:
model <- lm(disp ~ mpg + hp, data = data)
check_outliers(model)
#> 1 outlier detected: case 31.
#> - Based on the following method and threshold: cook (0.81).
#> - For variable: (Whole model).
  1. And multiple methods including model methods:
check_outliers(model, method = c("cook", "optics", "lof"))
#> 1 outlier detected: case 31.
#> - Based on the following methods and thresholds: cook (0.81), optics (6), lof (0.03).
#> - For variable: (Whole model).
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#>   Row       n_OPTICS          n_LOF         n_Cook
#> 1  31 (Multivariate)              0 (Multivariate)
#> 2   6              0 (Multivariate)              0
#> 3  10              0 (Multivariate)              0
#> 4  11              0 (Multivariate)              0
#> 5  22              0 (Multivariate)              0
#> 6  23              0 (Multivariate)              0
#> 7  28              0 (Multivariate)              0
  1. Bayesian models…
suppressMessages(library(rstanarm))
invisible(capture.output(model <- stan_glm(mpg ~ qsec + wt, data = data)))
check_outliers(model, method = "pareto", threshold = list("pareto" = 0.4))
#> 3 outliers detected: cases 9, 18, 20.
#> - Based on the following method and threshold: pareto (0.4).
#> - For variable: (Whole model).
  1. Multiple methods including Bayesian models:
check_outliers(model, method = c("pareto", "optics", "lof"),
               threshold = list("pareto" = 0.4))
#> 1 outlier detected: case 9.
#> - Based on the following methods and thresholds: pareto (0.4), optics (6), lof (0.03).
#> - For variable: (Whole model).
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#>   Row       n_OPTICS          n_LOF       n_Pareto
#> 1   9 (Multivariate) (Multivariate) (Multivariate)
#> 2  18              0              0 (Multivariate)
#> 3  20              0              0 (Multivariate)
  1. When using grouped data frames, the attributes are stored differently (by group), and the outlier info is printed for each group. Because check_outliers is applied individually to each group, the default was that row numbers reset for each group (e.g., group 1 = 1:50, group 2 = 1:50, etc.). However, I realized this could be confusing (if one wants to use the row numbers for decision making), so I have added an a posteriori correction so that row numbers reflect the original data set instead.
suppressMessages(library("poorman"))
data.group <- iris %>%
  group_by(Species)

z <- check_outliers(data.group, method = c("zscore", "iqr"))
z
#> 13 outliers detected: cases 14, 16, 23, 24, 25, 42, 44, 45, 99, 107, 118, 120, 132.
#> - Based on the following methods and thresholds: zscore (1.96), iqr (1.5).
#> - For variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#> $setosa
#>   Row n_Zscore n_IQR
#> 1  14        2     1
#> 2  16        2     1
#> 
#> $versicolor
#>   Row n_Zscore n_IQR
#> 1  58        2     0
#> 
#> $virginica
#>   Row n_Zscore n_IQR
#> 1 118        2     1
#> 2 132        2     1
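The a posteriori row-number correction mentioned above can be sketched as follows. This is assumed logic for illustration only, not the actual implementation: each group's local row numbers are offset by the total size of the preceding groups.

```r
# Assumed logic, for illustration only: map group-local row numbers back
# to the original data set by adding the sizes of all preceding groups.
sizes <- table(iris$Species)       # 50 setosa, 50 versicolor, 50 virginica
offsets <- cumsum(sizes) - sizes   # setosa = 0, versicolor = 50, virginica = 100
offsets[["versicolor"]] + 8        # group-local row 8 -> original row 58
```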
  1. Specific per-variable info for a given group and method can thus be obtained in the following way:
attributes(z)$outlier_var$versicolor$zscore
#> $Sepal.Length
#>   Row Distance_Zscore
#> 1  51        2.061332
#> 8  58        2.007086
#> 
#> $Sepal.Width
#>    Row Distance_Zscore
#> 11  61        2.453805
#> 36  86        2.007659
#> 
#> $Petal.Length
#>    Row Distance_Zscore
#> 8   58        2.042940
#> 44  94        2.042940
#> 49  99        2.681359
#> 
#> $Petal.Width
#>    Row Distance_Zscore
#> 21  71        2.396933
  1. The outlier data now includes the group, as this is more informative:
head(attributes(z)$data)
#>   Row Species Distance_Zscore Outlier_Zscore Distance_IQR Outlier_IQR Outlier
#> 1   1  setosa        1.000000              0            0           0       0
#> 2   2  setosa        1.129096              0            0           0       0
#> 3   3  setosa        1.000000              0            0           0       0
#> 4   4  setosa        1.151807              0            0           0       0
#> 5   5  setosa        1.000000              0            0           0       0
#> 6   6  setosa        1.461300              0            0           0       0
  1. Single methods are also supported, of course:
check_outliers(data.group, method = "zscore")
#> 25 outliers detected: cases 14, 15, 16, 19, 23, 24, 25, 34, 42, 44, 45, 51, 58, 61, 71, 86, 94, 99, 107, 118, 119, 120, 123, 132, 135.
#> - Based on the following method and threshold: zscore (1.96).
#> - For variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#> $setosa
#>   Row n_Zscore
#> 1  14        2
#> 2  16        2
#> 
#> $versicolor
#>   Row n_Zscore
#> 1  58        2
#> 
#> $virginica
#>   Row n_Zscore
#> 1 118        2
#> 2 132        2
#> 
#> 
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore): 
#> 
#> $setosa
#> $setosa$zscore
#> $setosa$zscore$Sepal.Length
#>    Row Distance_Zscore
#> 14  14        2.002895
#> 15  15        2.252548
#> 16  16        1.968852
#> 19  19        1.968852
#> 
#> $setosa$zscore$Sepal.Width
#>    Row Distance_Zscore
#> 16  16        2.564208
#> 34  34        2.036593
#> 42  42        2.975748
#> 
#> $setosa$zscore$Petal.Length
#>    Row Distance_Zscore
#> 14  14        2.084485
#> 23  23        2.660310
#> 25  25        2.522112
#> 45  45        2.522112
#> 
#> $setosa$zscore$Petal.Width
#>    Row Distance_Zscore
#> 24  24        2.410197
#> 44  44        3.359093
#> 
#> 
#> 
#> $versicolor
#> $versicolor$zscore
#> $versicolor$zscore$Sepal.Length
#>   Row Distance_Zscore
#> 1  51        2.061332
#> 8  58        2.007086
#> 
#> $versicolor$zscore$Sepal.Width
#>    Row Distance_Zscore
#> 11  61        2.453805
#> 36  86        2.007659
#> 
#> $versicolor$zscore$Petal.Length
#>    Row Distance_Zscore
#> 8   58        2.042940
#> 44  94        2.042940
#> 49  99        2.681359
#> 
#> $versicolor$zscore$Petal.Width
#>    Row Distance_Zscore
#> 21  71        2.396933
#> 
#> 
#> 
#> $virginica
#> $virginica$zscore
#> $virginica$zscore$Sepal.Length
#>    Row Distance_Zscore
#> 7  107        2.654591
#> 32 132        2.063284
#> 
#> $virginica$zscore$Sepal.Width
#>    Row Distance_Zscore
#> 18 118        2.561267
#> 20 120        2.400025
#> 32 132        2.561267
#> 
#> $virginica$zscore$Petal.Length
#>    Row Distance_Zscore
#> 18 118        2.080107
#> 19 119        2.442495
#> 23 123        2.080107
#> 
#> $virginica$zscore$Petal.Width
#>    Row Distance_Zscore
#> 35 135        2.279264
  1. BFBayesFactor support:
suppressMessages(library(BayesFactor))
#> Warning in .recacheSubclasses(def@className, def, env): undefined subclass
#> "numericVector" of class "Mnumeric"; definition not updated
output <- regressionBF(rating ~ ., data = attitude, progress=FALSE)

check_outliers(output, threshold = 15)
#> 1 outlier detected: case 18.
#> - Based on the following method and threshold: mahalanobis (15).
#> - For variables: complaints, privileges, learning, raises, critical, advance, rating.
  1. BFBayesFactor, multiple methods:
check_outliers(output, method = c("zscore", "iqr", "mcd"))
#> 5 outliers detected: cases 6, 14, 16, 24, 26.
#> - Based on the following methods and thresholds: zscore (1.96), iqr (1.5), mcd (16.01).
#> - For variables: complaints, privileges, learning, raises, critical, advance, rating.
#> 
#> Note: Outliers were classified as such by at least half of the selected methods. 
#> 
#> -----------------------------------------------------------------------------
#> The following observations were considered outliers for two or more variables 
#> by at least one of the selected methods: 
#> 
#>   Row n_Zscore n_IQR          n_MCD
#> 1  21        2     0              0
#> 2  24        2     0 (Multivariate)
#> 3  26        2     1              0
#> 4  14        1     0 (Multivariate)
#> 5  16        1     0 (Multivariate)
#> 6   9        0     0 (Multivariate)
#> 7  18        0     0 (Multivariate)
  1. gls support:
library(nlme)
fm1 <- gls(follicles ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary,
           correlation = corAR1(form = ~ 1 | Mare))
check_outliers(fm1, method = "zscore_robust", threshold = list(zscore_robust = 2.2))
#> 10 outliers detected: cases 15, 43, 97, 126, 155, 183, 212, 240, 267, 295.
#> - Based on the following method and threshold: zscore_robust (2.2).
#> - For variable: (Whole model).
#> 
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore_robust): 
#> 
#> $`cos(2 * pi * Time)`
#>     Row Distance_Zscore_robust
#> 15   15               2.201852
#> 43   43               2.201852
#> 97   97               2.201852
#> 126 126               2.201852
#> 155 155               2.201852
#> 183 183               2.201852
#> 212 212               2.201852
#> 240 240               2.201852
#> 267 267               2.201852
#> 295 295               2.201852
  1. gls did not support multiple methods before this PR either:
check_outliers(fm1, method = c("zscore_robust", "iqr"))
#> Error in if (!method %in% valid_methods) {: the condition has length > 1
check_outliers(fm1, method = "all")
#> Error in if (!method %in% valid_methods) {: the condition has length > 1
  1. The pareto method does not work properly with gls (in my testing):
check_outliers(fm1, method = "pareto", threshold = list(pareto = 0))
#> Converting missing values (`NA`) into regular values currently not
#>   possible for variables of class 'NULL'.
#> OK: No outliers detected.
#> - Based on the following method and threshold:  ().
#> - For variable: (Whole model)

This is because pareto has a special condition to trigger: `if ("pareto" %in% method & insight::model_info(x)$is_bayesian) {`. And if we check, we see that the fm1 model is not Bayesian:

insight::model_info(fm1)$is_bayesian
#> [1] FALSE

Therefore, I would need some pointers to continue troubleshooting this, for example a correct and fully compatible gls model. This behaviour is not new: the previous version also could not find outliers on this model, regardless of the pareto threshold (even 0).
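In the meantime, one option would be a more explicit guard. This is a hypothetical sketch of a possible fix, not the current implementation: warn and drop "pareto" for non-Bayesian models instead of silently printing an empty method.

```r
# Hypothetical guard (not the actual code): skip "pareto" with a warning
# for non-Bayesian models instead of silently producing empty output.
method <- c("pareto")
if ("pareto" %in% method && !isTRUE(insight::model_info(fm1)$is_bayesian)) {
  warning("Method 'pareto' requires a Bayesian model; skipping it.")
  method <- setdiff(method, "pareto")
}
```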

Created on 2022-08-12 by the reprex package (v2.0.1)

rempsyc avatar Aug 12 '22 18:08 rempsyc

Comments

    • [x] ~~I initially wrapped the coloured messages in insight::format_message since they can get pretty long with many methods. However, it seemed to give strange line-break behaviours, such as jumping to a new line when less than half of my console width was used for the total printed string, and when my console was wider than 3/4 of my screen. So I ended up removing it. Demo:~~ [MOVED TO easystats/insight#610]
    • [x] ~~When programming with datawizard::data_filter, it does not allow me to use character strings for the logical expression of the filter argument. I could not make it work with deparse(substitute()), double curly brackets {{ }}, or the bang bang operator !!. So I had to resort to base R instead. From the documentation I saw that data_filter is based on subset, which itself has the following warning: > This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.~~ [MOVED TO easystats/datawizard#216]
    • [x] I was getting a warning with method = lof. I have thus adapted the function to use the minPts argument instead (and checked that the results were the same; they are).
#> In dbscan::lof(x, k = ncol(x) - 1) :
#>  lof: k is now deprecated. use minPts = 11 instead .
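For reference, here is the adapted call as I understand it, assuming (per the deprecation warning) that minPts corresponds to the old k + 1:

```r
# Sketch of the adaptation (assumption: minPts = old k + 1, as the warning
# suggests). For mtcars, k = ncol(x) - 1 = 10, so minPts = 11 = ncol(x).
x <- scale(mtcars)
# old (deprecated): dbscan::lof(x, k = ncol(x) - 1)
lof_scores <- dbscan::lof(x, minPts = ncol(x))
```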
    • [x] There was a single ci threshold for all four CI methods (ci, hdi, eti, and bci). This created some problems down the line, so I gave each its own threshold.
    • [x] I was not too sure why the thresholds attribute inherits all possible thresholds rather than only those of the respective methods, so I have changed it to include only the selected methods.
    • [x] Similarly, model attributes inherit from both the cook and pareto methods, but only one is used (depending on whether the model is Bayesian). So I changed it so that the inherited model attribute is also the one actually selected.
    • [x] In the attributes, all the data_method entries were named in lower case, whereas ICS was in capitals. To make it consistent with the rest of the methods, I changed it to lower case; I find it easier to program and debug when everything is consistent.
    • [x] ~~For the IQR and CI methods, an average is calculated for the distance (e.g., Distance_IQR), whereas the max is used for the zscore. I wonder if that is considered inconsistent, or if it is intentional because the IQR scores seem to be only 0 or 1 (so is it more informative to have the average than just 1?). In any case, I find it a bit strange since people might expect the same logic to be applied across the different methods. And using the average will be affected by the number of columns.~~ [MOVED TO #467]
    • [x] ~~There are discrepancies within the method thresholds in different places within the code. So for example, the iqr threshold is set to 1.96 in thresholds, but to 1.5 per default in .check_outliers_iqr. What is also strange is that it seems like the iqr values don’t get bigger than 1 anyway?~~ [MOVED TO #467]
    • [ ] For method = “ics”, there is a tryCatch call at some point. I don’t know much about this, but I thought tryCatch was used interactively when debugging and usually wasn’t a normal part of mature functions. Does that mean this bit of code is a vestige of unfinished debugging?
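To make the question concrete, here is the generic pattern I mean (a minimal, self-contained illustration, unrelated to the actual ics code): tryCatch wraps a computation and handles conditions, e.g., recovering with a fallback value instead of stopping execution.

```r
# Generic illustration of tryCatch (not the actual ics code): recover from
# an error with a message and a fallback value instead of stopping.
safe_log <- function(x) {
  tryCatch(
    log(x),
    error = function(e) {
      message("log() failed: ", conditionMessage(e))
      NA_real_
    }
  )
}
safe_log("a")  # returns NA with a message, instead of a hard error
```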
    • [x] ~~I haven’t touched method = “iforest” because it was commented. I tried uncommenting it to see if I could figure out how to solve the issue but I wasn’t easily able to, so I commented it again.~~ [MOVED TO #470]
    • [x] ~~For method = “lof”, there is the following note: # TODO: use tukey_mc from bigutilsr package.~~ [MOVED TO #469]
    • [x] ~~For method = “optics”, there is the following note: # TODO: find automatic way of setting 'xi'~~ [MOVED TO #468]
    • [x] ~~With method = optics, I am not able to detect outliers on the mtcars dataset with the default threshold (2 * ncol(x) = 22), whereas increasing the threshold beyond the default results in a warning (this might be related to the previous point).~~ [MOVED TO #468]
    • [x] ~~I think the default thresholds for zscore methods (threshold = 1.96) are conservative. I wonder whether we should follow the more conventional 3 deviations instead. I don’t know about the appropriateness of the thresholds of the other methods however.~~ [MOVED TO #467]
    • [x] In the documentation, the ellipsis is described as below. However, .check_outliers_mahalanobis also seems to make use of the ellipsis, so I have updated the documentation accordingly:

    When method = "ics", further arguments in ... are passed down to ICSOutlier::ics.outlier().

    • [x] It is not currently possible to include ID names when feeding a model, because the data extracted from the model does not contain the full original data frame with that information. In fact, attempting to add the ID argument when feeding a model currently throws an error, and I’m not too sure how to solve it because the argument is passed through the ellipsis. My attempts at transforming the ellipsis into a list and feeding it back to the respective function have so far failed. However, objects of class “model” are currently not compatible with method = ics or method = mahalanobis (alongside either cook or pareto), so I have simply removed the ellipsis for objects of class ‘model’.
#> check_outliers(model, method = c("ics", "cook"))
#>   "check_outliers()" does not support models of class "data.frame".
#>   Error in .check_outliers_ics(x, threshold = thresholds$ics) :
#>     trying to get slot "ics.dist.cutoff" from an object of a basic class ("NULL") with no slots 
#> 
#>   check_outliers(model, method = c("mahalanobis", "cook"))
#>   Error in solve.default(cov, ...) :
#>     Lapack routine dgesv: system is exactly singular: U[1,1] = 0

So I had it throw a warning if people attempt it anyway:

#> Warning message:
#>     In check_outliers.default(model, method = 'cook', ID = ID.names) :
#>     ID argument not supported with objects of class 'model'
    • [x] One test is failing (there is only one test), but I can’t figure out why, since both seem to correctly output the same error message. [fixed by strengejacke]
    • [x] ~~Colleagues I have convinced to use check_outliers have wondered why mahalanobis never finds any outliers when there is missing data. It seems that in the presence of NA values, something goes wrong, even with the strictest of thresholds. So it turns out that colleagues might have erroneously reported no outliers just because they had a single missing value, since there is no appropriate warning to this effect. This outcome is not new; the cause seems to lie within the base R mahalanobis function. How would you suggest addressing this? Should we throw a warning and use na.omit (or a variant thereof), or just throw an error and ask people to deal with it beforehand? The latter seems to be how the other multivariate methods “deal” with it, so that’s what I did for now.~~ [MOVED TO #466]
    • [x] ~~When using datawizard select helpers within functions, it triggers R CMD check warnings e.g., check_outliers.data.frame: no visible global function definition for ‘contains’ because they’re not functions per se (I don’t know what they are actually). What’s the proper way to fix this warning? I could not find a single instance of the select helpers in all of performance’s functions.~~ [MOVED TO easystats/datawizard#218]
  1. There are other warnings, but I will wait for your feedback before we discuss them, because this is getting pretty long.

Created on 2022-08-12 by the reprex package (v2.0.1)

rempsyc avatar Aug 12 '22 18:08 rempsyc

Ok, I tried moving as many points as possible to their respective issues, but please let me know if you think some of the other points deserve their own issue, and I will transfer them. I have also added check-boxes to distinguish simple notes or explanations of what I have changed from items that possibly require external input. I hope this new organization makes it more digestible and easier to manage.

rempsyc avatar Aug 13 '22 22:08 rempsyc

Thanks a lot, that's really impressive! I'll try to look at this PR asap.

strengejacke avatar Aug 13 '22 22:08 strengejacke

This is a lot of detail and work! Wow! If anyone wants me to look at something, can you @ me with a pointer to the spot?

bwiernik avatar Aug 14 '22 00:08 bwiernik

There is a test failing (there is only one test) but I can’t figure out why since they seem to be correctly outputting the same error message.

The message is formatted using insight::format_message(). The test environment has a certain line length, so the output being tested against likely spans multiple lines. I shortened the string; it should work now.

strengejacke avatar Aug 24 '22 07:08 strengejacke

I think the warnings need to be addressed, in particular the usage of data_filter() probably needs to be replaced by base R, since we don't want to define global variables.

strengejacke avatar Aug 24 '22 07:08 strengejacke