Suggestion of new function: `describe_missing()`
Fixes #454
Thanks for the feedback and comments! We can definitely rename the columns for clarity, e.g., use `missing_` instead of `na_`, along with the other suggestions (I initially chose `na` to keep the column names short so the whole output could fit on my rather narrow console). I can also add a new column, `complete_rate`, to mirror `skim()`. Otherwise, `skim()` and `describe_missing()` have the same overall structure (variables in the first column and aggregate stats in the other columns).
> the default output looks unexpected to me (I'd rather expect one row per variable).
There is one row per variable / scale, but each variable / scale can be defined by multiple items / columns, and so the output has to be able to accommodate that (the current strategy is to use the : indicator to show which variables each row includes).
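To make that aggregation concrete, here is a minimal base R sketch of how one row can summarise several columns. It is not the code in this PR: `summarise_scale()` is a hypothetical helper, the toy data are made up, and matching columns by name prefix is an assumption made purely for illustration.

```r
# Hypothetical helper (not part of datawizard): summarise missingness
# across all columns whose names start with a given prefix.
summarise_scale <- function(data, prefix) {
  cols <- names(data)[startsWith(names(data), prefix)]
  n_missing <- sum(is.na(data[cols]))
  cells <- nrow(data) * length(cols)
  data.frame(
    # Label the row "first:last" when it spans more than one column
    variable = if (length(cols) > 1) {
      paste0(cols[1], ":", cols[length(cols)])
    } else {
      cols
    },
    n_columns = length(cols),
    n_missing = n_missing,
    cells = cells,
    missing_percent = round(100 * n_missing / cells, 2)
  )
}

# Made-up toy data: one scale measured by three items
toy <- data.frame(
  openness_1 = c(1, NA, 3, 4),
  openness_2 = c(NA, NA, 2, 5),
  openness_3 = c(1, 2, 3, NA)
)
scale_summary <- summarise_scale(toy, "openness")
scale_summary
```

The denominator here is the number of cells (rows times columns in the scale), which is why a multi-column row needs both `n_columns` and `cells` to be interpretable.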
But if I understand correctly, you would like that the default, instead of reporting for all columns as an aggregate (i.e., always exactly 1 row), would report one row per column, for all columns. Although for large datasets this would create a long output, that could work.
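As a rough illustration of that "one row per column" default, the following base R sketch computes the same kind of summary by hand. `missing_by_column()` is a hypothetical name, and this is only an approximation of the idea, not the function's actual implementation.

```r
# Base R sketch (hypothetical helper): one row per column, plus a "Total" row.
missing_by_column <- function(data) {
  n_missing <- colSums(is.na(data))
  out <- data.frame(
    variable = names(data),
    n_missing = unname(n_missing),
    missing_percent = round(100 * unname(n_missing) / nrow(data), 2)
  )
  # The total is computed over all cells, not as a mean of the column percentages
  total <- data.frame(
    variable = "Total",
    n_missing = sum(n_missing),
    missing_percent = round(100 * sum(n_missing) / (nrow(data) * ncol(data)), 2)
  )
  out <- rbind(out, total)
  out$complete_percent <- 100 - out$missing_percent
  out
}

# `airquality` ships with base R (datasets package)
per_column <- missing_by_column(airquality[c("Ozone", "Solar.R", "Wind", "Temp")])
per_column
```

The output grows with the number of columns, which is the trade-off mentioned above for large datasets.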
OK, so I changed the default so that when no scales or variables are specified, every column is reported on its own row.
However, this behaviour is overridden if scales or variables are specified:
```r
library(datawizard)

# Use the entire data frame
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
describe_missing(df)
#>           variable n_columns n_missing cells missing_percent complete_percent
#> 1               ID         1         7    14           50.00            50.00
#> 2       openness_1         1         4    14           28.57            71.43
#> 3       openness_2         1         4    14           28.57            71.43
#> 4       openness_3         1         3    14           21.43            78.57
#> 5   extroversion_1         1         6    14           42.86            57.14
#> 6   extroversion_2         1         6    14           42.86            57.14
#> 7   extroversion_3         1         5    14           35.71            64.29
#> 8  agreeableness_1         1         3    14           21.43            78.57
#> 9  agreeableness_2         1         4    14           28.57            71.43
#> 10 agreeableness_3         1         3    14           21.43            78.57
#> 11           Total        10        45   140           32.14            67.86
#>    missing_max missing_max_percent all_missing
#> 1            1                 100           7
#> 2            1                 100           4
#> 3            1                 100           4
#> 4            1                 100           3
#> 5            1                 100           6
#> 6            1                 100           6
#> 7            1                 100           5
#> 8            1                 100           3
#> 9            1                 100           4
#> 10           1                 100           3
#> 11          10                 100           2

# If the questionnaire items start with the same name,
# one can list the scale names directly:
describe_missing(df, scales = c("ID", "openness", "extroversion", "agreeableness"))
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2

# Otherwise you can provide nested columns manually:
describe_missing(df,
  select = list(
    c("ID"),
    c("openness_1", "openness_2", "openness_3"),
    c("extroversion_1", "extroversion_2", "extroversion_3"),
    c("agreeableness_1", "agreeableness_2", "agreeableness_3")
  )
)
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2
```
Created on 2024-12-16 with reprex v2.1.1
I feel like most unresolved comments and questions regarding the documentation and the implementation are related to the scope of this function. I'd rather have a "generalist" function à la skimr rather than something specialized for psychology that I think could live in the rempsyc package.
@easystats/core-team what do you think? Are you interested in having some of those field-specific features in this function?
I tend to agree. This function should be more general purpose, and maybe a psych-centric wrapper can be housed in @rempsyc's package (I also just now noticed your handle is the name of the package 😅)
Codecov Report
Attention: Patch coverage is 95.00000% with 2 lines in your changes missing coverage. Please review.
Project coverage is 91.25%. Comparing base (81dd0e0) to head (0e83588). Report is 8 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| R/describe_missing.R | 95.00% | 2 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   91.14%   91.25%   +0.11%
==========================================
  Files          76       77       +1
  Lines        6045     6144      +99
==========================================
+ Hits         5510     5607      +97
- Misses        535      537       +2
```
If I understand correctly, the main outstanding issue is what to do with the `scales` argument. I would indeed remove it (soz Rémi ^^) and replace it with a `by` argument, as in our other functions. If users want to compute the amount of missing data per dimension, they should take a more traditional approach: first pivot to longer, then run `describe_missing(select = "item", by = "dimension")`. Otherwise, I'm afraid it gets messy if we have a bespoke `scales` argument only for this function.
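For readers unfamiliar with the "pivot longer, then group" workflow, here is a minimal base R sketch of the idea. It deliberately avoids datawizard: `stack()` stands in for the pivot step, the toy data are made up, and the `dimension_item` naming scheme is an assumption carried over from the examples in this thread.

```r
# Made-up wide data: two dimensions with two items each
wide <- data.frame(
  openness_1 = c(1, NA, 3),
  openness_2 = c(NA, NA, 2),
  extroversion_1 = c(4, 5, NA),
  extroversion_2 = c(1, 2, 3)
)

# Pivot to long format: stack() yields the columns `values` and `ind`
long <- stack(wide)

# Split names such as "openness_1" into dimension ("openness") and item ("1")
long$dimension <- sub("_[^_]*$", "", long$ind)
long$item <- sub("^.*_", "", long$ind)

# Count missing values and cells per dimension (groups are sorted by name)
n_missing <- tapply(is.na(long$values), long$dimension, sum)
n_cells <- tapply(long$values, long$dimension, length)

by_dimension <- data.frame(
  dimension = names(n_missing),
  n_missing = as.vector(n_missing),
  missing_percent = round(100 * as.vector(n_missing / n_cells), 2)
)
by_dimension
```

Once the data are long, the grouping variable is just another column, so no function-specific `scales` argument is needed.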
Alright, in this case, I think I can introduce select, exclude, and by and make it more consistent with the rest of datawizard 🤓
Alright, this is a much simplified version, which now also supports `by`. This is what I have so far:
```r
library(datawizard)

describe_missing(airquality, select = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Ozone        37           24.18            75.82
#> 2  Solar.R         7            4.58            95.42
#> 3     Wind         0            0.00           100.00
#> 4     Temp         0            0.00           100.00
#> 5    Total        44            7.19            92.81

describe_missing(airquality, exclude = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Month         0               0              100
#> 2      Day         0               0              100
#> 3    Total         0               0              100

# Testing the 'by' argument for survey scales
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
df_long <- reshape_longer(
  df,
  select = -1,
  names_sep = "_",
  names_to = c("dimension", "item")
)
describe_missing(df_long,
  select = -c(1, 3),
  by = "dimension"
)
#>        variable n_missing missing_percent complete_percent
#> 1 agreeableness        10           23.81            76.19
#> 2  extroversion        17           40.48            59.52
#> 3      openness        11           26.19            73.81
#> 4         Total        38           15.08            84.92
```
Created on 2024-12-19 with reprex v2.1.1
Anything else you'd find desirable in the function?