
Suggestion of new function: `describe_missing()`

Open rempsyc opened this issue 1 year ago • 8 comments

Fixes #454

rempsyc avatar Nov 11 '24 11:11 rempsyc

Thanks for the feedback and comments! We can definitely rename the columns for more clarity, e.g., use `missing_` instead of `na_`, along with the other suggestions (I initially chose `na` to keep the column names short so the whole output could fit on my rather narrow console). I can also add a new column, `complete_rate`, to mirror `skim()`. Otherwise, `skim()` and `describe_missing()` have the same overall structure (variables in the first column and aggregate statistics in the other columns).

> the default output looks unexpected to me (I'd rather expect one row per variable).

There is one row per variable / scale, but each variable / scale can be defined by multiple items / columns, and so the output has to be able to accommodate that (the current strategy is to use the : indicator to show which variables each row includes).

But if I understand correctly, you would like the default to report one row per column, for all columns, instead of a single aggregate over all columns (i.e., always exactly 1 row). Although this would create a long output for large datasets, that could work.
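To make the two reporting modes concrete, here is a minimal base R sketch of "one row per column" versus "one aggregate row". It is only an illustration under assumptions: the toy data frame and the object names `per_column` and `aggregate_row` are hypothetical, and this is not the implementation in the PR.

```r
# Toy data (hypothetical): two items with a few missing values
df <- data.frame(
  item_1 = c(1, NA, 3, NA),
  item_2 = c(NA, 2, 3, 4)
)

# Logical matrix marking every missing cell
na_matrix <- is.na(df)

# Mode 1: one row per column
per_column <- data.frame(
  variable = names(df),
  n_missing = unname(colSums(na_matrix)),
  missing_percent = unname(round(100 * colMeans(na_matrix), 2))
)

# Mode 2: a single aggregate row over all columns,
# labelled with the "first:last" convention
aggregate_row <- data.frame(
  variable = paste(names(df)[1], names(df)[ncol(df)], sep = ":"),
  n_missing = sum(na_matrix),
  missing_percent = round(100 * mean(na_matrix), 2)
)

print(per_column)
print(aggregate_row)
```

The per-column table grows with the number of columns, which is the "long output" trade-off mentioned above, while the aggregate always stays at one row.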

rempsyc avatar Dec 17 '24 02:12 rempsyc

OK, so I changed the default: when no scales or variables are specified, every column is reported on its own row.

However, this behaviour is overridden if scales or variables are specified:

```r
library(datawizard)

# Use the entire data frame
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
describe_missing(df)
#>           variable n_columns n_missing cells missing_percent complete_percent
#> 1               ID         1         7    14           50.00            50.00
#> 2       openness_1         1         4    14           28.57            71.43
#> 3       openness_2         1         4    14           28.57            71.43
#> 4       openness_3         1         3    14           21.43            78.57
#> 5   extroversion_1         1         6    14           42.86            57.14
#> 6   extroversion_2         1         6    14           42.86            57.14
#> 7   extroversion_3         1         5    14           35.71            64.29
#> 8  agreeableness_1         1         3    14           21.43            78.57
#> 9  agreeableness_2         1         4    14           28.57            71.43
#> 10 agreeableness_3         1         3    14           21.43            78.57
#> 11           Total        10        45   140           32.14            67.86
#>    missing_max missing_max_percent all_missing
#> 1            1                 100           7
#> 2            1                 100           4
#> 3            1                 100           4
#> 4            1                 100           3
#> 5            1                 100           6
#> 6            1                 100           6
#> 7            1                 100           5
#> 8            1                 100           3
#> 9            1                 100           4
#> 10           1                 100           3
#> 11          10                 100           2

# If the questionnaire items start with the same name,
# one can list the scale names directly:
describe_missing(df, scales = c("ID", "openness", "extroversion", "agreeableness"))
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2

# Otherwise you can provide nested columns manually:
describe_missing(df,
                 select = list(
                   c("ID"),
                   c("openness_1", "openness_2", "openness_3"),
                   c("extroversion_1", "extroversion_2", "extroversion_3"),
                   c("agreeableness_1", "agreeableness_2", "agreeableness_3")
                 )
)
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2
```

Created on 2024-12-16 with reprex v2.1.1
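For readers wondering how a `scales`-style argument could map name prefixes to groups of columns, the following base R sketch shows one possible approach. The helper `missing_by_prefix()` and the toy data are hypothetical and exist only for illustration; the actual matching logic in the PR may differ.

```r
# Hypothetical helper, not part of datawizard:
# aggregate missingness over all columns sharing a name prefix
missing_by_prefix <- function(data, prefixes) {
  rows <- lapply(prefixes, function(prefix) {
    cols <- names(data)[startsWith(names(data), prefix)]
    block <- is.na(data[cols])  # logical matrix: rows x matched columns
    data.frame(
      # Label multi-column groups as "first:last", single columns by name
      variable = if (length(cols) > 1) {
        paste(cols[1], cols[length(cols)], sep = ":")
      } else {
        cols
      },
      n_columns = length(cols),
      n_missing = sum(block),
      cells = length(block),
      missing_percent = round(100 * mean(block), 2)
    )
  })
  do.call(rbind, rows)
}

# Toy data (hypothetical)
toy <- data.frame(
  ID = c("a", NA, "c"),
  open_1 = c(1, NA, 3),
  open_2 = c(NA, NA, 3),
  extra_1 = c(1, 2, 3)
)

res <- missing_by_prefix(toy, c("ID", "open", "extra"))
print(res)
```

This sketch assumes every prefix matches at least one column; a prefix with no match would need explicit handling. Prefix matching is also ambiguous when one scale name is a prefix of another, which is one reason an explicit list of columns can be safer.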

rempsyc avatar Dec 17 '24 03:12 rempsyc

I feel like most unresolved comments and questions regarding the documentation and the implementation are related to the scope of this function. I'd prefer a "generalist" function à la skimr over something specialized for psychology, which I think could live in the rempsyc package.

@easystats/core-team what do you think? Are you interested in having some of those field-specific features in this function?

etiennebacher avatar Dec 17 '24 15:12 etiennebacher

I tend to agree. This function should be more general purpose - and maybe a psych-centric wrapper can be housed in @rempsyc's package (I also just now noticed your handle is the name of the package 😅)

mattansb avatar Dec 18 '24 06:12 mattansb

Codecov Report

Attention: Patch coverage is 95.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.25%. Comparing base (81dd0e0) to head (0e83588). Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| R/describe_missing.R | 95.00% | 2 Missing |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   91.14%   91.25%   +0.11%     
==========================================
  Files          76       77       +1     
  Lines        6045     6144      +99     
==========================================
+ Hits         5510     5607      +97     
- Misses        535      537       +2     
```


codecov[bot] avatar Dec 18 '24 15:12 codecov[bot]

If I understand correctly, the main outstanding issue is what to do with the `scales` argument. I would indeed remove it (soz Rémi ^^) and replace it with a `by` argument, as in our other functions. If users want to compute the amount of missing data per dimension, they should use a more traditional approach: first pivot to longer, then run `describe_missing(select = "item", by = "dimension")`. Otherwise, I'm afraid it gets messy if we have a bespoke `scales` argument only for this function.
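As a rough illustration of the "pivot to longer, then group" idea, here is a base R sketch that does not rely on datawizard at all. The toy data and all object names are hypothetical; it only shows the underlying logic that a `by` argument would wrap.

```r
# Toy wide data (hypothetical): items named <dimension>_<item>
wide <- data.frame(
  open_1 = c(1, NA, 3),
  open_2 = c(NA, NA, 3),
  extra_1 = c(1, 2, NA)
)

# Pivot to long format: one row per cell, with the source column recorded
long <- stack(wide)
names(long) <- c("value", "column")

# Split "open_1" into dimension "open" and item "1"
long$dimension <- sub("_.*$", "", long$column)
long$item <- sub("^.*_", "", long$column)

# Missing cells per dimension (groups are returned in alphabetical order)
n_missing <- tapply(is.na(long$value), long$dimension, sum)
missing_percent <- round(
  100 * tapply(is.na(long$value), long$dimension, mean), 2
)

print(n_missing)
print(missing_percent)
```

Once the data are in long format, grouping by any other column (wave, condition, respondent group) works the same way, which is the argument for a generic `by` over a bespoke `scales`.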

DominiqueMakowski avatar Dec 18 '24 15:12 DominiqueMakowski

Alright, in this case, I think I can introduce select, exclude, and by and make it more consistent with the rest of datawizard 🤓

rempsyc avatar Dec 18 '24 15:12 rempsyc

Alright, this is a much simplified version, which now also supports `by`. This is what I have so far:

```r
library(datawizard)

describe_missing(airquality, select = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Ozone        37           24.18            75.82
#> 2  Solar.R         7            4.58            95.42
#> 3     Wind         0            0.00           100.00
#> 4     Temp         0            0.00           100.00
#> 5    Total        44            7.19            92.81

describe_missing(airquality, exclude = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Month         0               0              100
#> 2      Day         0               0              100
#> 3    Total         0               0              100

# Testing the 'by' argument for survey scales
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)

df_long <- reshape_longer(
  df,
  select = -1,
  names_sep = "_",
  names_to = c("dimension", "item")
)

describe_missing(df_long,
                 select = -c(1, 3),
                 by = "dimension")
#>        variable n_missing missing_percent complete_percent
#> 1 agreeableness        10           23.81            76.19
#> 2  extroversion        17           40.48            59.52
#> 3      openness        11           26.19            73.81
#> 4         Total        38           15.08            84.92
```

Created on 2024-12-19 with reprex v2.1.1
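As a sanity check on the `airquality` example, the same counts and percentages can be computed with base R only. This is an independent cross-check, not the function's implementation; `cols`, `na_matrix`, and `check` are names chosen here for illustration.

```r
# Columns covered by select = "Ozone:Temp" in the example above
cols <- c("Ozone", "Solar.R", "Wind", "Temp")
na_matrix <- is.na(airquality[cols])

# Per-column rows plus a "Total" row over all selected cells
check <- data.frame(
  variable = c(cols, "Total"),
  n_missing = c(unname(colSums(na_matrix)), sum(na_matrix)),
  missing_percent = round(
    100 * c(unname(colMeans(na_matrix)), mean(na_matrix)), 2
  )
)
check$complete_percent <- 100 - check$missing_percent

print(check)
```

Here the "Total" percentage is taken over all selected cells (rows times columns), not as an average of the per-column percentages.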

Anything else you'd find desirable in the function?

rempsyc avatar Dec 19 '24 19:12 rempsyc