
ggcorrmat on steroids

rcannood opened this issue on May 18, 2022

Hi @IndrajeetPatil !

Thank you for creating ggstatsplot and statsExpressions! :relaxed: I love having a consistent API for running different types of tests.


I'd like to get your opinion on an approach I've been using. I wanted to compute a ggcorrmat(), but my data frame contains not only numerical but also categorical and boolean columns.

I've been using the following code to run a set of tests and get a general idea of which features have "some relation" to which other features:

library(tidyverse)
library(statsExpressions)

# the data that I'll be using for this example
data <- mtcars %>% mutate(
  brand = gsub(" .*", "", rownames(mtcars)),
  vs = as.logical(vs),
  am = as.logical(am)
)

# annotation of the features
feature_ann <- tibble(
  feature = colnames(data),
  type = case_when(
    map_lgl(data, is.numeric) ~ "numerical",
    map_lgl(data, function(x) length(unique(na.omit(x))) == 2) ~ "logical",
    TRUE ~ "categorical"
  )
)

# which statsExpressions functions to use for which comparisons
stat_funs <- tribble(
  ~ xtype, ~ ytype, ~ statfun,
  "numerical", "numerical", function(df, x, y) corr_test(df, !!x, !!y, type = "np"),
  "numerical", "categorical", function(df, x, y) oneway_anova(df, !!y, !!x, type = "np"),
  "numerical", "logical", function(df, x, y) two_sample_test(df, !!y, !!x, type = "np"),
  "categorical", "numerical", function(df, x, y) oneway_anova(df, !!x, !!y, type = "np"),
  "categorical", "categorical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
  "categorical", "logical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
  "logical", "numerical", function(df, x, y) two_sample_test(df, !!x, !!y, type = "np"),
  "logical", "categorical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
  "logical", "logical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np")
)

# all pairwise feature combinations, annotated with their types and the test function to use
feature_crossing <- crossing(
  x = feature_ann$feature,
  y = feature_ann$feature
) %>%
  left_join(feature_ann %>% select(x = feature, xtype = type), by = "x") %>%
  left_join(feature_ann %>% select(y = feature, ytype = type), by = "y") %>%
  left_join(stat_funs, by = c("xtype", "ytype"))

# the test results
stats_df <-
  feature_crossing %>%
  filter(x != y) %>%
  pmap_df(function(x, y, xtype, ytype, statfun) {
    odn <- data %>% select(!!x, !!y) %>% na.omit()
    # skip the test if either variable is constant on the overlapping (non-missing) data
    if (length(unique(odn[[x]])) <= 1 || length(unique(odn[[y]])) <= 1) return(NULL)
    statfun(data, x, y) %>%
      mutate(parameter1 = x, parameter2 = y, p1type = xtype, p2type = ytype) %>%
      select(parameter1:p2type, everything())
  }) %>%
  mutate(p.value = p.adjust(p.value, method = "holm"))

stats_df %>% select(parameter1:estimate) %>% head(10)

Head of stats_df:

| parameter1 | parameter2 | p1type  | p2type      | statistic   | df | p.value   | method                     | effectsize        | estimate   |
|------------|------------|---------|-------------|-------------|----|-----------|----------------------------|-------------------|------------|
| am         | brand      | logical | categorical | 29.9271255  | 21 | 1.0000000 | Pearson's Chi-squared test | Cramer's V (adj.) | 0.5161364  |
| am         | carb       | logical | numerical   | 132.5000000 | NA | 1.0000000 | Wilcoxon rank sum test     | r (rank biserial) | 0.0728745  |
| am         | cyl        | logical | numerical   | 194.0000000 | NA | 0.2027903 | Wilcoxon rank sum test     | r (rank biserial) | 0.5708502  |
| am         | disp       | logical | numerical   | 214.0000000 | NA | 0.0384542 | Wilcoxon rank sum test     | r (rank biserial) | 0.7327935  |
| am         | drat       | logical | numerical   | 24.0000000  | NA | 0.0111300 | Wilcoxon rank sum test     | r (rank biserial) | -0.8056680 |
| am         | gear       | logical | numerical   | 16.0000000  | NA | 0.0007906 | Wilcoxon rank sum test     | r (rank biserial) | -0.8704453 |
| am         | hp         | logical | numerical   | 176.0000000 | NA | 1.0000000 | Wilcoxon rank sum test     | r (rank biserial) | 0.4251012  |
| am         | mpg        | logical | numerical   | 42.0000000  | NA | 0.1085407 | Wilcoxon rank sum test     | r (rank biserial) | -0.6599190 |
| am         | qsec       | logical | numerical   | 153.0000000 | NA | 1.0000000 | Wilcoxon rank sum test     | r (rank biserial) | 0.2388664  |
| am         | vs         | logical | logical     | 0.9068826   | 1  | 1.0000000 | Pearson's Chi-squared test | Cramer's V (adj.) | 0.0000000  |
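
As a quick way to read this table, something along these lines could list the significant pairs, strongest first (just a sketch: the 0.05 cut-off on the Holm-adjusted p-values is an arbitrary choice, and the effect sizes come from different tests, so the ordering is only a rough one):

# significant pairs, ordered by absolute effect size
# (note: effect sizes from different tests are not strictly comparable)
stats_df %>%
  filter(p.value < 0.05) %>%
  arrange(desc(abs(estimate))) %>%
  select(parameter1, parameter2, method, estimate, p.value)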

When I plot this stats_df, I get something like the attached plot.
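
In case it helps the discussion, here's a minimal sketch of how such a heatmap could be built directly from stats_df with ggplot2 (continuing the script above; this is only one possible layout and not necessarily how the attached plot was made, and putting different effect-size measures on a single colour scale is a simplification):

stats_df %>%
  mutate(significant = p.value < 0.05) %>%
  ggplot(aes(x = parameter1, y = parameter2, fill = abs(estimate))) +
  geom_tile(colour = "white") +
  # mark tiles whose Holm-adjusted p-value is not significant with a cross
  geom_point(
    data = ~ filter(.x, !significant),
    shape = 4, size = 3, colour = "grey30"
  ) +
  scale_fill_viridis_c(limits = c(0, 1), name = "|effect size|") +
  coord_fixed() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))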

Do you think there is some merit to this approach? Do you see some obvious flaws?

Kind regards, Robrecht
