tidystats

Open WillemSleegers opened this issue 6 years ago • 2 comments

Hey everyone,

I recently tweeted about my package called tidystats (see https://twitter.com/willemsleegers/status/1007249862268719104). As a follow-up, I would like to briefly discuss my idea here and see how it fits with papaja.

tidystats focuses on two problems: insufficient statistical reporting and incorrect statistical reporting. Academics (in psychological science, at least) often report the output of their statistical tests only in their manuscript. The Results section, besides being the place to report your results, is also a narrative aimed at a human reader with limited time and motivation. As a result, there is a conflict between reporting all interesting results and providing a streamlined story. This may lead to results going unreported that would actually be useful. Also, if you want to do something with the results, it is very annoying to have to extract them from a PDF.

The second problem is that many mistakes are made when transferring the output of statistical tests to the manuscript. This is solved by something like R Markdown and papaja (yay), although I have to admit that I find R Markdown quite cumbersome to use and that it limits collaboration with others.

I created tidystats to address both issues. With tidystats, you combine the output of multiple statistical tests into a single .csv file that can accompany your manuscript. So rather than having to worry about whether or not to include a potentially interesting statistic, you can simply have it in your tidystats .csv file. Additionally, this structured data file containing all of the statistical output makes it easy to create report functions that can be used in R Markdown. More interestingly, perhaps, I have also created a Shiny app that works inside RStudio and uses these report functions under the hood. You run the inspect() function, give it the tidystats list of results, and it shows an overview of the results in the Viewer pane of RStudio. Clicking on a piece of output creates a line of APA-formatted output that you can copy and paste into the manuscript.

For more information, see the README at https://github.com/WillemSleegers/tidystats.

I think the report functions are clearly similar to what papaja is doing, but tidystats also has the combining of statistics to save them outside of the manuscript, which is a significant difference. I'm very curious to see where there is potential for collaboration.

WillemSleegers, Jun 25 '18 15:06

Hi Willem,

Sorry for taking so long to get back to you. I think we can disagree about what the best way to report statistics is (personally, I prefer R Markdown to error-prone copy-and-paste work, but I think this is a matter of what works best for one's workflow). Regardless, I think there is room for joint efforts in formatting the output of statistical tests.

If I understand correctly, all statistics in tidystats get converted into a long-format data.frame and from there are selected by an identifier and turned into reportable text strings. papaja currently applies a slightly different workflow. There is no joint result data.frame, which allows us to use varying columns and utilize the structure provided by the broom package (I guess this is in line with the "tidy" philosophy that proposes to use columns for different variables of common observations). On top of that, we extract additional information from the analysis output to retain information that is needed for the reporting.

At this point, the formatting of the statistics (based on the augmented broom output) is hard-coded into the S3 methods that process the analysis output. We want to move towards a more generalized solution, though. The idea is to create a workflow that accepts a broom-like data.frame and, for example, a formula that specifies the formatting based on column names to produce an output. This would facilitate specifying alternative reporting styles using a common machinery and could be utilized by papaja's print_apa() as well as by inspect(), if that function could, under the hood, break up the joint result data.frame and spread it out.
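To make the idea concrete, here is a rough base R sketch of formatting driven by column names. It uses a "{column}" template string rather than a formula, purely for brevity. The function format_stats(), its digits argument, and the template syntax are hypothetical and not part of papaja or tidystats; the numbers are made up and only mimic the shape of a broom-like data.frame.

```r
# Hypothetical helper (not from papaja or tidystats): fill a "{column}"
# template once per row of a broom-like data.frame.
format_stats <- function(results, template, digits = c()) {
  out <- rep(template, nrow(results))
  for (col in names(results)) {
    value <- results[[col]]
    if (is.numeric(value)) {
      # Per-column number of decimals, defaulting to 2
      d <- if (col %in% names(digits)) digits[[col]] else 2
      value <- formatC(value, format = "f", digits = d)
    } else {
      value <- as.character(value)
    }
    # Replace the literal placeholder "{col}" with the formatted value
    out <- mapply(
      function(text, v) gsub(paste0("{", col, "}"), v, text, fixed = TRUE),
      out, value, USE.NAMES = FALSE
    )
  }
  out
}

# Made-up numbers in a broom-like layout (one column per statistic)
wide <- data.frame(statistic = 2.5, parameter = 28, p.value = 0.021)

apa <- format_stats(
  wide,
  "t({parameter}) = {statistic}, p = {p.value}",
  digits = c(parameter = 0, p.value = 3)
)
print(apa)
```

A different reporting style would then only need a different template, while the machinery stays the same.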

I was involved in a similar discussion previously and this kind of thing isn't easy to get a handle on. I don't think I'll be able to really tackle this before next summer, but I think in the long-term it would be worth the effort. (cc @mutlusun)

Any thoughts?

crsh, Aug 06 '18 09:08

Hey, thanks for getting back to me! No worries about the delay; we're all super-busy.

I agree with you that something like R Markdown is better for reporting. The problem that R Markdown simply does not address, however, is that people collaborate and that not everyone uses R Markdown. My current experience with R Markdown is that I would love to use it, but in reality it ends up being a nuisance. What it needs is a way to collaborate (with track changes in documents) that non-R people can also use. Then R Markdown is the way to go for sure. In the meantime, I use something like the inspect() function from tidystats to reduce the errors caused by manually retyping numbers (although it does not solve copy-paste errors).

Anyway, that's a separate debate!

Regarding the way tidystats and papaja work, you are right that tidystats converts the output of statistical tests (using S3 methods) to a long data frame. This data frame is what Hadley Wickham calls a tidy data frame. Ironically, the tidy() function from the broom package does not produce a tidy data frame; it simply converts the output of tests to a data frame for easy management. This is great if you want to subsequently do stuff to the output, but it's not so great for combining statistics; that requires a truly tidy data format and that's what tidystats does.

You write that you are considering moving towards a generalized solution. This is already close to my approach with the report() function in tidystats. report() looks at the method column of the results belonging to a specific identifier, and based on that it extracts the relevant statistics for reporting. So, it is not hard-coded. Anyone can use their own method to tidy the output of a particular analysis, and as long as they correctly name the kind of method and the statistics, the report() function will be able to report the results.
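As a rough illustration of that lookup (a base R sketch, not the actual tidystats implementation): the function report_sketch(), the formatters list, the identifiers, and all numbers below are hypothetical. Only the idea of choosing the formatting from the method column comes from the description above.

```r
# Hypothetical long-format results; identifiers and numbers are made up.
results <- data.frame(
  identifier = rep(c("M1_ttest", "M1_cor"), each = 3),
  method     = rep(c("t-test", "correlation"), each = 3),
  statistic  = c("t", "df", "p", "r", "df", "p"),
  value      = c(2.5, 28, 0.021, 0.45, 28, 0.012),
  stringsAsFactors = FALSE
)

# One formatter per method name; supporting a new method means adding an entry.
formatters <- list(
  "t-test" = function(s) {
    sprintf("t(%.0f) = %.2f, p = %.3f", s[["df"]], s[["t"]], s[["p"]])
  },
  "correlation" = function(s) {
    sprintf("r(%.0f) = %.2f, p = %.3f", s[["df"]], s[["r"]], s[["p"]])
  }
)

report_sketch <- function(results, id) {
  rows <- results[results$identifier == id, ]
  # Named vector: statistic name -> value
  stats <- setNames(rows$value, rows$statistic)
  method <- as.character(unique(rows$method))
  stopifnot(length(method) == 1, method %in% names(formatters))
  # Pick the formatter based on the method column
  formatters[[method]](stats)
}

print(report_sketch(results, "M1_cor"))
```

The formatting is selected by the name in the method column, so any code that produces rows with the expected method and statistic names can be reported.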

I think a main question is whether the output of a statistical test should be tidied the way that broom does it or the way that tidystats does it. That is, should the output of a test be converted to a data frame in which each column refers to a specific statistic (e.g., b, t, F), or should it be a tidy data frame with a statistic column and a value column? In the latter case, filters can be used to extract the correct statistics.
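A small base R sketch of the two shapes, with made-up numbers. The column names in the first data frame follow broom's conventions and those in the second follow the statistic/value layout described above; neither is actual output from broom or tidystats.

```r
# Broom-style: one row per term, one column per statistic (made-up numbers)
wide <- data.frame(
  term      = c("(Intercept)", "Days"),
  estimate  = c(251.4, 10.5),
  std.error = c(6.8, 1.5),
  statistic = c(36.8, 6.8),
  stringsAsFactors = FALSE
)

# Long style: one row per value, with a statistic column naming it
long <- data.frame(
  term      = rep(c("(Intercept)", "Days"), each = 3),
  statistic = rep(c("estimate", "SE", "t"), times = 2),
  value     = c(251.4, 6.8, 36.8, 10.5, 1.5, 6.8),
  stringsAsFactors = FALSE
)

# Extracting the t value for Days: a column lookup vs. a filter
t_wide <- wide$statistic[wide$term == "Days"]
t_long <- subset(long, term == "Days" & statistic == "t")$value

print(c(wide = t_wide, long = t_long))
```

In the wide shape every new kind of statistic needs a new column, whereas in the long shape it is just another row, which is why results from different tests can be stacked with rbind().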

An advantage of the more tidy data frame is that complex output can more easily be combined. As an illustration, if you want to tidy the output of a multilevel model, you can have something that looks like this:

# A tibble: 16 x 6
   group  term                     term_nr statistic    value method                   
   <chr>  <chr>                      <dbl> <chr>        <dbl> <chr>                    
 1 model  (Observations)                 1 N         180      Linear mixed model {lme4}
 2 model  Subject                        2 N          18      Linear mixed model {lme4}
 3 random Subject-(Intercept)            3 var       612.     Linear mixed model {lme4}
 4 random Subject-(Intercept)            3 SD         24.7    Linear mixed model {lme4}
 5 random Subject-Days                   4 var        35.1    Linear mixed model {lme4}
 6 random Subject-Days                   4 SD          5.92   Linear mixed model {lme4}
 7 random Subject-(Intercept)-Days       5 var         9.60   Linear mixed model {lme4}
 8 random Subject-(Intercept)-Days       5 SD          0.0656 Linear mixed model {lme4}
 9 random Residual                       6 var       655.     Linear mixed model {lme4}
10 random Residual                       6 SD         25.6    Linear mixed model {lme4}
11 fixed  (Intercept)                    7 estimate  251.     Linear mixed model {lme4}
12 fixed  (Intercept)                    7 SE          6.82   Linear mixed model {lme4}
13 fixed  (Intercept)                    7 t          36.8    Linear mixed model {lme4}
14 fixed  Days                           8 estimate   10.5    Linear mixed model {lme4}
15 fixed  Days                           8 SE          1.55   Linear mixed model {lme4}
16 fixed  Days                           8 t           6.77   Linear mixed model {lme4}

The broom package returns, I believe, only the fixed or the random components from a multilevel model, which shows that it is more difficult to put everything into one data frame when the results are not totally tidy.

So, I think tidystats might offer a way to get at your generalized solution for reporting statistics. I think there's a lot left to think about, but currently I see some hope that tidystats might serve a function similar to broom's and be used under the hood by packages like papaja.

WillemSleegers, Aug 16 '18 13:08