
Exploration and visualisation of missing data

Open njtierney opened this issue 8 years ago • 7 comments

In my PhD research I work with medical data and there are often large amounts of it missing. In my attempts to explore missing data problems and make my life easier I have done some work on two packages: ggmissing with Di Cook, and mex with Damjan Vukcevic. But, as my PhD research continues, I have been finding it hard to dedicate some serious time to continue work on these packages.

I'd like to propose a project on one, or perhaps both of these packages.

A bit more about them:

ggmissing extends ggplot to allow for missing data to be visualised. This would basically involve creating a couple of ggplot geom_missing_* functions that could be added as a layer to a plot. For example, geom_missing_point() would add in and colour the missing points. You can see more about it on the GitHub repo, and at these slides.
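To make the geom_missing_point() idea concrete, here is a rough, language-neutral sketch in Python. It is not ggmissing code: the function name `shift_missing`, the `offset_fraction` parameter, and the choice to place missing values just below the observed range are all assumptions made for illustration. It returns plotting positions plus a flag that a plot layer could map to colour.

```python
import math


def shift_missing(values, offset_fraction=0.1):
    """Return (plot_values, is_missing) so missing entries can still be drawn.

    Assumption: missing entries are placed a fixed fraction of the observed
    range below the observed minimum. This is one possible convention only.
    """
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        raise ValueError("need at least one observed value")
    lo, hi = min(observed), max(observed)
    span = (hi - lo) or 1.0  # avoid a zero offset when all values are equal
    placeholder = lo - offset_fraction * span

    plot_values, is_missing = [], []
    for v in values:
        missing = v is None or math.isnan(v)
        plot_values.append(placeholder if missing else v)
        is_missing.append(missing)
    return plot_values, is_missing


# Toy input, invented for this example.
x = [1.0, None, 3.0, 5.0]
plot_x, flag_x = shift_missing(x)
```

Applying the same shift to both axes of a scatterplot would put rows with a missing value into a margin along the relevant axis, where the flag can drive the colour.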

mex is a missingness exploration package. It builds on some research that I have done into using decision trees to explore missing data. The original idea of the package was to create a framework or even a recommended path for handling missing data. One idea was to break it into exploring, modelling, and confirming.

Exploring would include:

  • Creating a better, fast version of Little's MCAR test
  • Tabulation of missing data
  • Use of t-tests/chi-squared tests to explore whether missingness affects values/counts
  • Tools and variations on functions from previous work in packages like MissingDataGUI
  • Incorporating visualisations from visdat
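As a rough illustration of two of the exploring steps above (tabulating missing data, and checking whether missingness in one variable goes with different values of another), here is a minimal Python sketch using only the standard library. It is not mex code. The column names and the toy rows are invented, and `missing_counts` and `welch_t` are hypothetical helper names. Only the Welch t statistic is computed; a p-value would need a statistics library.

```python
import math
from statistics import mean, variance

# Invented toy data: None marks a missing value.
rows = [
    {"age": 34, "bmi": 22.1, "bp": 118},
    {"age": 51, "bmi": None, "bp": 135},
    {"age": 47, "bmi": 27.3, "bp": None},
    {"age": 62, "bmi": None, "bp": 141},
    {"age": 29, "bmi": 24.8, "bp": 121},
    {"age": 58, "bmi": None, "bp": 138},
]


def missing_counts(rows):
    """Count missing (None) entries per column."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


counts = missing_counts(rows)

# Compare age between rows where bmi is missing and rows where it is observed.
age_bmi_missing = [r["age"] for r in rows if r["bmi"] is None]
age_bmi_observed = [r["age"] for r in rows if r["bmi"] is not None]
t_stat = welch_t(age_bmi_missing, age_bmi_observed)
```

A large absolute t statistic would suggest that missingness in bmi is related to age, which is the kind of signal these exploring tools are meant to surface.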

Modelling would include:

  • Using machine learning methods to explore missing data.
  • Identifying clusters of missing data and then predicting these clusters with machine learning methods
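A very simple stand-in for "clusters of missing data" is to group rows by their exact missingness pattern. The Python sketch below does only that; it is a simplification assumed for illustration, not the decision-tree or machine learning approach described above, and the data and the `missingness_pattern` helper are invented.

```python
from collections import Counter

columns = ["age", "bmi", "bp"]

# Invented toy rows in column order; None marks a missing value.
rows = [
    (34, 22.1, 118),
    (51, None, 135),
    (47, 27.3, None),
    (62, None, 141),
    (29, 24.8, 121),
    (58, None, None),
]


def missingness_pattern(row):
    """Tuple of booleans, True where the value is missing."""
    return tuple(value is None for value in row)


# Each distinct pattern acts as a crude "cluster" label for its rows.
pattern_counts = Counter(missingness_pattern(row) for row in rows)
labels = [missingness_pattern(row) for row in rows]
```

The resulting labels could then serve as the response in a classifier, which is where the prediction step in the list above would come in.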

Confirming might be something like:

  • Using cross validation to explore how accurate the missing data mechanism is

I'm very much open to suggestions about how to implement these ideas.

njtierney avatar Mar 29 '16 13:03 njtierney

Snap! I have medical data with missing entries too. I'm interested in being able to visualise it and explore clusters of missingness as well as other types of data inconsistencies (e.g. end time before start time). I am hoping to bring a mockup of the kind of datasets that I use at work.

greenLeopard avatar Mar 30 '16 04:03 greenLeopard

The mice package (Multivariate Imputation by Chained Equations in R) has some good tools for imputation (MCAR/otherwise).

Also have a look at VIM::aggr for producing a neat plot of missing data.

e.g. http://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/

jonocarroll avatar Mar 30 '16 04:03 jonocarroll

Thanks Jonno,

VIM certainly does have some useful plots. What do you think about incorporating them into ggmissing?

njtierney avatar Mar 30 '16 23:03 njtierney

  • Summary of missingness (like the norm or MissingDataGUI package)
  • Missingness map: more options for setting order of rows and columns that work for large data
  • Vignette
  • Enable imputed values to position the points

Keep the package simple. Primary purpose is to make ggplot2 graphics that include the missings in the plot.
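One possible ordering for the missingness map is to sort rows and columns by how much is missing, so the densest regions sit together. The Python sketch below is an assumption about what such an option might do, not existing ggmissing behaviour; `order_by_missingness` is a hypothetical name and the matrix is invented.

```python
def order_by_missingness(matrix):
    """Reorder a boolean matrix (True = missing) by missing counts.

    Columns and rows with the most missing values come first.
    Returns the reordered matrix plus the row and column orders used.
    """
    n_cols = len(matrix[0])
    col_totals = [sum(row[j] for row in matrix) for j in range(n_cols)]
    col_order = sorted(range(n_cols), key=lambda j: -col_totals[j])
    row_order = sorted(range(len(matrix)), key=lambda i: -sum(matrix[i]))
    reordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return reordered, row_order, col_order


# Invented toy indicator matrix: rows are observations, columns are variables.
is_missing = [
    [False, True, False],
    [False, True, True],
    [False, False, False],
    [True, True, True],
]
reordered, row_order, col_order = order_by_missingness(is_missing)
```

Both sorts only need per-row and per-column counts, so the cost stays close to a single pass over the data plus two sorts, which matters for the large-data case.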

dicook avatar Apr 05 '16 04:04 dicook

I'm not very familiar with ggmissing, but I'd like to know more about it!

BTW, here is a nice example of a scatterplot with margins for missing values http://kbroman.org/d3panels/assets/test/scatterplot/

cpsievert avatar Apr 05 '16 04:04 cpsievert

7 votes from the AuUnconf... :) Might be worth continuing discussions around this..

jesse-jesse avatar Apr 28 '16 04:04 jesse-jesse

Nick created a channel on the AuUnconf slack account. Anyone interested can join discussions there also.

jesse-jesse avatar Apr 28 '16 04:04 jesse-jesse