Exploration and visualisation of missing data
In my PhD research I work with medical data, and large amounts of it are often missing. In my attempts to explore missing data problems and make my life easier, I have done some work on two packages: ggmissing, with Di Cook, and mex, with Damjan Vukcevic. But as my PhD research continues, I have been finding it hard to dedicate serious time to continuing work on these packages.
I'd like to propose a project on one, or perhaps both of these packages.
A bit more about them:
ggmissing extends ggplot to allow for missing data to be visualised. This would basically involve creating a couple of ggplot geom_missing_* functions that could be added as a layer to a plot. For example, geom_missing_point() would add in and colour the missing points. You can see more about it on the github repo, and at these slides.
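To make the geom_missing_point() idea concrete, here is a minimal sketch of one way to show missing values on a scatterplot with plain ggplot2. It is not the ggmissing implementation: the helper shadow_shift() and the 10% offset are assumptions made for illustration, and the built-in airquality data stands in for real data. Each missing value is moved to just below the observed minimum so the point can still be drawn, then coloured by whether it was missing.

```r
# Hypothetical helper (not part of ggmissing): replace each NA with a value
# a fixed proportion of the range below the observed minimum.
shadow_shift <- function(x, prop_below = 0.1) {
  rng <- range(x, na.rm = TRUE)
  ifelse(is.na(x), rng[1] - prop_below * diff(rng), x)
}

aq <- airquality
aq$any_missing <- is.na(aq$Ozone) | is.na(aq$Solar.R)
aq$Ozone_plot <- shadow_shift(aq$Ozone)
aq$Solar_plot <- shadow_shift(aq$Solar.R)

# Plot only if ggplot2 is installed; the data preparation above is base R.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(aq, aes(x = Solar_plot, y = Ozone_plot, colour = any_missing)) +
    geom_point() +
    labs(x = "Solar.R", y = "Ozone", colour = "Missing?")
  print(p)
}
```

Points that were missing in either variable then sit in a margin below or to the left of the observed data, instead of being silently dropped.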
mex is a missingness exploration package. It builds on some research that I have done into using decision trees to explore missing data. The original idea of the package was to create a framework, or even a recommended path, for handling missing data. One idea was to break it into exploring, modelling, and confirming.
Exploring would include:
- Creating a better, fast version of Little's MCAR test
- Tabulation of missing data
- Use of t-tests/chi^2 tests to explore whether missingness affects values/counts
- Tools and variations on functions from previous work in packages like MissingDataGUI
- Incorporating visualisations from visdat
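As a rough illustration of the tabulation and testing steps above, here is a base R sketch. It is not code from mex; the airquality data and the choice of Temp and Month as comparison variables are assumptions made for the example. Little's MCAR test is not attempted here.

```r
aq <- airquality

# Tabulate missingness: count and percentage of NAs per variable.
miss_table <- data.frame(
  variable = names(aq),
  n_missing = colSums(is.na(aq)),
  pct_missing = round(100 * colMeans(is.na(aq)), 1),
  row.names = NULL
)
print(miss_table)

# Does missingness in Ozone relate to the values of Temp?
# Welch two-sample t-test of Temp where Ozone is missing vs observed.
ozone_missing <- is.na(aq$Ozone)
tt <- t.test(aq$Temp ~ ozone_missing)
print(tt)

# Does missingness in Ozone relate to counts across Month?
# Chi-squared test on the Month x missingness contingency table.
ct <- chisq.test(table(aq$Month, ozone_missing))
print(ct)
```

A small p-value in either test would suggest the missingness is related to an observed variable, which is evidence against the data being missing completely at random.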
Modelling would include:
- Using machine learning methods to explore missing data.
- Identifying clusters of missing data and then predicting these clusters with machine learning methods
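One possible reading of the modelling step, sketched with a decision tree: treat "is this value missing?" as the response and see which observed variables predict it. This assumes the rpart package and the airquality data purely for illustration; it is not the mex interface, and the choice of predictors is arbitrary.

```r
library(rpart)  # a recommended package, included with standard R installs

aq <- airquality
aq$ozone_missing <- factor(is.na(aq$Ozone),
                           levels = c(FALSE, TRUE),
                           labels = c("observed", "missing"))

# Classification tree: which observed variables predict missing Ozone?
fit <- rpart(ozone_missing ~ Solar.R + Wind + Temp + Month,
             data = aq, method = "class")
print(fit)

# In-sample predicted class for each row, compared with the truth.
pred <- predict(fit, newdata = aq, type = "class")
print(table(actual = aq$ozone_missing, predicted = pred))
```

The splits in the printed tree give a readable description of where the missing values concentrate, which is the main appeal of trees for this kind of exploration.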
Confirming might be something like:
- Using cross validation to explore how accurate the missing data mechanism is
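A sketch of what the confirming step could look like, under the assumption that "accuracy" means out-of-sample accuracy of a model that predicts missingness. It uses 5-fold cross validation with an rpart tree on the airquality data; the fold count, seed, and predictors are all arbitrary choices for the example.

```r
library(rpart)

set.seed(1)  # arbitrary seed so the fold assignment is reproducible
aq <- airquality
aq$ozone_missing <- factor(is.na(aq$Ozone),
                           levels = c(FALSE, TRUE),
                           labels = c("observed", "missing"))

k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(aq)))

# Fit on k - 1 folds, predict the held-out fold, record the accuracy.
cv_accuracy <- sapply(seq_len(k), function(i) {
  train <- aq[fold != i, ]
  test <- aq[fold == i, ]
  fit <- rpart(ozone_missing ~ Solar.R + Wind + Temp + Month,
               data = train, method = "class")
  pred <- predict(fit, newdata = test, type = "class")
  mean(pred == test$ozone_missing)
})

# Baseline: always predict the majority class ("observed").
baseline <- mean(aq$ozone_missing == "observed")

print(cv_accuracy)
print(mean(cv_accuracy))
print(baseline)
```

If the cross-validated accuracy is no better than the baseline, the model has not found a usable structure in the missingness.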
I'm very much open to suggestions about how to implement these ideas.
Snap! I have medical data with missing entries too. I'm interested in being able to visualise it and explore clusters of missingness, as well as other types of data inconsistencies (e.g. end time before start time). I am hoping to bring a mockup of the kind of datasets that I use at work.
The mice package (Multivariate Imputation by Chained Equations in R) has some good tools for imputation (MCAR/otherwise).
Also have a look at VIM::aggr for producing a neat plot of missing data.
e.g. http://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
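For anyone who wants to try these quickly, a minimal sketch, assuming both packages are installed and using the built-in airquality data as a stand-in. The calls are guarded so the script still runs if either package is absent; the plot = FALSE argument to md.pattern() assumes a reasonably recent version of mice.

```r
aq <- airquality
pattern <- NULL
agg <- NULL

if (requireNamespace("mice", quietly = TRUE)) {
  # Matrix of missing-data patterns (rows) by variable (columns).
  pattern <- mice::md.pattern(aq, plot = FALSE)
  print(pattern)
}

if (requireNamespace("VIM", quietly = TRUE)) {
  # Bar chart of missing proportions plus a plot of missingness combinations.
  agg <- VIM::aggr(aq, numbers = TRUE, sortVars = TRUE)
}
```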
Thanks Jonno,
VIM certainly does have some useful plots. What do you think about incorporating them into ggmissing?
- Summary of missingness (like the norm or MissingDataGUI package)
- Missingness map: more options for setting order of rows and columns that work for large data
- Vignette
- Enable imputed values to position the points
Keep the package simple. Primary purpose is to make ggplot2 graphics that include the missings in the plot.
I'm not very familiar with ggmissing, but I'd like to know more about it!
BTW, here is a nice example of a scatterplot with margins for missing values http://kbroman.org/d3panels/assets/test/scatterplot/
7 votes from the AuUnconf... :) Might be worth continuing discussions around this..
Nick created a channel on the AuUnconf slack account. Anyone interested can join discussions there also.