
Support list columns

Open richierocks opened this issue 7 years ago • 3 comments

List columns can cause errors in clean().

library(tibble)
library(dataMaid)
d <- data_frame(x = as.list(rep(1:2, 5)))
clean(d, replace = TRUE)
## Error in UseMethod("check") : 
##   no applicable method for 'check' applied to an object of class "list"
## Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
##   invalid 'row.names' length
## Data cleaning is finished. Please wait while your output file is being rendered.

richierocks avatar Mar 29 '17 15:03 richierocks

I am having a hard time coming up with relevant checks and summaries to perform on (all) lists. The core idea of dataMaid is to perform a standard suite of checks for each variable class. Do you have any suggestions for relevant checks for lists? Or did you have a specific example in mind when you opened this issue?

annennenne avatar Mar 31 '17 11:03 annennenne

If you have a list column inside a data frame, you typically want each element to have the same form. For example, if you call strsplit(), then the output is a list of character vectors, and you might want to store this as a field in a data frame.

So some useful checks on list columns are "Does each element have the same class/typeof/length/dim?".
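As a rough illustration of those checks, here is a minimal sketch in base R. The helper name `list_column_consistency` is hypothetical and not part of dataMaid; it only reports whether all elements of a list column agree on each property.

```r
# Hypothetical helper (not part of dataMaid): reports whether the elements
# of a list column agree on class, typeof, length and dim.
list_column_consistency <- function(x) {
  stopifnot(is.list(x))
  # TRUE when every element yields an identical value for the property f
  all_same <- function(f) length(unique(lapply(x, f))) <= 1L
  c(
    class  = all_same(class),
    typeof = all_same(typeof),
    length = all_same(length),
    dim    = all_same(dim)
  )
}

# strsplit() returns a list of character vectors, one per input string
d <- data.frame(id = 1:3)
d$tokens <- strsplit(c("a b", "c d e", "f"), " ")

list_column_consistency(d$tokens)
```

Here the elements share a class and type but differ in length, so only the `length` entry should come back `FALSE`. Reporting each property separately lets a user decide which mismatches matter for their data.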

richierocks avatar Mar 31 '17 14:03 richierocks

I do see the point in your concrete example, but I'm concerned that other people use lists differently in datasets. Personally, I would usually choose to store something in a list (rather than a vector) exactly because the entries were of different data types or varying lengths. Even though that does not immediately generalize to the role of lists in data.frames, I imagine there are others who think like me. So if clean() tests, e.g., that all elements in a list variable have the same class, length and dimensions, I can easily imagine that almost all list variables would be marked as problematic, as one rarely wants all of those features simultaneously. The list is such a flexible class that I'm afraid standardized problem flagging is simply not a suitable strategy when looking for mistakes.

I will consider implementing a list extension for dataMaid, possibly with no default check/summarize/visualize functions, so that it is up to each (advanced) user to implement the tools they need. In either case, I will implement a check for variable class soon, so that the error is replaced by an informative message in the output report.
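One way such a class check could work is an S3 default method that catches unsupported classes instead of letting `UseMethod()` fail. This is only a sketch of the idea: the generic `check_var` and its return format are hypothetical stand-ins, not dataMaid's actual functions.

```r
# Hypothetical generic, standing in for a per-class check function
check_var <- function(v, ...) UseMethod("check_var")

# Example method for a supported class
check_var.numeric <- function(v, ...) {
  list(
    supported = TRUE,
    message = sprintf("%d missing value(s)", sum(is.na(v)))
  )
}

# Fallback: reached for any class without its own method, e.g. "list"
check_var.default <- function(v, ...) {
  list(
    supported = FALSE,
    message = sprintf(
      "Variables of class '%s' are not supported; no checks were performed.",
      paste(class(v), collapse = "/")
    )
  )
}

d <- data.frame(x = c(1, NA, 3))
d$y <- as.list(1:3)

# Every column gets a result; the list column yields a message, not an error
results <- lapply(d, check_var)
results$y$message
```

Users who need list-specific checks could then supply their own `check_var.list` method, which would take precedence over the default.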

annennenne avatar Apr 05 '17 15:04 annennenne