dataMaid
Support list columns
List columns can cause errors in clean().
library(tibble)
library(dataMaid)  # provides clean()
# Build a data frame whose only column is a list column
d <- data_frame(x = as.list(rep(1:2, 5)))
clean(d, replace = TRUE)
## Error in UseMethod("check") :
## no applicable method for 'check' applied to an object of class "list"
## Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
## invalid 'row.names' length
## Data cleaning is finished. Please wait while your output file is being rendered.
I am having a hard time coming up with ideas for relevant checks and summaries to perform on (all) lists. The very core idea of dataMaid is to perform a standard suite of checks for each variable class. Do you have any suggestions for relevant checks for lists in mind yourself? Or did you perhaps have a specific example in mind, when you opened this issue?
If you have a list column inside a data frame, you typically want each element to have the same form. For example, if you call strsplit(), then the output is a list of character vectors, and you might want to store this as a field in a data frame.
So some useful checks on list columns are "Does each element have the same class/typeof/length/dim?".
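As a rough illustration of what such checks could look like, here is a minimal base-R sketch. The helper name `list_uniformity` is hypothetical and not part of dataMaid; it only reports whether all elements of a list share the same class, type, length, and dimensions.

```r
# Hypothetical helper (not part of dataMaid): report whether all elements
# of a list column share the same class, typeof, length and dim.
list_uniformity <- function(x) {
  stopifnot(is.list(x))
  # TRUE when there is at most one distinct value
  all_same <- function(values) length(unique(values)) <= 1

  # Collapse multi-valued attributes into one string per element
  classes <- vapply(x, function(el) paste(class(el), collapse = "/"), character(1))
  types   <- vapply(x, typeof, character(1))
  dims    <- vapply(x, function(el) paste(dim(el), collapse = "x"), character(1))

  c(
    same_class  = all_same(classes),
    same_typeof = all_same(types),
    same_length = all_same(lengths(x)),
    same_dim    = all_same(dims)
  )
}

# The strsplit() case: every element is a character vector,
# but the lengths differ from row to row.
tokens <- strsplit(c("a b", "c d e"), " ")
res <- list_uniformity(tokens)
print(res)
```

In this example the class and type checks pass while the length check does not, which is the kind of per-property result a report could show rather than a single pass/fail flag.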
I do see the point in your concrete example, but I'm concerned that other people would use lists differently in datasets. Personally, I would usually choose to store something in a list (rather than a vector) exactly because the entries were of different data types or varying lengths, and even though that does not instantly generalize to the role of lists in data.frames, I imagine there are others that think like me. So if clean() tests e.g. that all elements in a list variable have the same class, length and dimensions, I can easily imagine that all list variables would almost always be marked as problematic, as one rarely wants all of those features simultaneously... The list is such a flexible class that I'm afraid standardized problem flagging is simply not a suitable strategy when looking for mistakes.
I will consider implementing a list extension for dataMaid, possibly with no default check/summarize/visualize functions, so that it is up to each (advanced) user to implement the tools they need. In either case, I will soon implement a check for variable class, so that the error is replaced by an informative message in the generated report.
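Until such a class check exists, one possible workaround is to find the list columns yourself and set them aside before cleaning. The sketch below uses base R only; `find_list_columns` is a hypothetical name, and this is not how dataMaid implements (or will implement) its own check.

```r
# Hypothetical pre-check (not dataMaid's implementation): find list columns
# so they can be reported and set aside before running the standard checks.
find_list_columns <- function(data) {
  stopifnot(is.data.frame(data))
  # is.list() is also TRUE for nested data frame columns
  is_list_col <- vapply(data, is.list, logical(1))
  names(data)[is_list_col]
}

d <- data.frame(id = 1:3)
d$tokens <- strsplit(c("a b", "c", "d e f"), " ")  # adds a list column

skipped <- find_list_columns(d)
if (length(skipped) > 0) {
  message("No standard checks for list columns, skipping: ",
          paste(skipped, collapse = ", "))
}

# Keep only the columns that the standard checks can handle
checkable <- d[setdiff(names(d), skipped)]
```

The remaining `checkable` data frame can then be passed to clean(), with the skipped columns reviewed by hand or with user-written checks.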