splitstackshape
splitstackshape copied to clipboard
how cSplit() treats multiple splitCols when they contain different number of fields
Here is example table:
dt1 <- fread("V1 V2 V3
x xA;xB;xC x1;x2;x3
y yD y1
z zF;zG z1")
and I want to split it by both V2
and V3
columns. You can see that the last record is "wrong": V2
has 2 values while V3
has only one. And that how cSplit()
treats those cases:
# with default arguments:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long')
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: y NA NA
#6: y NA NA
#7: z zF z1
#8: z zG NA
#9: z NA NA
# with `makeEqual = TRUE`:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = T)
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: y NA NA
#6: y NA NA
#7: z zF z1
#8: z zG NA
#9: z NA NA
So, by default it works like with makeEqual = TRUE
while in the help it is said Defaults to FALSE
. Then I tried with FALSE
:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = F)
# Warning in `[.data.table`(indt, , `:=`(eval(splitCols), lapply(X, function(x) { :
# Supplied 5 items to be assigned to 6 items of column 'V3' (recycled leaving remainder of 1 items).
# V1 V2 V3
# 1: x xA x1
# 2: x xB x2
# 3: x xC x3
# 4: y yD y1
# 5: z zF z1
# 6: z zG x1
It recycles V3
elements but it takes it from another group which is kinda unexpected. I think it would be more logical to give one of the following outputs:
# without recycling, fill with NA:
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: z zF z1
#6: z zG NA
# with recycling:
# V1 V2 V3
#1: x xA x1
#2: x xB x2
#3: x xC x3
#4: y yD y1
#5: z zF z1
#6: z zG z1
@aushev Thanks for this. I must have missed something when I had rewritten the function. Time to write some tests, I guess!
A workaround, until I get a chance to look into this further, would be to make use of is.na
, for instance with is.na(V2) & is.na(V3)
.
Demo, with your sample data:
cSplit(dt1, c("V2", "V3"), ";", "long")[!(is.na(V2) & is.na(V3))]
# V1 V2 V3
# 1: x xA x1
# 2: x xB x2
# 3: x xC x3
# 4: y yD y1
# 5: z zF z1
# 6: z zG NA
Thanks for a workaround! (and will be looking forward for a new version :) )
Note to self: Test performance of the following:
cols <- paste0("V", 2:ncol(dt1))
fun1 <- function() {
cSplit(dt1, cols, ";", "long")[rowSums(
is.na(do.call(cbind, mget(cols)))) != length(cols)]
}
fun2 <- function() {
cSplit(dt1, cols, ";", "long")[
, nas := Reduce("+", lapply(.SD, is.na)),
.SDcols = cols][nas != length(cols)][
, nas := NULL][]
}
fun3 <- function() {
temp <- cSplit(dt1, cols, ";", "long")
temp[!(rowSums(is.na(temp[, cols, with = FALSE])) == length(cols))]
}
## This one might be the best....
fun4 <- function() {
temp <- cSplit(dt1, cols, ";", "long")
temp[Reduce("+", lapply(temp[, cols, with = FALSE], is.na)) != length(cols)]
}
There can also be a function like this:
long_fixer <- function(indt, cols) {
indt[Reduce("+", lapply(indt[, cols, with = FALSE], is.na)) != length(cols)]
}
## Usage
cols <- c("V2", "V3")
long_fixer(cSplit(dt1, cols, sep=';', 'long'), cols)
# V1 V2 V3
# 1: x xA x1
# 2: x xB x2
# 3: x xC x3
# 4: y yD y1
# 5: z zF z1
# 6: z zG NA
@aushev long_fixer
might be a more convenient temporary workaround for you.
Will be fixed with the next version of cSplit()
:
dt2 <- fread("V1 V2 V3
x xA;xB;xC x1;x2;x3
y yD y1
z zF;zG z1")
# Expect 6 x 3, NA at [6, 3]
cSplit_fread(dt2, c("V2", "V3"), ";")
# V1 V2 V3
# 1: x xA x1
# 2: x xB x2
# 3: x xC x3
# 4: y yD y1
# 5: z zF z1
# 6: z zG NA