splitstackshape icon indicating copy to clipboard operation
splitstackshape copied to clipboard

how cSplit() treats multiple splitCols when they contain different number of fields

Open aushev opened this issue 8 years ago • 4 comments

Here is example table:

dt1 <- fread("V1 V2       V3
              x  xA;xB;xC x1;x2;x3
              y  yD       y1
              z  zF;zG    z1")

and I want to split it by both V2 and V3 columns. You can see that the last record is "wrong": V2 has 2 values while V3 has only one. And that how cSplit() treats those cases:

# with default arguments:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long')
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  y NA NA
#6:  y NA NA
#7:  z zF z1
#8:  z zG NA
#9:  z NA NA

# with `makeEqual = TRUE`:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = T)
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  y NA NA
#6:  y NA NA
#7:  z zF z1
#8:  z zG NA
#9:  z NA NA

So, by default it works like with makeEqual = TRUE while in the help it is said Defaults to FALSE. Then I tried with FALSE:

cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = F)
# Warning in `[.data.table`(indt, , `:=`(eval(splitCols), lapply(X, function(x) { :
#     Supplied 5 items to be assigned to 6 items of column 'V3' (recycled leaving remainder of 1 items).
#      V1 V2 V3
#   1:  x xA x1
#   2:  x xB x2
#   3:  x xC x3
#   4:  y yD y1
#   5:  z zF z1
#   6:  z zG x1

It recycles V3 elements but it takes it from another group which is kinda unexpected. I think it would be more logical to give one of the following outputs:

# without recycling, fill with NA:
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  z zF z1
#6:  z zG NA

# with recycling:
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  z zF z1
#6:  z zG z1

aushev avatar Mar 23 '16 06:03 aushev

@aushev Thanks for this. I must have missed something when I had rewritten the function. Time to write some tests, I guess!

A workaround, until I get a chance to look into this further, would be to make use of is.na, for instance with is.na(V2) & is.na(V3).

Demo, with your sample data:

cSplit(dt1, c("V2", "V3"), ";", "long")[!(is.na(V2) & is.na(V3))]
#    V1 V2 V3
# 1:  x xA x1
# 2:  x xB x2
# 3:  x xC x3
# 4:  y yD y1
# 5:  z zF z1
# 6:  z zG NA

mrdwab avatar Mar 23 '16 07:03 mrdwab

Thanks for a workaround! (and will be looking forward for a new version :) )

aushev avatar Mar 23 '16 07:03 aushev

Note to self: Test performance of the following:

cols <- paste0("V", 2:ncol(dt1))
fun1 <- function() {
  cSplit(dt1, cols, ";", "long")[rowSums(
    is.na(do.call(cbind, mget(cols)))) != length(cols)]
} 

fun2 <- function() {
  cSplit(dt1, cols, ";", "long")[
    , nas := Reduce("+", lapply(.SD, is.na)), 
    .SDcols = cols][nas != length(cols)][
      , nas := NULL][]
} 

fun3 <- function() {
  temp <- cSplit(dt1, cols, ";", "long")
  temp[!(rowSums(is.na(temp[, cols, with = FALSE])) == length(cols))]
}

## This one might be the best....
fun4 <- function() {
  temp <- cSplit(dt1, cols, ";", "long")
  temp[Reduce("+", lapply(temp[, cols, with = FALSE], is.na)) != length(cols)]
}

There can also be a function like this:

long_fixer <- function(indt, cols) {
  indt[Reduce("+", lapply(indt[, cols, with = FALSE], is.na)) != length(cols)]
}

## Usage

cols <- c("V2", "V3")
long_fixer(cSplit(dt1, cols, sep=';', 'long'), cols)
#    V1 V2 V3
# 1:  x xA x1
# 2:  x xB x2
# 3:  x xC x3
# 4:  y yD y1
# 5:  z zF z1
# 6:  z zG NA

@aushev long_fixer might be a more convenient temporary workaround for you.

mrdwab avatar Mar 23 '16 08:03 mrdwab

Will be fixed with the next version of cSplit():

dt2 <- fread("V1 V2       V3
              x  xA;xB;xC x1;x2;x3
              y  yD       y1
              z  zF;zG    z1")

# Expect 6 x 3, NA at [6, 3]
cSplit_fread(dt2, c("V2", "V3"), ";")
#    V1 V2 V3
# 1:  x xA x1
# 2:  x xB x2
# 3:  x xC x3
# 4:  y yD y1
# 5:  z zF z1
# 6:  z zG NA

mrdwab avatar Mar 10 '18 11:03 mrdwab