splitstackshape record is lost if splitCols contains empty value

Example of code to reproduce:

dt1 <- fread("V1 V2 V3
             x b;c;d 1
             y d;ef  2
             z d;ef  3
             m tmp   4
             n tmp   5")
dt1[4, V2:=''] # this record will be lost because V2 value is empty string
dt1[5, V2:=NA_character_] # NA value is processed correctly

cSplit(dt1, splitCols = 'V2', sep = ';', direction = 'long')

as you can see, record 4 (where V2=='') is lost.

Jan 17 '16 04:01 aushev

@aushev , does cSplit(dt1, splitCols = 'V2', sep = ';', direction = 'long', stripWhite = FALSE) do what you expect?

cSplit(dt1, splitCols = "V2", sep = ";", direction = "long", stripWhite = FALSE)
##    V1 V2 V3
## 1:  x  b  1
## 2:  x  c  1
## 3:  x  d  1
## 4:  y  d  2
## 5:  y ef  2
## 6:  z  d  3
## 7:  z ef  3
## 8:  m     4
## 9:  n NA  5

Jan 17 '16 06:01 mrdwab

@mrdwab, no, unfortunately it gives the same result

Jan 17 '16 06:01 aushev

@aushev , sorry. I was testing with different data--where the empty cells are replaced with " ". In other words, the approach would work with:

cSplit(replace(dt1, dt1 == "", " "), splitCols = "V2", sep = ";", direction = "long", stripWhite = FALSE)
##    V1 V2 V3
## 1:  x  b  1
## 2:  x  c  1
## 3:  x  d  1
## 4:  y  d  2
## 5:  y ef  2
## 6:  z  d  3
## 7:  z ef  3
## 8:  m     4
## 9:  n NA  5

That's a possible workaround.

This happens because the approach used when direction = "long" is essentially:

dt1[, list(unlist(strsplit(V2, ";"))), by = V1]
##    V1 V1
## 1:  x  b
## 2:  x  c
## 3:  x  d
## 4:  y  d
## 5:  y ef
## 6:  z  d
## 7:  z ef
## 8:  n NA

Jan 17 '16 06:01 mrdwab

yes, in my case I came to the same workaround: I replace empty values before calling cSplit(). But I prefer to replace not with whitespace but with sep - in this case you don't need to add stripWhite = FALSE:

cSplit(replace(dt1, dt1 == "", ";"), splitCols = "V2", sep = ";", direction = "long")

However, I still believe that current behaviour is not very good: without knowing about this issue, user can loose data. Maybe it's worth adding a warning, or at least add this info in the help section?

Jan 17 '16 06:01 aushev

@aushev , Sure. If you would like to propose the edits to the documentation via a pull request, that would be appreciated.

I should note, however, that this behavior is consistent with other approaches, like unnest from "tidyr":

library(dplyr)
library(tidyr)
dt1 %>% mutate(V2 = strsplit(V2, ";")) %>% unnest(V2)
## Source: local data frame [8 x 3]
## 
##      V1    V3    V2
##   (chr) (int) (chr)
## 1     x     1     b
## 2     x     1     c
## 3     x     1     d
## 4     y     2     d
## 5     y     2    ef
## 6     z     3     d
## 7     z     3    ef
## 8     n     5    NA

Jan 17 '16 06:01 mrdwab

encountered the same issue - it would be great to have an option of keeping rows with empty strings. thanks

Jun 10 '16 12:06 bilydr

@aushev, @bilydr, I haven't had a chance to do any work on this, but the tidyr package (version 0.5.0) has a new function called separate_rows which does what you are looking for.

Jun 14 '16 04:06 mrdwab

Will be fixed in the forthcoming version of cSplit:

cSplit_fread(dt1, "V2", ";", "long")
##    V1 V2 V3
## 1:  x  1  1
## 2:  x  2  1
## 3:  x  3  1
## 4:  y  1  2
## 5:  y  3  2
## 6:  z  2  3
## 7:  z  4  3
## 8:  m NA  4
## 9:  n NA  5

Note that all empty spaces are filled with NA.

Mar 08 '18 18:03 mrdwab

splitstackshape splitstackshape copied to clipboard

record is lost if splitCols contains empty value

splitstackshape
splitstackshape copied to clipboard