splitstackshape
splitstackshape copied to clipboard
record is lost if splitCols contains empty value
Example of code to reproduce:
dt1 <- fread("V1 V2 V3
x b;c;d 1
y d;ef 2
z d;ef 3
m tmp 4
n tmp 5")
dt1[4, V2:=''] # this record will be lost because V2 value is empty string
dt1[5, V2:=NA_character_] # NA value is processed correctly
cSplit(dt1, splitCols = 'V2', sep = ';', direction = 'long')
as you can see, record 4 (where V2=='') is lost.
@aushev , does cSplit(dt1, splitCols = 'V2', sep = ';', direction = 'long', stripWhite = FALSE)
do what you expect?
cSplit(dt1, splitCols = "V2", sep = ";", direction = "long", stripWhite = FALSE)
## V1 V2 V3
## 1: x b 1
## 2: x c 1
## 3: x d 1
## 4: y d 2
## 5: y ef 2
## 6: z d 3
## 7: z ef 3
## 8: m 4
## 9: n NA 5
@mrdwab, no, unfortunately it gives the same result
@aushev , sorry. I was testing with different data--where the empty cells are replaced with " "
. In other words, the approach would work with:
cSplit(replace(dt1, dt1 == "", " "), splitCols = "V2", sep = ";", direction = "long", stripWhite = FALSE)
## V1 V2 V3
## 1: x b 1
## 2: x c 1
## 3: x d 1
## 4: y d 2
## 5: y ef 2
## 6: z d 3
## 7: z ef 3
## 8: m 4
## 9: n NA 5
That's a possible workaround.
This happens because the approach used when direction = "long"
is essentially:
dt1[, list(unlist(strsplit(V2, ";"))), by = V1]
## V1 V1
## 1: x b
## 2: x c
## 3: x d
## 4: y d
## 5: y ef
## 6: z d
## 7: z ef
## 8: n NA
yes, in my case I came to the same workaround: I replace empty values before calling cSplit()
. But I prefer to replace not with whitespace but with sep
- in this case you don't need to add stripWhite = FALSE
:
cSplit(replace(dt1, dt1 == "", ";"), splitCols = "V2", sep = ";", direction = "long")
However, I still believe that current behaviour is not very good: without knowing about this issue, user can loose data. Maybe it's worth adding a warning, or at least add this info in the help section?
@aushev , Sure. If you would like to propose the edits to the documentation via a pull request, that would be appreciated.
I should note, however, that this behavior is consistent with other approaches, like unnest
from "tidyr":
library(dplyr)
library(tidyr)
dt1 %>% mutate(V2 = strsplit(V2, ";")) %>% unnest(V2)
## Source: local data frame [8 x 3]
##
## V1 V3 V2
## (chr) (int) (chr)
## 1 x 1 b
## 2 x 1 c
## 3 x 1 d
## 4 y 2 d
## 5 y 2 ef
## 6 z 3 d
## 7 z 3 ef
## 8 n 5 NA
encountered the same issue - it would be great to have an option of keeping rows with empty strings. thanks
@aushev, @bilydr, I haven't had a chance to do any work on this, but the tidyr
package (version 0.5.0) has a new function called separate_rows
which does what you are looking for.
Will be fixed in the forthcoming version of cSplit
:
cSplit_fread(dt1, "V2", ";", "long")
## V1 V2 V3
## 1: x 1 1
## 2: x 2 1
## 3: x 3 1
## 4: y 1 2
## 5: y 3 2
## 6: z 2 3
## 7: z 4 3
## 8: m NA 4
## 9: n NA 5
Note that all empty spaces are filled with NA
.