splitstackshape
Option to add colnames to new columns
First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would just like to suggest a convenience option to skip column renaming after splitting. Example:
to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated",
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated",
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L,
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA,
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
# Reads Sample_1 Sample_2 Sample_3 Sample_4
# 1: 470987 N2 wt rep1 untreated
# 2: 270891 N2 wt rep1 untreated
# 3: 56114 N2 wt rep1 untreated
# 4: 513902 N2 wt rep2 untreated
# 5: 310722 N2 wt rep2 untreated
# 6: 67263 N2 wt rep2 untreated
The new col names are not very informative, so I usually rename them in an extra step:
setnames(split,
c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
c("Background", "Allele", "Replicate", "Treatment")
)
This is fine, but I wonder if it would be possible to skip that extra step with something like cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment")).
Cheers.
Thanks @adomingues for the comment. I've thought about this in the past. It shouldn't be too difficult to implement, so I'll look into it again.
Here are a couple of reasons I didn't implement it the first time around:
- The cSplit function is generalized in the sense that I should be able to split a column without knowing how many columns would be in the result.
- The cSplit function is vectorized, so a simple new_names = c(...) wouldn't work; it would have to be something like list(Sample = c("Background", "Allele", "Replicate", "Treatment")).
Any thoughts on those?
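To make the second point concrete, here is a minimal sketch, assuming the name_number naming shown in the cSplit output above. The example data and the new_names list are made up for illustration. When two columns are split at once, each can yield a different number of new columns, so a flat vector of names would be ambiguous while a named list keeps the mapping explicit:

```r
library(splitstackshape)

# Hypothetical example data: y splits into 2 columns, z into 3
df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))
res <- cSplit(df, c("y", "z"), sep = ",")

# Count how many columns each source column produced
n_generated <- sapply(c("y", "z"), function(col) {
  sum(grepl(paste0("^", col, "_[0-9]+$"), names(res)))
})
n_generated

# A named list (hypothetical new_names) can be checked against those counts
new_names <- list(y = c("A", "B"), z = c("AA", "BB", "CC"))
stopifnot(identical(lengths(new_names)[names(n_generated)], n_generated))
```

The count is only known after the split has run, which is why any check of user-supplied names would have to happen at the end of the function.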
Thanks for considering this @mrdwab. I was thinking about the implementation after posting, and my very naive thought was to operate on the colnames after splitting. For instance, grepping the colnames and replacing only those:
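cSplit2 <- function(indt, splitCols, newNames, ...){
  # Split using the arguments that were passed in; extra arguments such as sep go through ...
  split <- cSplit(indt, splitCols, ...)
  # Positions of the columns generated from splitCols (e.g. Sample_1, Sample_2, ...)
  newcols <- grep(paste(splitCols, collapse="|"), colnames(split))
  colnames(split)[newcols] <- newNames
  return(split)
}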
cSplit2(to_split, splitCols = "Sample", sep="_", newNames = c("Background", "Allele", "Replicate", "Treatment"))
This is of course the opposite of what you suggested :) but I wonder if it would be a good starting point.
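One caveat with the grep approach, sketched below in base R only: an unanchored pattern also matches any existing column whose name merely contains the split column's name. The column names here are hypothetical, and the anchored pattern assumes the name_number convention seen in the cSplit output above:

```r
# Hypothetical column names after a split: "Sample_ID" was already in the data
cols <- c("Reads", "Sample_ID", "Sample_1", "Sample_2")
splitCols <- "Sample"

# Unanchored pattern, as in cSplit2: also picks up Sample_ID
loose <- grep(paste(splitCols, collapse = "|"), cols, value = TRUE)
loose
# "Sample_ID" "Sample_1" "Sample_2"

# Anchored pattern: only <name>_<number> columns are selected
pattern <- paste0("^(", paste(splitCols, collapse = "|"), ")_[0-9]+$")
strict <- grep(pattern, cols, value = TRUE)
strict
# "Sample_1" "Sample_2"
```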
@adomingues, here's a POC renamer function that I can probably drop in at the last stages of the existing cSplit function. Here, I'm just demonstrating it as an external function:
library(splitstackshape)
library(data.table)
df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))
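renamer <- function(data, replacement) {
  if (!is.list(replacement)) stop("replacement should be a named list")
  for (i in seq_along(replacement)) {
    # Columns generated from the i-th split column, e.g. y_1, y_2
    old <- names(data)[startsWith(names(data), names(replacement)[i])]
    # Rename in place (by reference)
    setnames(data, old = old, new = replacement[[i]])
  }
  # The trailing [] makes sure the modified data.table prints
  data[]
}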
cSplit(df, c("y", "z"))
# x y_1 y_2 z_1 z_2 z_3
# 1: 1 a <NA> 1 NA NA
# 2: 2 d e 2 3 4
# 3: 3 g h 6 NA NA
renamer(cSplit(df, c("y", "z")),
list(y = c("A", "B"), z = c("AA", "BB", "CC")))
# x A B AA BB CC
# 1: 1 a <NA> 1 NA NA
# 2: 2 d e 2 3 4
# 3: 3 g h 6 NA NA
So, a possible final implementation might look like:
cSplit(df, c("y", "z"), sep = ",", new_names = list(y = c("A", "B"), z = c("AA", "BB", "CC")))
Alternatively, the entire API can be revisited such that, depending on the input, the function behaves differently:
- If a simple character string of column names is provided, use the current approach.
- If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ",")).
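As a rough sketch of how that input-dependent behaviour might look from the outside, the wrapper below dispatches on the type of splitCols. The name cSplit_v2 is hypothetical, and this is not how the package implements (or will implement) the feature. It only wraps the existing cSplit and data.table's setnames, and it assumes the name_number column naming shown in the outputs above:

```r
library(splitstackshape)
library(data.table)

# Hypothetical front end: splitCols is a character vector or a named list
cSplit_v2 <- function(indt, splitCols, sep = ",", ...) {
  # Character input: behave exactly like the current cSplit
  if (is.character(splitCols)) {
    return(cSplit(indt, splitCols, sep = sep, ...))
  }
  stopifnot(is.list(splitCols), !is.null(names(splitCols)))
  # List input: the list names are the columns to split
  out <- cSplit(indt, names(splitCols), sep = sep, ...)
  for (col in names(splitCols)) {
    # Columns generated from this source column, e.g. y_1, y_2
    old <- grep(paste0("^", col, "_[0-9]+$"), names(out), value = TRUE)
    if (length(old) != length(splitCols[[col]])) {
      stop(sprintf("'%s' produced %d columns but %d names were supplied",
                   col, length(old), length(splitCols[[col]])))
    }
    setnames(out, old = old, new = splitCols[[col]])
  }
  out[]
}

df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))
res <- cSplit_v2(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ",")
names(res)
# "x" "A" "B" "AA" "BB" "CC"
```

The length check turns the "unknown number of result columns" concern into an explicit error instead of a silent mismatch.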
Let me think about it some more, but I'm open to other ideas as well as I'm currently planning a V2 release of the package later this year.
> If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))
This pretty much solves it, at least for me. Looking forward to V2.