IRanges
Making the CompressedList more widely usable
I've been playing around with the CompressedList subclasses for representing some complex data types and I've really come to like them. I've been thinking of ways to make them more generally usable by both end-users and other developers, and I've got a few wish-list items:
Access to unlistData and partitioning. End-users would then be able to execute arbitrary unary operations on the underlying data while preserving the partitioning, like:
library(IRanges)
A <- DataFrame(X=LETTERS, Y=runif(26))
comp.list <- split(A, A$X)
# Attempt fails, for obvious reasons.
comp.list$Y <- log(comp.list$Y)
# Assuming we had an unlistData() method:
unlistData(comp.list)$Y <- log(unlistData(comp.list)$Y)
unlistData<- could even be unlist<-, if one were willing to introduce that concept. I don't mind if partitioning is getter-only; this would still be very useful for downstream functions that need to be list-aware yet don't want to create an intermediate list for efficiency purposes.
Non-virtual CompressedList class. I don't understand the motivation for making CompressedList virtual. From a representation perspective, a general concrete class would be useful if we could store any vector-like entity in unlistData. In fact, I ran into the case where I wanted to store a CompressedCharacterList as unlistData, effectively making a CompressedCompressedCharacterListList! I don't expect to be able to call many methods on this thing - other than the proposed unlistData and partitioning, and maybe unlist - I just want to use it for storage without needing to write an explicit subclass. A general CompressedList class would serve this purpose, and is better than the alternative of falling back to a SimpleList (which takes a noticeable time to generate).
A more careful unlist. If we do allow a general CompressedList class, the unlist method should probably take heed of recursive=TRUE and apply unlist on the unlistData slot.
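For reference, this is how recursive= already behaves for ordinary nested lists in base R. It is only an analogy for the proposed behaviour, not a statement about what any CompressedList method currently does; the data and the names nested, shallow and deep are made up for illustration.

```r
# A list of lists, standing in for a nested list-like object.
nested <- list(g1 = list(a = 1:2, b = 3L), g2 = list(c = 4:5))

# recursive=FALSE removes only the outer level: the result is still a list.
shallow <- unlist(nested, recursive = FALSE)

# recursive=TRUE (the default) flattens all the way down to an atomic vector.
deep <- unlist(nested, recursive = TRUE)
```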
I'm happy to chip in with a PR if these sound like good ideas.
An unlist<-() has been suggested in the past (e.g., by @mtmorgan). It's probably worth having, but up until now we have managed by just adding methods for functions like log() and using relist() directly. We would welcome a pull request for unlist<-(), but it would also be nice to have log() and related methods for NumericList.
Making CompressedList non-virtual (and requiring Vector for @unlistData) is an interesting idea. A separate pull request is welcome, if only to spur discussion. Agree that it should consider recursive=.
Note that relist() already does what the proposed unlist<- would do on a CompressedList. So IIUC basically the unlist<- proposal would be to replace the well-established idiom:
relist(as.character(unlist(x)), x)
with
unlist(x) <- as.character(unlist(x))
Personally I prefer to stick to relist() for several reasons:
- It's a base R verb that everybody is already familiar with.
- The relist(as.character(unlist(x)), x) idiom is more readable (but that's just my opinion).
- It's also a powerful idiom that can be used on list-like objects in general (including ordinary lists), not just on CompressedList objects (unless the proposal is to generalize unlist<- to all list-like objects, but that's what relist() does already).
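As a small illustration of the last point, the same idiom runs on an ordinary list with nothing beyond a default R session (relist() lives in the utils package, which is attached by default). The data here is invented for the example.

```r
# An ordinary named list of numeric vectors (illustrative data).
x <- list(a = c(1, 10), b = 100, c = c(1000, 10000))

# Flatten, transform every element in one vectorized call,
# then restore the original grouping using x as the skeleton.
y <- relist(log10(unlist(x)), x)
```

The result has the same names and element lengths as x; only the values change.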
I do like the symmetry, simplicity and safety of the unlist<-() syntax. relist() is in base, but it's fairly obscure. If we move forward, we should definitely make unlist<-() work on all types of lists.
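A minimal sketch of what that could look like for ordinary lists, assuming the replacement function is nothing more than a thin wrapper around relist(). The `unlist<-` defined here is hypothetical: it does not exist in base R or IRanges, and a real version would presumably be an S4 generic with methods for List classes.

```r
# Hypothetical replacement function: put 'value' back into the shape of 'x'.
`unlist<-` <- function(x, value) {
    # Refuse replacement values that cannot fill the skeleton exactly.
    stopifnot(length(value) == length(unlist(x)))
    relist(unname(value), x)
}

x <- list(a = c("p", "q"), b = "r")

# R rewrites this as: x <- `unlist<-`(x, value = toupper(unlist(x)))
unlist(x) <- toupper(unlist(x))
```

The length check is one way to get the "safety" mentioned above; whether recycling should be allowed instead is a design choice this sketch does not try to settle.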
Indeed, I didn't even know about relist until a few days ago when I was poking around inside IRanges.
My motivation for unlist<- is mostly driven by use with CompressedSplitDataFrameLists, where it provides a simple mechanism for switching between List-mode and DataFrame-mode.
library(IRanges)
X <- DataFrame(statistic=runif(100), more_stats=rnorm(100))
Y <- split(X, sample(LETTERS, 100, replace=TRUE))
# List-style getter/setter:
Y$statistic <- Y$statistic * 2
# Hypothetical DataFrame-style getter/setter:
unlist(Y)$statistic <- log2(unlist(Y)$statistic)
The relist syntax would require an explicit intermediate DataFrame. I guess you could argue that this is clearer, but it's inconvenient to have to split it across three lines (especially in interactive sessions).
Z <- unlist(Y)
Z$statistic <- log2(Z$statistic)
Y <- relist(Z, Y)
A recursive unlist<- also provides an approach for reaching deep into nested CompressedList objects, if it were possible to store a CompressedList as the unlistData of another CompressedList:
basic <- sample(LETTERS, 100, replace=TRUE)
nest1 <- CharacterList(split(basic, sample(10, length(basic), replace=TRUE)))
# Create a list of CompressedList instances (if CompressedList() existed)
nest2 <- CompressedList(split(nest1, sample(3, length(nest1), replace=TRUE)))
# recursive=TRUE as default
unlist(nest2) <- paste0("WHEE", unlist(nest2))