
Making the CompressedList more widely usable

Open LTLA opened this issue 5 years ago • 4 comments

I've been playing around with the CompressedList subclasses for representing some complex data types and I've really come to like it. I've been thinking of ways to make it more generally usable by both end-users and other developers, and I've got a few wish-list elements:

Access to unlistData and partitioning. End-users would then be able to execute arbitrary unary operations on the underlying data while preserving the partitioning, like:

library(IRanges)  # also attaches S4Vectors, which provides DataFrame()
A <- DataFrame(X=LETTERS, Y=runif(26))
comp.list <- split(A,A$X)

# Attempt fails, for obvious reasons.
comp.list$Y <- log(comp.list$Y)

# Assuming we had an unlistData() method:
unlistData(comp.list)$Y <- log(unlistData(comp.list)$Y)

unlistData<- could even be unlist<-, if one were willing to introduce that concept. I don't mind if partitioning is getter-only; this would still be very useful for downstream functions that need to be list-aware yet don't want to create an intermediate list for efficiency purposes.

Non-virtual CompressedList class. I don't understand the motivation for making CompressedList virtual. From a representation perspective, a general concrete class would be useful if we could store any vector-like entity in unlistData. In fact, I ran into the case where I wanted to store a CompressedCharacterList as unlistData, effectively making a CompressedCompressedCharacterListList! I don't expect to be able to call many methods on this thing - other than the proposed unlistData and partitioning, and maybe unlist - I just want to use it for storage without needing to write an explicit subclass. A general CompressedList class would serve this purpose, and is better than the alternative of falling back to a SimpleList (which takes a noticeable time to generate).

A more careful unlist. If we do allow a general CompressedList class, the unlist method should probably take heed of recursive=TRUE and apply unlist on the unlistData slot.
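For reference, this is how recursive= behaves for base R's unlist() on ordinary nested lists. It is only a sketch of the semantics that a general CompressedList method would presumably mirror; the object names are illustrative and nothing here uses IRanges.

```r
# An ordinary two-level list (names are illustrative).
nested <- list(a = list(p = 1:2, q = 3L), b = list(r = 4:5))

# recursive = TRUE (the default) flattens all the way down to an atomic vector.
flat_all <- unlist(nested)

# recursive = FALSE removes only the outermost level, so the result is still a list.
flat_one <- unlist(nested, recursive = FALSE)

str(flat_all)
str(flat_one)
```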

I'm happy to chip in with a PR if these sound like good ideas.

LTLA, Apr 02 '19 23:04

An unlist<-() has been suggested in the past (e.g., by @mtmorgan). It's probably worth having, but up until now we have managed by just adding methods for functions like log() and using relist() directly. We would welcome a pull request for unlist<-(), but it would also be nice to have log() and related methods for NumericList.

Making CompressedList non-virtual (and requiring Vector for @unlistData) is an interesting idea. A separate pull request is welcome, if only to spur discussion. Agree that it should consider recursive=.

lawremi, Apr 06 '19 01:04

Note that relist() already does what the proposed unlist<- would do on a CompressedList. So IIUC the unlist<- proposal would basically be to replace the well-established idiom:

relist(as.character(unlist(x)), x)

with

unlist(x) <- as.character(unlist(x))

Personally I prefer to stick to relist() for several reasons:

  • It's a base R verb that everybody is already familiar with.
  • The relist(as.character(unlist(x)), x) idiom is more readable (but that's just my opinion).
  • It's also a powerful idiom that can be used on list-like objects in general (including ordinary lists), not just on CompressedList objects (unless the proposal is to generalize unlist<- to all list-like objects, but that's what relist() does already).
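To make the last point concrete, here is a minimal sketch of the same idiom on an ordinary list, using only base R (relist() lives in the utils package, which is attached by default). The list contents are made up for illustration.

```r
# An ordinary list of character vectors (contents are illustrative).
x <- list(a = c("1", "2"), b = "3")

# Flatten, transform the flat vector, then restore the original shape.
y <- relist(as.integer(unlist(x)), x)

str(y)
```

The skeleton (second argument) supplies the shape and names; the first argument supplies the new values.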

hpages, Apr 06 '19 21:04

I do like the symmetry, simplicity and safety of the unlist<-() syntax. relist() is in base, but it's fairly obscure. If we move forward, we should definitely make unlist<-() work on all types of lists.
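As a rough sketch of what that could look like for ordinary lists, the replacement function below simply wraps relist(). It is hypothetical: neither base R nor IRanges defines `unlist<-`, and the length check and the dropping of names are assumptions about what "safety" might mean here.

```r
# Hypothetical replacement function; not part of base R or IRanges.
`unlist<-` <- function(x, value) {
  # Require exactly one replacement value per leaf element of x.
  stopifnot(length(value) == length(unlist(x)))
  # Drop the flat names so the restored elements keep only the skeleton's names.
  relist(unname(value), x)
}

scores <- list(a = c(1, 10), b = 100)

# Reads symmetrically with the getter: modify the flat view, keep the shape.
unlist(scores) <- log10(unlist(scores))

str(scores)
```

R rewrites the assignment as scores <- `unlist<-`(scores, value = ...), so the getter on the right-hand side is still the ordinary unlist().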

lawremi, Apr 06 '19 22:04

Indeed, I didn't even know about relist until a few days ago when I was poking around inside IRanges.

My motivation for unlist<- is mostly driven by use with CompressedSplitDataFrameLists, where it provides a simple mechanism for switching between List-mode and DataFrame-mode.

library(IRanges)
X <- DataFrame(statistic=runif(100), more_stats=rnorm(100))
Y <- split(X, sample(LETTERS, 100, replace=TRUE))

# List-style getter/setter:
Y$statistic <- Y$statistic * 2

# Hypothetical DataFrame-style getter/setter:
unlist(Y)$statistic <- log2(unlist(Y)$statistic)

The relist syntax would require an explicit intermediate DataFrame. I guess you could argue that this is clearer, but it's inconvenient to have to split it across three lines (especially in interactive sessions).

Z <- unlist(Y)
Z$statistic <- log2(Z$statistic)
Y <- relist(Z, Y)

A recursive unlist<- also provides an approach for reaching deep into nested CompressedList objects, if it were possible to store a CompressedList as the unlistData of another CompressedList:

basic <- sample(LETTERS, 100, replace=TRUE)
nest1 <- CharacterList(split(basic, sample(10, length(basic), replace=TRUE)))

# Create a list of CompressedList instances (if CompressedList() existed)
nest2 <- CompressedList(split(nest1, sample(3, length(nest1), replace=TRUE)))

# recursive=TRUE as default
unlist(nest2) <- paste0("WHEE", unlist(nest2))
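For comparison, base R's relist() already reaches through nested ordinary lists, because it recurses over the skeleton. The sketch below uses plain lists only (no CompressedList involved) and made-up names; whether a nested CompressedList would behave the same way is exactly the open question above.

```r
# A two-level ordinary list standing in for the nested structure.
nest <- list(
  g1 = list(a = c("A", "B"), b = "C"),
  g2 = list(c = c("D", "E"))
)

# unlist() flattens both levels; relist() rebuilds both levels from the skeleton.
out <- relist(paste0("WHEE", unlist(nest)), nest)

str(out)
```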

LTLA, Apr 06 '19 23:04