Reshaping Array

muschellij2 opened this issue 6 years ago • 18 comments

First, thanks for the package. It's exactly what we needed. We're working on neuroimaging data: https://github.com/muschellij2/niftiArray. We're trying to do statistics over multiple 3D arrays. I believe this doesn't fit well with DelayedMatrixStats as matrixStats requires 2D matrices.

Similar to https://github.com/Bioconductor/DelayedArray/issues/30 and the comment here https://github.com/Bioconductor/DelayedArray/issues/30#issuecomment-423683637, I was wondering about "reshaping" an array. Really, I want to be able to vectorize a DelayedArray, or convert a 4D DelayedArray to a 2D matrix where the first 3 dimensions become the rows of the matrix and the 4th dimension becomes the columns. My current solution is to realize the array, convert it to a matrix, and then create a DelayedArray or DelayedMatrix from that.
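
Roughly, that in-memory workaround looks like this (a sketch only; x stands for the 4D DelayedArray):

library(DelayedArray)
a <- as.array(x)                           # realizes the full array in memory
dim(a) <- c(prod(dim(a)[1:3]), dim(a)[4])  # first 3 dims become the rows, 4th becomes the columns
m <- DelayedArray(a)                       # wrap as a DelayedMatrix with an in-memory seed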

Ideally, I'd be able to do this without realizing the array into memory. I'm not sure if HDF5 (the backend we're using) would support this, but we would find it useful. Trying to set dim or change the dimensions fails due to the (correct) checks on DelayedMatrix. Just wondering whether this is possible, whether it makes sense with HDF5, and, if both are true, whether it could be implemented. I can send a reprex if necessary, but the setting of dimensions is well documented.

muschellij2 avatar Jun 27 '19 21:06 muschellij2

I put together an MWE at https://github.com/Bioconductor/HDF5Array/issues/16, which presents a solution, but not a satisfactory one.

muschellij2 avatar Jul 02 '19 00:07 muschellij2

Hi,

I've implemented the ReshapedHDF5Array class in HDF5Array 1.13.3 (HDF5Array 1.13.3 belongs to BioC 3.10, which is the current development version of Bioconductor).

The ReshapedHDF5Array class is a DelayedArray subclass for representing an HDF5 dataset with a user-supplied upfront virtual reshaping. See ?ReshapedHDF5Array for more information. Hopefully this will help with your use case.
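
For example, a small self-contained sketch (the file and dataset names here are just placeholders):

library(HDF5Array)
a <- array(runif(10 * 3 * 5 * 100), dim = c(10, 3, 5, 100))
writeHDF5Array(a, "toy.h5", "a")                          # realize the toy array to an HDF5 file
A <- ReshapedHDF5Array("toy.h5", "a", dim = c(150, 100))  # virtually collapse the first 3 dims
dim(A)                                                    # 150 100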

FWIW I initially tried to implement the reshaping as a delayed operation. This would be a better solution because it would be backend-agnostic and the reshaping could then be applied at any time, i.e. even after the DelayedArray object has gone thru some delayed operations like subsetting, cbind(), aperm(), etc... However I quickly realized that this was going to be a lot more complicated than what I anticipated. So I opted for the ReshapedHDF5Array solution. It's simpler to implement but, unfortunately, not as satisfying as the delayed op approach.

H.

hpages avatar Jul 02 '19 11:07 hpages

I can't get it to install. I tried, but https://github.com/Bioconductor/HDF5Array/blob/master/DESCRIPTION#L18 requires DelayedArray >= 0.11.4, while the master branch of DelayedArray is at 0.11.3: https://github.com/Bioconductor/DelayedArray/blob/master/DESCRIPTION#L11. I tried Bioc devel, but it still gives an older version.

BiocManager::install("HDF5Array", version = "devel", ask = FALSE)
#> Bioconductor version 3.10 (BiocManager 1.30.4), R 3.6.0 (2019-04-26)
#> Installing package(s) 'HDF5Array'
#> 
#> The downloaded binary packages are in
#>  /var/folders/1s/wrtqcpxn685_zk570bnx9_rr0000gr/T//Rtmp5mQ9Od/downloaded_packages
packageVersion("HDF5Array")
#> [1] '1.13.2'

Created on 2019-07-02 by the reprex package (v0.3.0)

I'm just getting used to the Bioconductor workflow, so I'm likely missing the repo where I could find the updated DelayedArray.

muschellij2 avatar Jul 02 '19 14:07 muschellij2

Oops, my bad sorry. I meant to have this latest version of HDF5Array (1.13.3) depend on DelayedArray >= 0.11.3, not on DelayedArray >= 0.11.4. Sorry for that.

(FWIW I have a local version of DelayedArray on my laptop that is at version 0.11.4, with some uncommitted changes to it, hence my confusion.)

I just adjusted the requirements in HDF5Array: https://github.com/Bioconductor/HDF5Array/commit/f2f28bc58a3dbe69c409bf4f2f6275bdf8bda38e

Please try again.

BTW only a few minutes after pushing the new ReshapedHDF5Array class to HDF5Array, I realized that it wouldn't actually be so hard to implement the reshaping as a delayed operation. I'd like to do it at some point. Then the ReshapedHDF5Array stuff will no longer be needed so maybe I'll remove it.

hpages avatar Jul 02 '19 22:07 hpages

That's great. I will hold off on adding any capabilities on my end then. John

muschellij2 avatar Jul 02 '19 23:07 muschellij2

Also, another issue I found: to reshape an array into a matrix with 1 column, I needed to add a 1 to the dimensions of the data first, e.g.

dim(from) = c(dim(from), 1)

For example, if dim(from) = c(10, 5, 4) and I want dim(from) = c(200, 1), I first have to make dim(from) = c(10, 5, 4, 1).

muschellij2 avatar Jul 03 '19 17:07 muschellij2

This 2-step reshaping is a consequence of a restriction on the kind of reshaping supported by the ReshapedHDF5Array() constructor. The restriction is documented in ?ReshapedHDF5Array:

Also please note that arbitrary reshapings are not supported.
Only reshapings that reduce the number of dimensions by collapsing
a group of consecutive dimensions into a single dimension are supported.
For example, reshaping a 10 x 3 x 5 x 1000 array as a 10 x 15 x 1000
array or as a 150 x 1000 matrix is supported.

The important bit here is "Only reshapings that reduce the number of dimensions". So adding dimensions would have to be done as a separate operation (delayed) on the object returned by the ReshapedHDF5Array() constructor.

Note that the 2-step reshaping will go away once reshaping is implemented as a delayed operation.
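
For reference, the current 2-step workaround for the 10 x 5 x 4 example above would look roughly like this (a sketch; the file path and dataset name are placeholders):

from <- array(rnorm(10 * 5 * 4), dim = c(10, 5, 4))
dim(from) <- c(dim(from), 1)                              # 10 x 5 x 4 x 1: add the trailing 1 up front
writeHDF5Array(from, "img.h5", "img")                     # realize to HDF5 with the extra dimension
A <- ReshapedHDF5Array("img.h5", "img", dim = c(200, 1))  # collapse the first 3 dims, keep the trailing 1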

BTW I'm curious about what benefit you're hoping for by fully unfolding a 3-dimensional dataset into a 1-dimensional DelayedArray object or into an Nx1 DelayedMatrix object. In particular, I can't think of any DelayedMatrixStats operation that would make sense on this Nx1 matrix.

hpages avatar Jul 03 '19 18:07 hpages

Good point, and also a good question. The setup: we have a population of 3D brain images (let's say 1000), and we want to compute an average brain/template. We take the 1000 brains in imaging format and put each into a NiftiArray (aka an HDF5Array with additional info). Then we take the 1000 NiftiArray objects, reshape each to an Nx1 image, cbind them together (using acbind), run DelayedMatrixStats::rowMeans2 or DelayedMatrixStats::rowMedians to get a vector, reshape that back into an array (this can be done in memory), create the population-level image, and write it out in imaging format.

The current memory issue is that one image can be 512x512x512, so N = 134217728, which makes Nx1000 memory prohibitive.
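
A rough sketch of that pipeline (h5_files, the "image" dataset name, and the dimensions are placeholders; it assumes each image was written to HDF5 with a trailing dimension of 1 as discussed above):

library(HDF5Array)
library(DelayedMatrixStats)
N <- 512 * 512 * 512
mats <- lapply(h5_files, function(f)               # h5_files: hypothetical vector of HDF5 file paths
    ReshapedHDF5Array(f, "image", dim = c(N, 1)))  # each image becomes an N x 1 DelayedMatrix
big <- do.call(DelayedArray::acbind, mats)         # N x 1000 DelayedMatrix
avg <- rowMeans2(big)                              # ordinary numeric vector of length N
dim(avg) <- c(512, 512, 512)                       # reshape back to a 3D array in memory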

John

muschellij2 avatar Jul 03 '19 18:07 muschellij2

I see. Interesting use case. Note that if the 1000 3D datasets were stored as a single 4D dataset (512x512x512x1000) then things would be much easier. It would just be a matter of creating a ReshapedHDF5Matrix object with something like ReshapedHDF5Array(filepath, name, dim=c(512*512*512, 1000)) and feeding that to DelayedMatrixStats::rowMeans2() or DelayedMatrixStats::rowMedians(). This would avoid the overhead of the 1000 reshapings + final cbind'ing of 1000 objects that you need to do with separate files. I suspect that this overhead is going to introduce some significant slowdown.

H.

hpages avatar Jul 03 '19 19:07 hpages

I can bind them together (if from is a list of arrays, where ndims = 3):

from = lapply(from, function(x) {
    dim(x) = c(dim(x), 1)
    x = DelayedArray::aperm(x, perm = (ndims + 1):1)
    x
})
res = do.call(DelayedArray::arbind, args = from)
res = aperm(res, (ndims + 1):1)

which will create a 512x512x512x1000 DelayedArray.

John

muschellij2 avatar Jul 03 '19 19:07 muschellij2

Something that maybe would help is the ability to "stack" the 1000 3D datasets into a 4D dataset. That would need to be a new array operation so we would need to come up with a new verb for it (I don't think there is any equivalent in base R for doing this on ordinary arrays). This "stacking" is something I actually had in mind for a while but we didn't really have a use case for it (until now) so I lacked the incentive for implementing it. I don't anticipate it would be too hard to implement this "stacking" as a delayed operation though. One easy way would be to add the extra dimension to all the objects to stack (with dim(x) <- c(dim(x), 1)) and then to abind() them together along the last dimension.

So for your use case you could stack the 1000 3D datasets into a single 4D dataset (512x512x512x1000), then reshape into an Nx1000 DelayedMatrix (once this reshaping is available as a delayed operation). I believe this could be a little bit more efficient than the ReshapedHDF5Array approach. Not sure, would be interesting to compare.

hpages avatar Jul 03 '19 19:07 hpages

I'm just seeing your last post now. Yes, nice trick to use a combination of aperm/arbind to achieve binding along an arbitrary dimension. That's what my proposed abind() above would do (except that it would use a native implementation so would avoid the overhead of permuting the dimensions back and forth).

hpages avatar Jul 03 '19 20:07 hpages

Also, try to use intermediate realizations to circumvent some of the current limitations. For example after you've stacked the 1000 datasets together, you won't be able to reshape the result into an Nx1000 matrix (because the delayed reshaping is not supported yet) but you can realize it to a new HDF5 file. Then you can use ReshapedHDF5Array() on that new HDF5 file to reshape into a matrix. You will also pay much less overhead now when you call DelayedMatrixStats functions on this object because it doesn't carry any delayed op (except for the reshaping which can be seen as a seed built-in delayed op).
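
In code, the intermediate realization might look something like this (a sketch; stacked stands for the 512x512x512x1000 DelayedArray built by stacking, and the file/dataset names are placeholders):

library(HDF5Array)
writeHDF5Array(stacked, "stacked.h5", "image")          # realize the stacked 4D dataset once
M <- ReshapedHDF5Array("stacked.h5", "image",
                       dim = c(512 * 512 * 512, 1000))  # reshape the realized file into an Nx1000 matrix
avg <- DelayedMatrixStats::rowMeans2(M)                 # no other delayed ops left to pay for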

hpages avatar Jul 03 '19 20:07 hpages

I'll take a look at it after the weekend. We don't need it exactly right now but that is the use case we're going to use to convince our imaging R partners to get on board with the *Array family. Also, the random access and memory footprint stuff allow us to push into the Shiny domain without insane lagging for the most part. We're going to write something up in the coming weeks, and would be more than happy to include you in the effort if that is of interest.

muschellij2 avatar Jul 03 '19 20:07 muschellij2

Sounds like an interesting project.

Thanks for the offer. I would love to get more involved but I'm overbooked with too many other projects going on at the moment. Also I'll be on vacation from July 10 to July 28 so won't be able to do any significant work on the "delayed reshaping" before August.

One more thing about using an intermediate realization of the big 512x512x512x1000 dataset: in addition to the benefits I mentioned above, this also gives you the opportunity to tweak a few settings like the chunk geometry, level of compression, etc... These physical parameters can have an impact later depending on how you will typically operate on the dataset. For example if the typical access pattern is going to be doing row summarizations on the Nx1000 ReshapedHDF5Matrix object, then choosing chunks that contain full rows might be sensible. This could be achieved with 512x1x1x1000 chunks or any chunk geometry of the form XxYxZx1000. Note that chunks should not be too big (<= 1M array elements) and not too small (>= 10k array elements). Finding the sweet spot might require experimenting a little bit (the exact sweet spot will also depend on the level of compression).

These settings can be controlled thru writeHDF5Array's arguments. See ?writeHDF5Array for the details.
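
For example (a sketch; the values are only illustrative, following the row-oriented chunking idea above, and stacked again stands for the big 4D object):

writeHDF5Array(stacked, "stacked.h5", "image",
               chunkdim = c(512, 1, 1, 1000),  # each chunk covers full rows of the future Nx1000 matrix
               level = 6)                      # gzip compression level (0-9)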

hpages avatar Jul 03 '19 20:07 hpages

Got it. So should I rely on ReshapedHDF5Matrix and ReshapedHDF5Array remaining as classes for the foreseeable future, or should I wait until the reshaping is available as a delayed op? John

muschellij2 avatar Jul 03 '19 21:07 muschellij2

No reason to wait. When reshaping becomes available as a delayed op, all you'll need to do is replace:

A <- ReshapedHDF5Array(filepath, name, dim=c(512*512*512, 1000))

with

A <- HDF5Array(filepath, name)
dim(A) <- c(512*512*512, 1000)

so it'll be a really simple change.

Plus, I'll wait until you get a chance to make these changes before I remove the ReshapedHDF5Array stuff. If I don't manage to implement the delayed reshaping before the next Bioconductor release (in fall), then the ReshapedHDF5Array stuff will have to stick around for at least 6 more months (it will have to go thru a 6-month deprecation cycle before I can remove it).

However, the lack of delayed reshaping would be a blocking issue for you if, for some reason, you didn't want to realize the big intermediate 4D dataset.

hpages avatar Jul 03 '19 21:07 hpages

Another request/use case for delayed reshaping here: https://support.bioconductor.org/p/9136602/#9136860 They're also asking if delayed reshaping could increase the number of dimensions, e.g. go from 2 x 12 to 2 x 3 x 4.

hpages avatar May 18 '21 13:05 hpages