ArchGDAL.jl
ArchGDAL.jl copied to clipboard
Support PermutedDimsArray for interactive datasets
Hi, I noticed the following:
img = AG.readraster("somefile");
typeof(img)
# ArchGDAL.RasterDataset{Int16, ArchGDAL.IDataset}
@time collect(img) # fast, 0.6s
imgp = permutedims(img, (3,1,2));
typeof(imgp)
# DiskArrays.PermutedDiskArray{Int16, 3, PermutedDimsArray{Int16, 3, (3, 1, 2), (2, 3, 1), ArchGDAL.RasterDataset{Int16, ArchGDAL.IDataset}}}
# So underneath it's already a PermudedDimsArray somehow
@time collect(imgp) # fast 0.7s
imgpda = PermutedDimsArray(img, (3,1,2));
typeof(imgpda)
# PermutedDimsArray{Int16, 3, (3, 1, 2), (2, 3, 1), ArchGDAL.RasterDataset{Int16, ArchGDAL.IDataset}}
@time collect(imgpda) # super slow, 43s
So I assume the PermutedDimsArray translates to super inefficient disk access. As ArchGDAL already handles the interactive dataset lazily with permutedims I propose the super simple fix:
Base.PermutedDimsArray(A::RasterDataset{Any, IDataset}, perm) = permutedims(A, perm)
Hmm, interesting.
Perhaps it's best to raise this issue with DiskArrays? cc @meggart
Over there permutedims is forwarded to PermutedDiskArray:
https://github.com/meggart/DiskArrays.jl/blob/7fa75d4a0eb216d7e25a6718b223b8533d0b92c4/src/permute_reshape.jl#L64-L67
Probably Base.PermutedDimsArray should be forwarded to the same (like your proposed fix, but in DiskArrays).
Probably Base.PermutedDimsArray should be forwarded to the same (like your proposed fix, but in DiskArrays).
This proposed fix does not work because the PermutedDiskArray already wraps a PermuteDimsArray, and just gives it disk-efficient access patterns, so there is something cyclic when you try to construct such a thing. One could solve this by adding some invoke calls, but in general I think it would also be strange if a constructor of PermutedDimsArray would return an object of a PermutedDiskArray.
What would speak against directly calling permutedims, which gives you the desired performance?
I really don't see another solution to this problem. If you wrap your array into a PermutedDiskArray, which is defined in Base, this is what you get and if this type assumes efficient random access to the underlying data it will be slow.
What would speak against directly calling
permutedims, which gives you the desired performance?
Not much. I just would have expected that it's equally fast. This finding resulted from trying to avoid eager loading from disk. Constructing a PermutedDimsArray is the canonical way to do things lazily (just like view for getindex), so I tried that first and failed. I thought maybe others might fall into that trap too, if they write "lazy" code. I leave it up to you to close this (and the other issue), because I have no further ideas how to handle this; returning a PermutedDiskArray sounds strange indeed.