Arrays with Missing Hunks
Is it feasible that DArray could support distributed arrays in which pieces are missing?
For instance, say I have the following:
X X X
X X X X
X X
where each X represents a dense portion of the array and each blank represents a portion that can be modeled as a single constant value throughout.
In my mind, DArray has some sort of address table and, when it does a lookup, it could note the gap and return an appropriately formatted response using the constant value.
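Something like this toy sketch is what I'm picturing (ChunkTable and lookup_chunk are made-up names for illustration, not anything from DistributedArrays):
# Purely illustrative: a chunk table where gaps are stored as `nothing`
# and a lookup substitutes the constant value for the missing hunk.
struct ChunkTable{T}
    chunks::Matrix{Union{Matrix{T},Nothing}}   # `nothing` marks a missing hunk
    chunksize::Tuple{Int,Int}                  # size of every hunk
    fillvalue::T                               # value the blanks are modeled as
end

function lookup_chunk(t::ChunkTable, i::Integer, j::Integer)
    c = t.chunks[i, j]
    # a real implementation would return a lazy constant array here instead of allocating
    c === nothing ? fill(t.fillvalue, t.chunksize) : c
end

chunks = Matrix{Union{Matrix{Float64},Nothing}}(nothing, 3, 4)   # mostly gaps
chunks[1, 1] = rand(4, 4)                                        # one dense hunk
t = ChunkTable(chunks, (4, 4), 0.0)
lookup_chunk(t, 2, 3)   # a 4x4 block of the constant 0.0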
Hm yes, that seems feasible. Sounds like you would want to use something like https://github.com/JuliaArrays/FillArrays.jl; the only problem is that we don't support heterogeneous chunk types. #170, as an example, is actually about ensuring that the chunk types are consistent by construction.
It might be feasible to promote to a joint super-type, but that might break other things like similar. So yes, feasible in theory, but not yet in practice.
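To make the similar concern concrete (a quick REPL check; as far as I know FillArrays just inherits the generic AbstractArray fallback here):
using FillArrays
z = Zeros(4, 4)        # lazy, allocation-free constant chunk
typeof(similar(z))     # Matrix{Float64}: the constant structure does not survive similar,
                       # so code paths that allocate via similar would hand back dense chunks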
Just for reference: for the data I'm currently working with it's the difference between 700 GB and 2 TB of RAM.
FillArrays looks pretty perfect. I wonder if you have suggestions for places I could look in the codebase if I wanted to try hacking on this?
#170 modifies exactly the core constructor that you would want to modify. Once you have a DArray constructed the way you want, the challenge becomes making sure that all the operations you are interested in still work.
I'll take a dig into it.
There might be a cheap-hack approach, though: having multiple sections of the DArray reference the same underlying memory:
using Distributed                # assumes workers 2, 3 and 4 have already been added, e.g. with addprocs(3)
@everywhere using DistributedArrays
r1 = @spawnat 2 zeros(4,4)
r2 = @spawnat 3 zeros(4,4)
r3 = @spawnat 4 rand(4,4)
ras = [r1 r2; r3 r3]             # r3 appears twice, so two blocks of the DArray share one chunk
D = DArray(ras)
Unfortunately, my experiments in #183 make me nervous about this.
Hm yes, I suspect there are quite a few functions written with the assumption that each chunk lives on a different processor, even though that clearly doesn't need to be assumed.
The following seems to work for me:
using Distributed                # assumes workers 2 through 5 have already been added, e.g. with addprocs(4)
@everywhere using DistributedArrays
@everywhere using FillArrays
r1 = @spawnat 2 FillArrays.Zeros(4,4)
r2 = @spawnat 3 FillArrays.Zeros(4,4)
r3 = @spawnat 4 rand(4,4)
r5 = @spawnat 5 rand(4,4)
ras = [r1 r3; r5 r2]
D = DArray(ras)
[@fetchfrom p typeof(D[:L]) for p in workers()]   # container type of each worker's localpart
[@fetchfrom p eltype(D[:L]) for p in workers()]   # element type of each worker's localpart
And this can be done even with #170 in place.
Hm that fails for me on current master:
D = DArray(ras)
ERROR: MethodError: no method matching Zeros{Float64,2,Tuple{Int64,Int64}}(::Array{Float64,2})
Stacktrace:
[1] empty_localpart(::Type, ::Int64, ::Type) at /home/vchuravy/.julia/dev/DistributedArrays/src/darray.jl:66
[2] macro expansion at ./task.jl:264 [inlined]
[3] macro expansion at /home/vchuravy/.julia/dev/DistributedArrays/src/darray.jl:84 [inlined]
[4] macro expansion at ./task.jl:244 [inlined]
[5] DArray(::Tuple{Int64,Int64}, ::Array{Future,2}, ::Tuple{Int64,Int64}, ::Array{Int64,2}, ::Array{Tuple{UnitRange{Int64},UnitRange{Int64}},2}, ::Array{Array{Int64,1},1}) at /home/vchuravy/.julia/dev/DistributedArrays/src/darray.jl:82
[6] macro expansion at ./task.jl:266 [inlined]
[7] macro expansion at /home/vchuravy/.julia/dev/DistributedArrays/src/darray.jl:194 [inlined]
[8] macro expansion at ./task.jl:244 [inlined]
[9] DArray(::Array{Future,2}) at /home/vchuravy/.julia/dev/DistributedArrays/src/darray.jl:192
[10] top-level scope at none:0
Ah, it works on #175..., but we need a method to select the "right" array type. Right now it could happen that a different T is chosen as the primary eltype, and I think the process-local versions of this have diverged...
I'm running [aaf54ef3] DistributedArrays v0.5.1 right now.
(Is there a good development cycle for working from the master branch of a local repo?)
Isn't eltype the type of the data held by an array? It seems as though that could be enforced constant across the entire DArray, even if the containers holding the elements differ.
Is it possible to make the type more general to encompass the possibility of submatrices of different types?
(Is there a good development cycle for working from the master branch of a local repo?)
I generally use ]dev for that, which can also take a local path.
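For example, from the Pkg REPL (the path is just a placeholder for wherever your clone lives):
pkg> dev DistributedArrays                       # clones into ~/.julia/dev and tracks that checkout
pkg> dev /path/to/your/local/DistributedArrays   # or point it at an existing clone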
Isn't eltype the type of the data held by an array? It seems as though that could be enforced constant across the entire DArray, even if the containers holding the elements differ.
Yes, the eltype being consistent is even more important, and that is an invariant we need to uphold.
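Concretely, that invariant already holds in the mixed example above; the containers differ but the elements do not:
using FillArrays
eltype(Zeros(4, 4)) == eltype(rand(4, 4)) == Float64   # true: different containers, same eltype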
Is it possible to make the type more general to encompass the possibility of submatrices of different types?
There are two alternatives:
using FillArrays
a = Fill(1.0, 2, 2)
b = zeros(2, 2)
AT = Union{typeof(a), typeof(b)}
and then teach DArray to reason about this union (the immediate question would be what similar(AT) should return)
Or promote_type(typeof(a), typeof(b)), which is AbstractArray{Float64, 2}, so that could work quite well.
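For reference, continuing from the a and b above (assuming no promote_rule is defined between Fill and Array, so promote_type falls back to typejoin):
promote_type(typeof(a), typeof(b))   # AbstractArray{Float64, 2}, i.e. AbstractMatrix{Float64}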