ArrayInterface.jl

Boilerplate metrics

Open cscherrer opened this issue 2 years ago • 3 comments

As I understand it, the main goal of this package is to make it easy to express algorithms that can use either standard dispatch or generated functions, depending on what's known statically. Writing very general high-performance code for a given situation is always possible; the problem is that it can require a lot of boilerplate.

Would it be helpful to make this explicit? A given proposal could then be weighed by how much it reduces the boilerplate needed for high-performance code that works in both the static and dynamic cases. Maybe there could even be a small collection of running examples. Complete code isn't necessary for this; what matters is the dispatch patterns and how efficiently (in terms of human time) the right information can be passed to generated functions.

cscherrer avatar Jan 05 '22 15:01 cscherrer

I'll try to respond in more detail later today when I have time, but for now I'd say the biggest issue is that we are missing basic tools for memory management that would allow us to write even simple methods generically.

Tokazama avatar Jan 05 '22 15:01 Tokazama

I'm going to start with a simple method that should help illustrate some of the problems I've run into.


using BenchmarkTools

memory_reference(A) = _memory_reference(A, parent(A))
_memory_reference(::A, b::A) where {A} = b
_memory_reference(::A, b::B) where {A,B} = memory_reference(b)

@inline function fast_copy(src::AbstractArray{T,3}) where {T}
    buf = memory_reference(src)
    stride_1, stride_2, stride_3 = strides(src)
    size_1, size_2, size_3 = size(src)
    dest = Array{T}(undef, size(src))
    dest_index = 1
    i3 = 0
    @inbounds while i3 < (size_3 * stride_3)
        i2 = 0
        while i2 < (size_2 * stride_2)
            i1 = 0
            while i1 < (size_1 * stride_1)
                dest[dest_index] = buf[i1 + i2 + i3 + 1]
                dest_index += 1
                i1 += stride_1
            end
            i2 += stride_2
        end
        i3 += stride_3
    end
    return dest
end

x = PermutedDimsArray(rand(3, 3, 3), (3, 1, 2));


@btime copy($x);  # 99.267 ns (1 allocation: 272 bytes)
@btime fast_copy($x); # 48.823 ns (1 allocation: 272 bytes)
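
(For reference, memory_reference just unwraps parent recursively until it reaches an array that is its own parent, so in this example it resolves to the Array underneath the PermutedDimsArray.)

memory_reference(x) === parent(x)  # true: fast_copy indexes the parent's buffer directly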

Although fast_copy performs better than Base.copy here, we have a few problems:

  • src has to be a strided array.

    We could address the first problem by checking for strides and only optimizing the cases where ArrayInterface.defines_strides(src) == true (a small sketch of such an entry point is at the end of this item). The problem with this is that it only fixes the inefficiency for strided arrays, and for everything else we're back to explicit dispatch on every combination of array types:

    fast_copy(A::Symmetric{T,<:StridedArray{T}}) where {T} = ...
    fast_copy(A::Symmetric{T,<:Wrapper{T,<:StridedArray{T}}}) where {T} = ...
    

    One could argue that a finite number of unique parametric array combinations makes explicitly defining each method feasible. However, we just outperformed Base on a simple parametric array (PermutedDimsArray{T,3,Array{T,3}}), so I think it's unlikely that any method (let alone every method) will be optimized for every parametric array type combination, even if it's part of the standard library. I've been exploring solutions to this as an "index transformation" + "buffer" breakdown, but that implementation requires a good chunk of code to work. It's hard to motivate maintaining that unless we can hook it into a usable system (i.e., solve the other problems here).
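
    As a rough sketch of that trait-checking entry point (maybe_fast_copy is an invented name; this only assumes that ArrayInterface.defines_strides resolves from the array's type, and it reuses fast_copy from above):

    using ArrayInterface

    # Hypothetical wrapper: take the strided fast path only when strides are
    # known to be defined for this array type, otherwise fall back to Base.copy.
    function maybe_fast_copy(src::AbstractArray{T,3}) where {T}
        if ArrayInterface.defines_strides(src)
            return fast_copy(src)
        else
            return copy(src)
        end
    end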

  • The body of this method needs to be more composable so that we don't have to rewrite this whole mess for similarly structured methods and types (e.g., map, getindex).

    Loops are hard to break up into composable pieces. I'm not aware of anything that allows blocks of code between nested loops to cleanly communicate. fast_copy basically took the contents of a bunch of loops like this...

    for index_dim in axis_dim
        i_dim = (index_dim - first(axis_dim)) * stride_dim
        # other nested loops here
    end
    

    ...and consolidated each loop head (for index_dim in axis_dim) and body ((index_dim - first(axis_dim)) * stride_dim) into a new loop header (while i_dim < (size_dim * stride_dim)). We could always represent simple iterators as some sort of loop type and then compose new loops with index transformations:

    struct SimpleLoop{dim,I}  # `dim` is the dimension this loop corresponds to
        iterator::I
    end
    struct StrideTransform{O,S}
        offsets::O
        strides::S
    end
    Base.:∘(loop::SimpleLoop{dim}, tform::StrideTransform) where {dim} =
        (loop.iterator .- tform.offsets[dim]) .* tform.strides[dim]
    

    But this is basically just reconceptualizing the index transformation for a SubArray as a series of distinct loops. Can we continue to rely on this strategy to represent loops when multiple arrays with very different layouts are involved? I think we'd ultimately need to prioritize access through one array's layout over another, and at that point we might just be looking at instruction costs. That sort of thing should probably be optimized by the compiler, and if we try to optimize it manually we might actually lose some information the compiler could have used. Perhaps following the approach of Transducers.jl is best, and we should just provide a bunch of typed approaches for reduce, foldl, etc. It's hard to motivate adoption of a single approach right now, given this stuff is still very much in development for Julia's interpreter.
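
    For concreteness, here is one hypothetical way the SimpleLoop/StrideTransform sketch above might be exercised against the PermutedDimsArray x from the first example (the offsets (1, 1, 1) are just the 1-based first indices):

    loop  = SimpleLoop{1,UnitRange{Int}}(1:3)        # iterate dimension 1 of x
    tform = StrideTransform((1, 1, 1), strides(x))   # offsets and strides of x
    loop ∘ tform  # == 0:9:18, the 0-based offsets into the parent buffer along dim 1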

  • Memory management is difficult to fine-tune in Julia.

    chriselrod long ago explained why writing a new immutable tuple is performance-sensitive. Until we can reliably figure out how to write to a tuple ref and dereference it without taking a performance hit, we will have to write a unique generated function specifically for writing each element in place (a small sketch of that pattern is at the end of this item). We've tried to figure out how to get alloca working so that we could iteratively write to a pointer, but I don't think anyone has figured out a solution.

    There's also distributed computing and multithreading, but these seem like they belong more as annotations on for loops that can be further informed by the compiler's knowledge of hardware.
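
    Returning to the tuple-writing point above, here is a minimal, hypothetical sketch of the kind of generated function this currently forces us to write (tuple_setindex is an invented name): it rebuilds an NTuple with one element replaced, fully unrolled so the "write" compiles away instead of allocating or hitting dynamic indexing.

    @generated function tuple_setindex(t::NTuple{N,T}, v::T, ::Val{I}) where {N,T,I}
        elems = [i == I ? :v : :(t[$i]) for i in 1:N]
        return :(($(elems...),))
    end

    tuple_setindex((1, 2, 3), 99, Val(2))  # (1, 99, 3)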

We really need all of these components in place to provide concise solutions to most problems. That doesn't mean we can't do some really cool stuff with what's here right now. It's not all that complicated to implement a version of fast_copy for a deeply nested static array that returns another static array; it just requires a relatively verbose generated function.
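
As a rough illustration of that last point, here is a hypothetical sketch of such a generated function (assuming StaticArrays, handling only a single PermutedDimsArray wrapper around an SArray for brevity; static_fast_copy is an invented name). Every load is unrolled at generation time and the result is assembled as a tuple, so the output is another static array.

using StaticArrays

@generated function static_fast_copy(
    src::PermutedDimsArray{T,3,perm,iperm,SArray{S,T,3,L}}
) where {T,S,perm,iperm,L}
    dims = tuple(S.parameters...)           # parent sizes, known from the type
    psize = ntuple(d -> dims[perm[d]], 3)   # sizes of the permuted view
    loads = Expr[]
    for i3 in 1:psize[3], i2 in 1:psize[2], i1 in 1:psize[1]
        I = (i1, i2, i3)                    # destination index, column-major order
        push!(loads, :(p[$(I[iperm[1]]), $(I[iperm[2]]), $(I[iperm[3]])]))
    end
    return quote
        p = parent(src)
        SArray{Tuple{$(psize...)},$T,3,$L}(($(loads...),))
    end
end

static_fast_copy(PermutedDimsArray(@SArray(rand(3, 3, 3)), (3, 1, 2)))  # a 3×3×3 SArray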

Tokazama avatar Jan 06 '22 06:01 Tokazama

I've started putting together a more formal testing ground for some related ideas I've been toying with here, in case anyone is interested (well, more formal than intangible ideas).

Tokazama avatar Jan 11 '22 10:01 Tokazama