StaticArrays.jl
circshift(SArray) returns MArray
Example:
julia> sv = SVector(1,2,3);
julia> circshift(sv, 1)
3-element MArray{Tuple{3},Int64,1,3} with indices SOneTo(3):
3
1
2
I think the result should be SArray.
Agreed. The reason for this is that it's dispatching to Base and similar(::SVector) returns an MVector.
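For illustration (my own quick check; the printed type alias may vary by Julia/StaticArrays version), here is where the mutable result comes from: Base's generic circshift builds its output with similar, and for a static vector that gives the mutable variant.
using StaticArrays

sv = SVector(1, 2, 3)
# Base's fallback circshift allocates its output via `similar`, and for a
# static vector that yields an MVector (i.e. an MArray):
typeof(similar(sv))   # MVector{3, Int64}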
I looked into this as I wanted to learn a bit more about Julia and this library, but, oh boy, this was (to a novice like me) a surprisingly difficult problem.
The immutability basically forces a much more inconvenient implementation (with significant overhead), just like setindex.
So, unless one goes for an approach with a mutable temporary, you'd have to end up with something along the lines of:
circshift(a::StaticArray, shift::Int) = _circshift(Length(a), a, shift)
@generated function _circshift(::Length{L}, a::StaticArray{<:Tuple,T}, shift::Int) where {L, T}
    exprs = [:(a[1+mod($i-shift, L)]) for i = 0:(L-1)]
    return quote
        @inbounds return typeof(a)(tuple($(exprs...)))
    end
end
So, given the forced marriage of stack allocation and immutability in Julia, I have started to think of this more philosophically. Basically, the need for a circshift (or a setindex) with a dynamic shift variable indicates an underlying problem.
A "compile time" length for the shift is what we should actually want most of the time with static variables (I apologize for the messy code here; I'm new to generated functions and barely know what I'm doing):
function circshift(a::StaticArray{Tuple{L}}, ::Length{N}) where {L, N}
    vcat(_slice(Length(L-N+1), Length(L), a), _slice(Length(1), Length(L-N), a))
end
@generated function _slice(::Length{L}, ::Length{U}, a::StaticArray{<:Tuple,T}) where {L, U, T}
    exprs = [:(a[$i]) for i = L:U]
    return quote
        @inbounds return SArray{Tuple{1+U-L},T,1,1+U-L}(tuple($(exprs...)))
    end
end
which, of course, is super-fast if one specifies a static shift, e.g. circshift(a, Length(3)).
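With the definitions above in scope (names as in the sketch), a quick usage check:
a = SVector(1, 2, 3, 4, 5)
circshift(a, Length(2))   # should give SVector(4, 5, 1, 2, 3), matching Base.circshift(collect(a), 2)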
I looked into this as I wanted to learn a bit more about Julia and this library, but, oh boy, this was (to a novice like me) a surprisingly difficult problem.
Haha! Well, thanks for jumping right in! I think you're on the right track here :-)
So, unless one goes for an approach with a mutable temporary, you'd have to end up with something along the lines of [...]
Yes, I think that's basically the right way to implement it. However, it might not be as bad as you think: the compiler is quite good at constant propagation, so in cases where shift is constant in the caller, all the occurrences of mod in _circshift will be optimized away. However, for this to work you may need to encourage some extra inlining:
@inline Base.circshift(a::StaticArray, shift::Int) = _circshift(Length(a), a, shift)
@generated function _circshift(::Length{L}, a::StaticArray{<:Tuple,T}, shift::Int) where {L, T}
    exprs = [:(a[1+mod($i-shift, L)]) for i = 0:(L-1)]
    return quote
        $(Expr(:meta, :inline))
        @inbounds return typeof(a)(tuple($(exprs...)))
    end
end
Note the appearance of Expr(:meta, :inline) in the generated code, which is the confusing way you need to write @inline for @generated functions. It seemed I also needed to add @inline to circshift.
Thus we get extremely simple native code when the shift amount can be constant-propagated all the way inward. E.g., for a function foo:
julia> foo(a) = circshift(a, length(a)÷2)
foo (generic function with 1 method)
julia> foo(SA[1,2,3,4,5])
5-element SArray{Tuple{5},Int64,1,5} with indices SOneTo(5):
4
5
1
2
3
julia> @code_native foo(SA[1,2,3,4,5])
.text
; ┌ @ REPL[11]:1 within `foo'
movq %rdi, %rax
; │┌ @ REPL[9]:1 within `circshift'
; ││┌ @ REPL[3]:2 within `_circshift'
; │││┌ @ REPL[3]:5 within `macro expansion'
vmovups 24(%rsi), %xmm0
vmovups (%rsi), %xmm1
movq 16(%rsi), %rcx
; │└└└
vmovups %xmm0, (%rdi)
vmovups %xmm1, 16(%rdi)
movq %rcx, 32(%rdi)
retq
nop
; └
So, given the forced marriage of stack allocation and immutability in Julia
Luckily this isn't quite true :-) The compiler will commonly allocate MVector on the stack as well, but only if it can prove that it's a temporary which doesn't escape the stack frame. So one viable alternative to the generated function might be to use a temporary MVector, but re-wrap in the type of the input vector before returning it. This will convert to SVector if an SVector was passed in.
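To make that concrete, here is a small self-contained check of the no-escape behaviour (my own sketch, not package code; the zero-allocation result assumes the optimizer succeeds):
using StaticArrays

function add_one_to_first(v::SVector{3,Float64})
    m = MVector(v)        # mutable temporary; can live on the stack if it doesn't escape
    m[1] += 1.0
    return SVector(m)     # re-wrap as an immutable SVector before returning
end

v = SVector(1.0, 2.0, 3.0)
add_one_to_first(v)              # warm up / compile
@allocated add_one_to_first(v)   # typically 0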
Yeah, I didn't manage to get it to inline properly, so I saw very poor performance.
Though the tricky case is that performance is somewhat poor (as arrays get a bit larger) when the calls to mod can't be optimized away.
Experimenting a bit:
@inline function circshift(v::SVector{L}, shift::Integer) where L
    out = similar(v)
    shift = mod(shift, L)
    cut = L-shift
    @inbounds for i in 1:(L-cut)
        out[i] = v[i+cut]
    end
    @inbounds for i in 1:cut
        out[i+shift] = v[i]
    end
    return SVector(out)
end
So, this version seems to be sufficiently optimizer friendly, with the added benefit of not having to mess about with generated functions (a plus in I.M.H.O.) and the fact that it's fast in cases where shift isn't known at compile time.
not having to mess about with generated functions (a plus in I.M.H.O.)
Yes it's always nice to avoid @generated!
function circshift(v::SVector{L}, shift::Integer) where L
You're not using L in dispatch here, so you can simplify the signature and just use length internally, which should be cleaner and can easily be optimized since it can always be known from typeof(v):
@inline function circshift(v::StaticVector, shift::Integer)
    w = similar(v)
    L = length(v)
    shift = mod(shift, L)
    cut = L-shift
    @inbounds for i in 1:(L-cut)
        w[i] = v[i+cut]
    end
    @inbounds for i in 1:cut
        w[i+shift] = v[i]
    end
    return similar_type(v)(w)
end
Really nice - so does that end up with good code these days? Do we end up with a bunch of extra move instructions on the last line, or is it elided entirely?
What happens for big numbers and other mutable things that might not work well inside an MArray?
@c42f seeing this I feel like we are edging closer to making this package simpler or even redundant - I mean that function is basically good for AbstractArray! I'd simply replace the last line with return freeze(w) (a la Keno's suggestion). To fix MArray we could really use a "mutable tuple" which freezes to a Tuple... and then maybe constant propagation is the only "magic" we need here.
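For concreteness, a minimal sketch of what such a freeze could look like for static arrays (a hypothetical helper of mine, not an existing StaticArrays or Base function):
using StaticArrays

# Hypothetical freeze: re-wrap a (possibly mutable) static array as the
# corresponding immutable SArray, via its data tuple.
freeze(w::StaticArray) = similar_type(w)(Tuple(w))
With that, the last line of the circshift above could become return freeze(w), and the same body could in principle serve other array types if freeze gained suitable methods for them.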
What happens for big numbers and other mutable things that might not work well inside an MArray?
There's a WIP PR to Base (https://github.com/JuliaLang/julia/pull/34126) which puts pointers inline, so that could maybe help fix MArray with BigFloat etc (though we'd still need extra care to preserve GC invariants). In fact that suggests #749 might be reasonable to merge.
seeing this I feel like we are edging closer to making this package simpler or even redundant - I mean that function is basically good for AbstractArray! I'd simply replace the last line with return freeze(w) (a la Keno's suggestion).
Yes agreed things are getting simpler! I think the larger way forward is to start using freeze (or some freeze-like thing) internally in StaticArrays and see whether we can get good performance plus code which would in principle work for AbstractArray. Then try to merge some of that back into Base once we know things work.
There's a WIP PR to Base (JuliaLang/julia#34126) which puts pointers inline, so that could maybe help fix MArray with BigFloat etc (though we'd still need extra care to preserve GC invariants).
Unfortunately this doesn't affect all objects (structs with union types and unallocated fields may be treated differently).
Yes agreed things are getting simpler! I think the larger way forward is to start using freeze (or some freeze-like thing) internally in StaticArrays and see whether we can get good performance plus code which would in principle work for AbstractArray. Then try to merge some of that back into Base once we know things work.
Yeah! Let's do it :) I'd still be tempted to create a separate package with freeze and thaw and overload that, so everyone can explore.
I'd still be tempted to create a separate package with freeze and thaw and overload that, so everyone can explore.
I've been thinking about doing this for a while for structs and tuples. It's super easy to implement this with Setfield.jl: https://github.com/jw3126/Setfield.jl/issues/56#issuecomment-516696745
Yes agreed things are getting simpler! I think the larger way forward is to start using freeze (or some freeze-like thing) internally in StaticArrays and see whether we can get good performance plus code which would in principle work for AbstractArray. Then try to merge some of that back into Base once we know things work.
Yeah! Let's do it :) I'd still be tempted to create a separate package with freeze and thaw and overload that, so everyone can explore.
So... I got overexcited. I think I need this for Dictionaries.jl anyways.
https://github.com/andyferris/Freeze.jl
You beat me to it! I hope I'm not slowing down your excitement but I think issettable is too coarse-grained. It might be better to be able to freeze/thaw different entities (e.g., shape-frozen vs value-frozen). For example, you'd want shape-frozen value-mutable arrays for columns of tables. Also, I've been thinking that it'd be nice to have an ownership-as-convention kind of API. This way, you can get a from thaw(move(freeze(a))) "safely."
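Just to make the ownership-by-convention idea concrete, a toy, self-contained sketch (all three names are illustrative only, not an existing API):
using StaticArrays

freeze(v::MVector) = SVector(v)    # value-frozen copy
thaw(v::SVector)   = MVector(v)    # mutable copy
move(x) = x                        # by convention: the caller promises not to use x afterwards

a = MVector(1, 2, 3)
b = thaw(move(freeze(a)))          # mutating b is "safe" because a was moved away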
Haha!
Yes, we definitely need “shape frozen” vs “value frozen” semantics, and I meant that issettable is only for mutating values, not the shape. My experience with Dictionaries.jl is that containers whose keys change are significantly more challenging to design shared generic code for, and also all static arrays are shape frozen anyway.
If you are going to think about these things, remember that append-only datasets are a thing and that we might want, e.g., a vector that supports push! but not setindex!. Thinking of it this way, we can keep the two concepts simpler and orthogonal rather than complecting them. (That’s also why setindex! doesn’t insert new keys in Dictionaries.jl).
For dictionaries I was thinking isinsertable, freezekeys and thawkeys could be the (orthogonal) “shape” interface. What do you think?
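To illustrate how the two axes could stay orthogonal, here are some toy trait definitions (the names come from the comments above, but none of this is an existing interface):
using StaticArrays

issettable(::MVector)   = true    # values can be overwritten in place...
isinsertable(::MVector) = false   # ...but the length (shape) is fixed

issettable(::Vector)    = true    # an ordinary Vector allows both:
isinsertable(::Vector)  = true    # push!/insert!/delete! change the shape

# An append-only dataset would be the remaining useful combination:
# isinsertable = true, issettable = false.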
Note that it is possible to write something like
function matmul(a::SMatrix{I, J}, b::SMatrix{J, K}) where {I, J, K}
    c = zero(MMatrix{I, K})
    @inbounds for k in 1:K, j in 1:J
        b_jk = b[j, k]
        @simd for i in 1:I
            c[i, k] = muladd(a[i, j], b_jk, c[i, k])
        end
    end
    return SMatrix(c)
end
now, without causing allocations. It seems LLVM only unrolls the innermost loop for some reason, though.
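A quick sanity check of that matmul sketch (sizes arbitrary; assumes StaticArrays is loaded and the definition above is in scope):
a = @SMatrix rand(3, 4)
b = @SMatrix rand(4, 2)
matmul(a, b) ≈ a * b      # should hold, with the MMatrix temporary eliding any heap allocation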
(I opened https://github.com/andyferris/Freeze.jl/issues/1 to avoid derailing this issue too much.)
So I started experimenting further: https://github.com/andyferris/StaticArraysLite.jl
So far the basic functionality is quite competitive with StaticArrays for SVector{3, Float64}. It would be very interesting to see how this goes for matrices and slightly larger arrays.
I need to use circshift() on SVector, and I discovered someone submitted this issue a while ago. It turned out that it was actually myself 😁.
Back then I just reported the lack of this functionality and didn't think about the solution. Base.circshift() supports multi-dimensional arrays, but if we focus on vectors, would it be possible to add the following simple solutions?
Base.circshift(v::SVector{N,T}, shift::Integer) where {N,T} = SVector(ntuple(k->v[mod1(k-shift,N)], Val(N)))
Base.circshift(v::SVector{N,T}, ::Val{S}) where {N,T,S} = SVector(ntuple(k->v[mod1(k-S,N)], Val(N)))
Here are the benchmark results:
julia> VERSION
v"1.7.2-pre.0"
julia> v = SVector(ntuple(identity, 10))
10-element SVector{10, Int64} with indices SOneTo(10):
1
2
3
4
5
6
7
8
9
10
julia> @btime circshift($v, 3);
31.051 ns (0 allocations: 0 bytes)
julia> @btime circshift($v, Val(3));
2.641 ns (0 allocations: 0 bytes)