StructArrays.jl icon indicating copy to clipboard operation
StructArrays.jl copied to clipboard

Should indexing return a LazyRow by default?

Open dpsanders opened this issue 6 years ago • 5 comments

It seems like a gotcha that indexing into a StructArray will materialize a new struct. Maybe it should return a LazyRow by default?

dpsanders avatar Jan 01 '20 19:01 dpsanders

If all the fields are isbits then materializing a new struct does not allocate and is a reasonable thing to do. If your StructArray is large, OTOH, you should probably iterate using LazyRows(s::StructArray).

I'd rather not change the default, as many users use StructArray for small, isbits structs, like ComplexF64 and expect the correct type being returned, but I agree that the performance characteristics in the non-isbits (or "wide") case should be better documented. I'm keeping this issue open as a reminder to add docs about this.

piever avatar Jan 01 '20 21:01 piever

Thanks. My use case is not iterating but indexing into given elements. The (native) code generated for that (using LazyRow) seems to be surprisingly complicated, so I'm worried about performance.

dpsanders avatar Jan 02 '20 14:01 dpsanders

Benchmarking on a cheap operation (like sum) does not reveal a difference:

julia> using StructArrays, BenchmarkTools

julia> s = StructArray(a = rand(1000), b = rand(1000));

julia> @btime sum(LazyRow($s, i).a for i in 1:1000);
  1.038 μs (0 allocations: 0 bytes)

julia> @btime sum($s.a[i] for i in 1:1000);
  1.037 μs (0 allocations: 0 bytes)

Regarding the native code complexity, it may be a matter of adding @inbounds. If one uses getproperty within @inbounds the native code seems fine, compare:

julia> s = StructArray(a = rand(10), b = rand(10));

julia> l = LazyRow(s, 1);

julia> f(l) = (ret = @inbounds l.a; ret)
f (generic function with 1 method)

julia> @code_native f(l)
        .text
; ┌ @ REPL[30]:1 within `f'
; │┌ @ REPL[30]:1 within `getproperty'
        movq    (%rdi), %rax
; │└
; │┌ @ lazy.jl:9 within `getproperty'
; ││┌ @ array.jl:744 within `getindex'
        movq    8(%rdi), %rcx
; ││└
; ││ @ lazy.jl:9 within `getproperty' @ structarray.jl:126
; ││┌ @ structarray.jl:124 within `fieldarrays'
        movq    (%rax), %rax
; ││└
        movq    (%rax), %rax
; │└
; │┌ @ array.jl:744 within `getproperty'
        movq    (%rax), %rax
        vmovsd  -8(%rax,%rcx,8), %xmm0  # xmm0 = mem[0],zero
; │└
        retq
        nopw    (%rax,%rax)
; └

with the hand-written version

julia> g(s) = (@inbounds ret = s.a[1]; ret)
g (generic function with 1 method)

julia> @code_native g(s)
        .text
; ┌ @ REPL[33]:1 within `g'
; │┌ @ structarray.jl:126 within `getproperty'
; ││┌ @ REPL[33]:1 within `fieldarrays'
        movq    (%rdi), %rax
; ││└
        movq    (%rax), %rax
; │└
; │┌ @ array.jl:744 within `getindex'
        movq    (%rax), %rax
        vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
; │└
        retq
        nop
; └

This way of using @inbounds may be counter-intuitive, maybe I should do the @boundscheck when generating the LazyRow, not when accessing its fields.

piever avatar Jan 02 '20 18:01 piever

That's nice, thanks. Yes, that is rather counterintuitive; it would be good to hide that inside the implementation.

dpsanders avatar Jan 02 '20 20:01 dpsanders

@piever I came here to ask a similar question. In my case I am using StructArrays for a matrix of composite structs.

A problem is that I want to also segment the elements into Sets. It took me a while to realize why membership checks were always failing, and indeed why

S[i] == S[i] %false

for a given structarray S and index i—the materializations are different objects, while:

LazyRow(S, i) == LazyRow(S, i) %true.

Is it possible to perhaps parameterize what getindex will return? Getting the LazyRow() back would greatly simplify my code and allow consistent patterns with other data structures.

jlambvo avatar Mar 06 '24 17:03 jlambvo