Gridap.jl
Extend table to support arbitrary vector types
Change from this:

```julia
struct Table{T,P} <: AbstractVector{Vector{T}}
  data::Vector{T}
  ptrs::Vector{P}
end
```

to this:

```julia
struct Table{T,Vd<:AbstractVector{T},Vp<:AbstractVector} <: AbstractVector{Vector{T}}
  data::Vd
  ptrs::Vp
end
```
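With the generalized struct, `data` and `ptrs` are no longer tied to `Vector`. A minimal sketch of what this enables (the `Int32` pointer vector here is just an illustrative choice, not part of the proposal):

```julia
struct Table{T,Vd<:AbstractVector{T},Vp<:AbstractVector} <: AbstractVector{Vector{T}}
  data::Vd
  ptrs::Vp
end

# The old Vector-based layout is still expressible ...
t1 = Table{Float64,Vector{Float64},Vector{Int}}(rand(6), [1,3,5,7])

# ... but other AbstractVector subtypes become admissible too,
# e.g. Int32 pointers to halve the memory used by the ptrs field.
t2 = Table{Float64,Vector{Float64},Vector{Int32}}(rand(6), Int32[1,3,5,7])
```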
Allow setting values in a `Table` using the standard Julia vector-of-vectors interface, i.e.,

```julia
a = Table(....)
a[i][j]
```
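One way to support both reads and writes through this interface is to have `getindex` return a `view` into `data`, the same idea exercised in the benchmark below. A minimal self-contained sketch (simplified `Table` without the abstract supertype):

```julia
struct Table{Vd,Vp}
  data::Vd
  ptrs::Vp
end

# Returning a view makes a[i] a window into data, so both reads
# and writes via a[i][j] hit the underlying storage.
Base.getindex(a::Table, i::Integer) = view(a.data, a.ptrs[i]:a.ptrs[i+1]-1)
Base.length(a::Table) = length(a.ptrs) - 1

a = Table([10.0, 20.0, 30.0, 40.0], [1, 3, 5])
a[1][2] = 99.0   # writes through the view into a.data
```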
@santiagobadia @amartinhuertas
I have developed these simple benchmarks to check the performance hit of using views when indexing a table
```julia
module KK

struct Table{Vd,Vp}
  data::Vd
  ptrs::Vp
end

@inline function Base.getindex(a::Table,i::Integer)
  @inbounds pini = a.ptrs[i]
  @inbounds pend = a.ptrs[i+1]-1
  ids = pini:pend
  view(a.data,ids)
end

function Base.length(a::Table)
  length(a.ptrs)-1
end

function length_to_ptrs!(ptrs)
  ptrs[1] = 1
  @inbounds for i in 1:(length(ptrs)-1)
    ptrs[i+1] += ptrs[i]
  end
end

function random_table(ng,nl)
  ptrs = fill(nl,ng+1)
  length_to_ptrs!(ptrs)
  data = rand(nl*ng)
  Table(data,ptrs)
end

function sum_like_array(a::Table)
  s = 0.0
  @inbounds for i in 1:length(a)
    for j in 1:length(a[i])
      s += a[i][j]
    end
  end
  s
end

function sum_like_table(a::Table)
  s = 0.0
  @inbounds for i in 1:length(a)
    pini = a.ptrs[i]
    pend = a.ptrs[i+1]-1
    for p in pini:pend
      s += a.data[p]
    end
  end
  s
end

function fill_like_array!(a::Table)
  s = 0.0
  @inbounds for i in 1:length(a)
    for j in 1:length(a[i])
      a[i][j] = s
    end
  end
  a
end

function fill_like_table!(a::Table)
  s = 0.0
  @inbounds for i in 1:length(a)
    pini = a.ptrs[i]
    pend = a.ptrs[i+1]-1
    for p in pini:pend
      a.data[p] = s
    end
  end
  a
end

using BenchmarkTools

a = random_table(Int(1e7),10)

@btime sum_like_array($a)
@btime sum_like_table($a)
@btime fill_like_array!($a)
@btime fill_like_table!($a)

end # module
```
The timing results are:

```
129.072 ms (0 allocations: 0 bytes)   # sum_like_array
123.982 ms (0 allocations: 0 bytes)   # sum_like_table
115.403 ms (0 allocations: 0 bytes)   # fill_like_array!
 65.649 ms (0 allocations: 0 bytes)   # fill_like_table!
```
It seems that accessing the fields of the table explicitly is more performant when we fill the table (65.649 ms vs 115.403 ms). When we only read the values, explicit field access and views perform similarly.
However, if we deactivate inlining with

```julia
@noinline function Base.getindex(a::Table,i::Integer)
  @inbounds pini = a.ptrs[i]
  @inbounds pend = a.ptrs[i+1]-1
  ids = pini:pend
  view(a.data,ids)
end
```
then using views becomes very inefficient:

```
  1.066 s (110000000 allocations: 4.92 GiB)   # sum_like_array
122.575 ms (0 allocations: 0 bytes)           # sum_like_table
  1.048 s (110000000 allocations: 4.92 GiB)   # fill_like_array!
 66.400 ms (0 allocations: 0 bytes)           # fill_like_table!
```
Hi @fverdugo,

thanks for sharing this micro-benchmark. I get the same (qualitative) outcome on my machine. In particular:
- With `@inline` and without it:

  ```
  126.372 ms (0 allocations: 0 bytes)
  120.451 ms (0 allocations: 0 bytes)
  228.045 ms (0 allocations: 0 bytes)
   64.823 ms (0 allocations: 0 bytes)
  ```

- With `@noinline`:

  ```
    1.386 s (110000000 allocations: 4.92 GiB)
  119.027 ms (0 allocations: 0 bytes)
    1.349 s (110000000 allocations: 4.92 GiB)
   64.400 ms (0 allocations: 0 bytes)
  ```
As a side note, I also tried on my machine the following optimization (at the expense of extra memory consumption), which somehow better resembles the way I am now handling the implementation of `MPIPETScDistributedVector`:

```julia
struct Table{Vd,Vp,Vv}
  data::Vd
  ptrs::Vp
  views::Vv
  function Table(data::Vd,ptrs::Vp) where {Vd,Vp}
    TSUB = SubArray{eltype(Vd),1,Vd,Tuple{UnitRange{Int64}},true}
    views = TSUB[ view(data,ptrs[i]:ptrs[i+1]-1) for i in 1:length(ptrs)-1 ]
    new{Vd,Vp,typeof(views)}(data,ptrs,views)
  end
end

@noinline function Base.getindex(a::Table,i::Integer)
  @inbounds v = a.views[i]
  v
end
```
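For completeness, a self-contained usage sketch of this precomputed-views variant (repeating the struct so the example runs on its own). Since the stored `SubArray`s write through to `data`, reads and writes via `a[i][j]` both work, though the `views` field would need rebuilding if `ptrs` ever changed:

```julia
# Precomputing the views trades memory for getindex cost:
# getindex becomes a plain array lookup, so even without inlining
# it does not allocate.
struct Table{Vd,Vp,Vv}
  data::Vd
  ptrs::Vp
  views::Vv
  function Table(data::Vd, ptrs::Vp) where {Vd,Vp}
    TSUB = SubArray{eltype(Vd),1,Vd,Tuple{UnitRange{Int64}},true}
    views = TSUB[view(data, ptrs[i]:ptrs[i+1]-1) for i in 1:length(ptrs)-1]
    new{Vd,Vp,typeof(views)}(data, ptrs, views)
  end
end

Base.getindex(a::Table, i::Integer) = a.views[i]

a = Table([1.0, 2.0, 3.0, 4.0], [1, 3, 5])
a[2][1] = 7.0   # the stored view writes through to data
```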
- With `@inline` and without it:

  ```
  144.999 ms (0 allocations: 0 bytes)
  122.821 ms (0 allocations: 0 bytes)
  106.697 ms (0 allocations: 0 bytes)
   66.059 ms (0 allocations: 0 bytes)
  ```

- With `@noinline`:

  ```
  296.765 ms (0 allocations: 0 bytes)
  117.954 ms (0 allocations: 0 bytes)
  241.276 ms (0 allocations: 0 bytes)
   65.924 ms (0 allocations: 0 bytes)
  ```
First, there seems to be a performance hit whenever we sum all entries of the table (144.999 ms versus 126.372 ms), but a performance improvement whenever we fill the table (106.697 ms versus 228.045 ms). Second, although `@noinline` impacts performance, it no longer has such a significant impact (indeed now, as expected, there are no allocations).
In any case, what is clear from this experiment is that accessing the table entries through the raw `ptrs` and `data` fields directly is the most efficient alternative (as we expected during the meeting). On the other hand, to be honest, looking just at the raw timings, it is hard (at least up to my current knowledge) to determine the actual causes behind the behaviour we are observing. There are very complex issues involved under the hood, with low-level compiler optimizations, memory locality, and the cache system at the forefront.