Replace unsafe_free! with finalize?
I seem to remember that finalize is slow, which is why we implemented our own refcounting and provided unsafe_free!. However, the cost seems manageable:
julia> @benchmark finalize(a) setup=(a=CuArray([1]))
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
Range (min … max): 18.506 ns … 36.669 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 19.458 ns ┊ GC (median): 0.00%
Time (mean ± σ): 19.489 ns ± 0.536 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅█
▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▂▂▂▂▁▂▂▂▂▃▇▇▄▃███▅▃▃▃▃▂▂▂▂▂▁▂▁▂ ▃
18.5 ns Histogram: frequency by time 19.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CuArray([1]))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 3.010 ns … 18.370 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.080 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.093 ns ± 0.292 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▅ ▁
▂▁▁▁▂▁▁▁▂▁▁▁▂▃▁▁▁▃▁▁▁▂▆▁▁▁█▁▁▁▂█▁▁▁█▁▁▁▂▆▁▁▁▄▁▁▁▂▃▁▁▁▃▁▁▁▂ ▂
3.01 ns Histogram: frequency by time 3.14 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
@gbaraldi @vchuravy Thoughts? Does the cost maybe only manifest when the GC is loaded?
IIRC it's a linear scan over the finalizer list, to remove the object from it.
So maybe create a couple thousand objects with finalizers and benchmark it then.
Hmm, doesn't seem to significantly affect the performance:
julia> mutable struct ListNode
           key::Int64
           next::ListNode
           ListNode() = new()
           ListNode(x) = new(x)
           ListNode(x, y) = new(x, y)
       end
julia> function list(n=128)
           start::ListNode = ListNode(1)
           current::ListNode = start
           for i = 2:(n*1024^2)
               current = ListNode(i, current)
               finalizer(identity, current)
           end
           return current.key
       end
list (generic function with 2 methods)
julia> x = list();
julia> @benchmark finalize(a) setup=(a=CuArray([1]))
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
Range (min … max): 19.117 ns … 38.384 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 19.599 ns ┊ GC (median): 0.00%
Time (mean ± σ): 19.941 ns ± 0.811 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▆▄▃ ▆▆▄▄▁ ▂
▇▆▇▆▆████████▄▅▆▇▆▆▆▇▇██████▄▃▁▃▁▁▁▄▁▄▆▃▄█▇▆▅▄▃▁▁▃▃▁▄▅▄▄▆▆▄ █
19.1 ns Histogram: log(frequency) by time 22.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
The code for jl_finalize_th and finalize_object does indeed seem fairly complex, though, iterating over the finalizers and even allocating a list. Not sure why that isn't visible here.
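One thing that might be worth ruling out (an assumption, not a diagnosis): list() returns only current.key, so the nodes are unreachable once it returns, and a later GC pass could already have run and dropped their finalizers before the benchmark starts. A minimal sketch of a variant that keeps the objects reachable while timing finalize is below. It uses only Base so it runs without a GPU; Dummy, populate and time_finalize are hypothetical names made up for illustration, and @elapsed is much cruder than @benchmark, so treat the numbers as a rough indication only.

```julia
# Hypothetical stand-in for any object that carries a finalizer.
mutable struct Dummy
    x::Int
end

# Create `n` objects with a finalizer and return them, so the caller
# keeps them reachable (and their finalizers registered).
function populate(n)
    objs = Vector{Dummy}(undef, n)
    for i in 1:n
        # finalizer(f, x) registers f and returns x
        objs[i] = finalizer(identity, Dummy(i))
    end
    return objs
end

# Average wall-clock time (seconds) of one explicit `finalize` call,
# measured over `reps` fresh objects that each have one finalizer.
function time_finalize(reps)
    targets = [finalizer(identity, Dummy(0)) for _ in 1:reps]
    t = @elapsed for obj in targets
        finalize(obj)
    end
    return t / reps
end

time_finalize(10)                  # warm up compilation
baseline = time_finalize(1_000)    # few other finalizers registered
live = populate(100_000)           # global binding keeps these alive
loaded = time_finalize(1_000)      # many finalizers registered

println("baseline: ", baseline * 1e9, " ns per finalize")
println("loaded:   ", loaded * 1e9, " ns per finalize")
println("live objects still held: ", length(live))
```

If the two timings differ a lot, that would suggest the cost only shows up while the finalizers are still registered; if they don't, the linear scan is presumably cheaper than it looks.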