CUDA.jl

Replace unsafe_free! with finalize?

Open maleadt opened this issue 1 year ago • 2 comments

I seem to remember that finalize is slow, and that this is why we implemented our own refcounting and provided unsafe_free!. However, the cost seems manageable:

julia> @benchmark finalize(a) setup=(a=CuArray([1]))
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  18.506 ns … 36.669 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.458 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.489 ns ±  0.536 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                            ▂▅█
  ▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▂▂▂▂▁▂▂▂▂▃▇▇▄▃███▅▃▃▃▃▂▂▂▂▂▁▂▁▂ ▃
  18.5 ns         Histogram: frequency by time        19.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark CUDA.unsafe_free!(a) setup=(a=CuArray([1]))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.010 ns … 18.370 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.080 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.093 ns ±  0.292 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                            █    ▅   ▁
  ▂▁▁▁▂▁▁▁▂▁▁▁▂▃▁▁▁▃▁▁▁▂▆▁▁▁█▁▁▁▂█▁▁▁█▁▁▁▂▆▁▁▁▄▁▁▁▂▃▁▁▁▃▁▁▁▂ ▂
  3.01 ns        Histogram: frequency by time        3.14 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@gbaraldi @vchuravy Thoughts? Does the cost perhaps only show up when the GC is under load?

maleadt, Jul 12 '24 14:07

IIRC it's a linear scan over the finalizer list, to remove the object from it.

So maybe create a couple thousand objects with finalizers and benchmark it then.

vchuravy, Jul 12 '24 17:07

Hmm, doesn't seem to significantly affect the performance:

julia> mutable struct ListNode
         key::Int64
         next::ListNode
         ListNode() = new()
         ListNode(x)= new(x)
         ListNode(x,y) = new(x,y)
       end

julia> function list(n=128)
           start::ListNode = ListNode(1)
           current::ListNode = start
           for i = 2:(n*1024^2)
               current = ListNode(i,current)
               finalizer(identity, current)
           end
           return current.key
       end
list (generic function with 2 methods)

julia> x = list();

julia> @benchmark finalize(a) setup=(a=CuArray([1]))
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  19.117 ns … 38.384 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.599 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.941 ns ±  0.811 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         █▇▆▄▃          ▆▆▄▄▁                                 ▂
  ▇▆▇▆▆████████▄▅▆▇▆▆▆▇▇██████▄▃▁▃▁▁▁▄▁▄▆▃▄█▇▆▅▄▃▁▁▃▃▁▄▅▄▄▆▆▄ █
  19.1 ns      Histogram: log(frequency) by time      22.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Even though the code for jl_finalize_th and finalize_object does indeed seem fairly complex, iterating over the finalizers and even allocating a list. Not sure why that isn't visible here.
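
One possible explanation, offered as an assumption rather than a verified diagnosis: list() returns only current.key, so every ListNode becomes unreachable once it returns, and BenchmarkTools by default runs a GC before a trial (gctrial = true), which may already have run those finalizers and emptied the list before anything is measured. Note also that setup runs once per sample, not once per evaluation, so with 997 evaluations most finalize(a) calls hit an object that has already been finalized. The sketch below tries to avoid both effects by keeping the objects reachable and finalizing each timed object exactly once. Tracked, populate, time_finalize and the sizes are hypothetical names and values chosen for illustration; it uses plain mutable structs instead of CuArrays and hand-rolled timing instead of @benchmark.

```julia
# Hypothetical stand-in for an object that carries a finalizer.
mutable struct Tracked
    id::Int
end

# Register a no-op finalizer on `n` objects and return them in a vector,
# so they stay reachable and their finalizer entries stay registered.
function populate(n)
    keep = Vector{Tracked}(undef, n)
    for i in 1:n
        keep[i] = finalizer(identity, Tracked(i))  # finalizer returns the object
    end
    return keep
end

# Call `finalize` once on each of `reps` fresh objects and return the
# mean time per call in nanoseconds.
function time_finalize(reps)
    objs = [finalizer(identity, Tracked(0)) for _ in 1:reps]
    t0 = time_ns()
    for o in objs
        finalize(o)
    end
    return (time_ns() - t0) / reps
end

time_finalize(10)                 # warm-up so compilation is not timed
baseline = time_finalize(200)     # few other finalizers registered
keep = populate(1_000_000)        # assumed size; the global keeps them alive
loaded = time_finalize(200)       # same measurement with a populated list

println("baseline: ", baseline, " ns per call")
println("loaded:   ", loaded, " ns per call")
```

If the loaded figure grows roughly in proportion to the number of live objects with finalizers, that would be consistent with the linear scan mentioned above. No result is claimed here.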

maleadt, Jul 13 '24 08:07