Subsequent reads are not faster

Open cstjean opened this issue 4 years ago • 0 comments

Perhaps I'm misunderstanding mmap, but I would expect copy(df) to be much faster the second time around, because by then mmap has lazily loaded the data into memory. On our 0.5 GB dataset, we get:

julia> df = DataFrame(Arrow.Table("some_file.arrow")); 0
0

julia> @time copy(df); 0
  1.412632 seconds (1.03 k allocations: 616.774 MiB, 1.76% gc time)
0

julia> @time copy(df); 0
  1.451533 seconds (1.03 k allocations: 616.774 MiB, 7.55% gc time)
0

julia> @time copy(df); 0
  1.194862 seconds (1.03 k allocations: 616.774 MiB, 0.29% gc time)
0

In contrast, with plain mmap the second copy is nearly 3x faster than the first (about 17.2 s vs. 6.4 s).

Details
julia> m, n = 5_000_000, 300
(5000000, 300)

julia> A = rand(1:20, m, n);

julia> s = open("mmap.bin", "w+")
IOStream(<file mmap.bin>)

julia> write(s, A)
12000000000

julia> close(s)

julia> A = nothing

Then we read it back with mmap:

julia> using Mmap

julia> m, n = 5_000_000, 300
(5000000, 300)

julia> s = open("mmap.bin", "r")
IOStream(<file mmap.bin>)

julia> A2 = Mmap.mmap(s, Matrix{Int}, (m,n)); 0
0

julia> @time copy(A2); 0
 17.176765 seconds (536 allocations: 11.176 GiB, 0.05% gc time, 0.01% compilation time)
0

julia> @time copy(A2); 0
  6.409250 seconds (2 allocations: 11.176 GiB, 0.03% gc time)
0
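
For anyone who wants to try the plain-mmap comparison without a 12 GB file, here is a minimal sketch of the same experiment at a smaller scale. The helper name `time_mmap_copies` and the matrix size are illustrative assumptions, not part of the original report. Note also that a file that was just written is probably still in the OS page cache, so the gap between the first and second copy may be much smaller than in the timings above.

```julia
using Mmap

# Hypothetical helper (not from the original report): write an m-by-n Int
# matrix to `path`, map it back with Mmap, and time two consecutive copies.
function time_mmap_copies(path::AbstractString, m::Integer, n::Integer)
    A = rand(1:20, m, n)
    open(path, "w") do io
        write(io, A)
    end
    open(path, "r") do io
        A2 = Mmap.mmap(io, Matrix{Int}, (m, n))
        t1 = @elapsed copy(A2)   # first pass: pages may need to be faulted in
        t2 = @elapsed copy(A2)   # second pass: pages are likely resident
        # Also confirm that the mapped data round-trips unchanged.
        return (first = t1, second = t2, equal = copy(A2) == A)
    end
end

# Much smaller than the 5_000_000 x 300 matrix above (about 24 MB here).
path = tempname()   # Julia removes this temporary path at process exit
result = time_mmap_copies(path, 100_000, 30)
println("first copy:  ", result.first, " s")
println("second copy: ", result.second, " s")
```

Running the same two-copy timing on a single column of the Arrow-backed DataFrame would be one way to compare the two cases on equal terms.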

cstjean, Dec 17 '21 19:12