arrow-julia
Subsequent reads are not faster
Perhaps I'm misunderstanding mmap, but I would expect copy(df) to be much faster the second time around, because by then mmap has lazily loaded the data into memory. On our 0.5 GB dataset, we get:
julia> df = DataFrame(Arrow.Table("some_file.arrow")); 0
0
julia> @time copy(df); 0
1.412632 seconds (1.03 k allocations: 616.774 MiB, 1.76% gc time)
0
julia> @time copy(df); 0
1.451533 seconds (1.03 k allocations: 616.774 MiB, 7.55% gc time)
0
julia> @time copy(df); 0
1.194862 seconds (1.03 k allocations: 616.774 MiB, 0.29% gc time)
0
In contrast, with plain mmap the second copy is nearly 3x faster than the first (about 17.2 s vs. 6.4 s).
Details
julia> m, n = 5_000_000, 300
(5000000, 300)
julia> A = rand(1:20, m, n);
julia> s = open("mmap.bin", "w+")
IOStream(<file mmap.bin>)
julia> write(s, A)
12000000000
julia> close(s)
julia> A = nothing
Then we read it back with mmap:
julia> using Mmap
julia> m, n = 5_000_000, 300
(5000000, 300)
julia> s = open("mmap.bin", "r")
IOStream(<file mmap.bin>)
julia> A2 = Mmap.mmap(s, Matrix{Int}, (m,n)); 0
0
julia> @time copy(A2); 0
17.176765 seconds (536 allocations: 11.176 GiB, 0.05% gc time, 0.01% compilation time)
0
julia> @time copy(A2); 0
6.409250 seconds (2 allocations: 11.176 GiB, 0.03% gc time)
0
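For anyone who wants to try the plain-mmap comparison without writing a 12 GB file, here is a scaled-down sketch of the same steps. The dimensions and the temporary file are assumptions made for illustration, not the setup used above. Because the file has just been written, it will most likely already be in the OS page cache, so the gap between the first and second copy may be much smaller than in the transcript.

```julia
using Mmap

# Hypothetical, much smaller dimensions than the 5_000_000 x 300 used above.
m, n = 100_000, 30
A = rand(1:20, m, n)

# Write the raw matrix to a temporary file (stands in for "mmap.bin").
path, io = mktemp()
nbytes = write(io, A)
close(io)

# Map the file back as a Matrix{Int} of the same shape.
s = open(path, "r")
A2 = Mmap.mmap(s, Matrix{Int}, (m, n))

# Time two consecutive copies out of the mapping.
t1 = @elapsed copy(A2)
t2 = @elapsed copy(A2)
println("first copy: ", t1, " s; second copy: ", t2, " s")

# Keep one materialized copy to check the round trip.
B = copy(A2)
close(s)
```

To see a cold first read comparable to the 17 s figure above, the page cache would need to be empty for that file, for example after a reboot or with a file larger than available RAM.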