Memory leak when reading mat files v7
MWE:
using MAT

for _ ∈ 1:1000000
    matread("test/v7/array.mat")
end
Memory usage keeps growing until the process is killed (OOM). Reading v6 and v7.3 mat files works fine.
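To make the growth observable without waiting for the OOM kill, here is a minimal sketch that logs peak resident memory via Sys.maxrss() every 100k iterations (same test file as above):

using MAT

for i ∈ 1:1000000
    matread("test/v7/array.mat")
    # Sys.maxrss() reports the peak resident set size in bytes
    mod(i, 100000) == 0 && println(i, ": maxrss = ", Sys.maxrss() ÷ 2^20, " MiB")
end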
Julia and package version:
julia> versioninfo()
Julia Version 1.8.4
Commit 00177ebc4fc (2022-12-23 21:32 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
(jl_Qk9Uql) pkg> st
Status `/tmp/jl_Qk9Uql/Project.toml`
[23992714] MAT v0.10.3
I am also seeing severe memory leaks with v0.10.7 on Julia 1.11.2 when reading mat files.
I suspect this is related to the memory leak issue in the HDF5 library JuliaIO/HDF5.jl#1186
Not sure what's going on with the v6/v7 files. Note that v5/v6/v7 all use a similar binary file format, while v7.3 uses HDF5.
It's interesting that v7/array.mat allocates considerably more memory in general:
julia> @time matread("test/v6/array.mat");
0.000587 seconds (130 allocations: 5.359 KiB)
julia> @time matread("test/v7/array.mat");
0.000928 seconds (298 allocations: 245.035 KiB)
Digging into the code: for v6 files, MAT_v5.read_matrix encounters MAT_v5.miMATRIX elements, while for v7 files it encounters MAT_v5.miCOMPRESSED elements, in which case it calls ZlibDecompressorStream(IOBuffer(read!(f, Vector{UInt8}(undef, nbytes)))).
Perhaps the issue is there, in either the ZlibDecompressorStream or the IOBuffer itself?
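A quick way to test that hypothesis in isolation would be to hammer just the decompression path. A minimal sketch, assuming CodecZlib.jl (which provides ZlibDecompressorStream):

using CodecZlib

# Compress some throwaway bytes once, then repeatedly decompress them the
# way read_matrix does. Running this with and without the explicit close()
# should show whether the native zlib state held by the stream is what
# leaks (without close(), it is only freed when the finalizer runs).
compressed = transcode(ZlibCompressor, rand(UInt8, 10_000))
for _ ∈ 1:1000000
    stream = ZlibDecompressorStream(IOBuffer(compressed))
    read(stream)
    close(stream)  # comment this out to mimic a leaked stream
end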
Hmm. I recall BufferedStreams.jl can help with gzip decompression performance. I also recall a new package being announced for improved buffering... If I have some more time I'll look deeper into this.
I tried on Julia 1.12.1 and it doesn't OOM for me, for either v6 or v7; memory usage is stable. Could be this improved with the newer memory layout types, which IOBuffer also uses.
Wrapping the IOStream in a BufferedInputStream did not improve performance, so ignore that idea.
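For reference, the kind of wrapping I mean is roughly this (a sketch only; BufferedInputStream comes from BufferedStreams.jl, and the real reader of course does more than open the file):

using BufferedStreams

# Buffer the raw file handle so downstream reads pull large chunks from
# the OS instead of many small ones, then hand `io` to the MAT reader.
io = BufferedInputStream(open("test/v7/array.mat"))
# ... read through `io`; in my test this made no measurable difference.
close(io)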
I'm a little concerned that I seem to get segfaults when I try to interrupt the loop, but it seems to be related to printing the return value after the interrupt: when I add a ; after the matread line, it doesn't segfault.
v7.3 files definitely keep growing my memory usage, so that HDF5 memory leak issue seems real.
Calling the HDF5 garbage collector doesn't help for me, so I'm not exactly sure it's the HDF5 issue. This code still increases memory usage slowly over time:

using MAT, HDF5

for n ∈ 1:1000000
    if mod(n, 1000) == 0
        println(n)
        HDF5.API.h5_garbage_collect()
    end
    matread("test/v7.3/array.mat")
end
Calling GC.gc() after 100-200k iterations reduces memory usage a little, but only by ~10% or so.
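Concretely, this is the variant I mean (sketch):

using MAT

# Same loop, but forcing a full Julia GC periodically. This recovers only
# ~10% of resident memory here, which suggests most of the growth is
# native memory that Julia's GC does not track.
for n ∈ 1:1000000
    mod(n, 100000) == 0 && GC.gc()  # full collection
    matread("test/v7.3/array.mat")
end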
I wonder if profiling memory usage could shed some light on the matter, though I'm not sure it would catch any open file IO.
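A sketch using Julia's built-in allocation profiler (Profile.Allocs, available since 1.8). Note it only records Julia-side allocations, so native memory held inside libhdf5 or zlib won't show up; a flat profile alongside growing RSS would itself point at the C libraries:

using Profile, MAT

# Sample ~1% of allocations while repeatedly reading the v7.3 file.
Profile.Allocs.@profile sample_rate=0.01 begin
    for _ ∈ 1:10000
        matread("test/v7.3/array.mat")
    end
end
results = Profile.Allocs.fetch()  # e.g. visualize with PProf.Allocs.pprof(results)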