DataFrames.jl
DataFrames.jl copied to clipboard
Memory issue when repeatedly creating large DataFrames
Description
Creating a large panel, and then "overwriting" it in a loop causes the process to just eat up more and more memory.
MWE
Note that this will example will use significant memory on your computer
using DataFrames
function inner_df(Nrow,Ncol)
df = DataFrame(rand(Nrow,Ncol),:auto)
end
function outer_df(N)
Nrow=76
Ncol=21
df=DataFrame(rand(0,Ncol),:auto)
for i = 1:N
append!(df,inner_df(Nrow,Ncol))
end
return df
end
function iterate(Niter)
for i =1:Niter
println(i)
@time pan = outer_df(1000*10*10)
end
end
iterate(10)
Expected Behavior
I don't know anything about the internal workins of this, but I assumed that in the second iteration of iterate
that memory usage would not increase. On my system, virtual, residential, and %RAM all increase (outputs from top
) on every iteration. Moreover, the memory is not cleared fully after the function terminates. Running it one more increases memory usage even more. Running garbage collection GC.gc()
doesn't work either.
Versions:
[a93c6f00] DataFrames v1.2.2
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)
Environment:
JULIA_DEPOT_PATH = /sftwr/user-pkg/m1eeb00/julia
JULIA_PKG_SERVER = pkg.julialang.org
JULIA_EDITOR = code
JULIA_NUM_THREADS = 6
JULIA_BINDIR = /apps/julia/current/bin/frb
I cannot reproduce it under Julia 1.6. Can you please check on Julia 1.6?
1
5.847684 seconds (17.12 M allocations: 3.550 GiB, 16.62% gc time, 9.96% compilation time)
2
5.648060 seconds (16.40 M allocations: 4.264 GiB, 13.83% gc time)
3
4.971991 seconds (16.40 M allocations: 3.532 GiB, 13.15% gc time)
4
5.633438 seconds (16.40 M allocations: 4.264 GiB, 16.16% gc time)
5
4.766959 seconds (16.40 M allocations: 3.509 GiB, 14.61% gc time)
6
4.036575 seconds (16.40 M allocations: 4.264 GiB, 17.34% gc time)
7
3.996567 seconds (16.40 M allocations: 3.509 GiB, 14.60% gc time)
8
4.200314 seconds (16.40 M allocations: 4.264 GiB, 16.51% gc time)
9
3.918262 seconds (16.40 M allocations: 3.527 GiB, 12.85% gc time)
10
4.112989 seconds (16.40 M allocations: 4.264 GiB, 16.20% gc time)
(also on system level all looks good)
I will keep this open, but your issue seems to be related to Julia not DataFrames.jl (so probably we should re-report it there if someone will be able to reproduce your problem on a current version of Julia)
Sure. I only have 1.5.3 on my work server, but here are the results again, this time with the results from top
as well to illustrate. Note that the output from @time is the same (ignoring the first run);
1
6.585599 seconds (23.57 M allocations: 4.369 GiB, 16.76% gc time)
2
5.296380 seconds (22.70 M allocations: 4.326 GiB, 12.28% gc time)
3
5.581431 seconds (22.70 M allocations: 4.715 GiB, 13.31% gc time)
and from top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
52196 m1eeb00 20 0 1252232 309544 107352 S 0.0 0.1 0:07.99 julia
52196 m1eeb00 20 0 1271548 331668 107584 R 93.8 0.1 0:10.87 julia
52196 m1eeb00 20 0 2352188 1.011g 107648 R 100.0 0.4 0:20.09 julia
52196 m1eeb00 20 0 3325976 1.487g 107648 R 100.0 0.6 0:23.16 julia
52196 m1eeb00 20 0 5876004 3.496g 107648 R 100.0 1.4 0:29.28 julia
52196 m1eeb00 20 0 5234084 3.093g 107648 R 100.0 1.2 0:32.35 julia
52196 m1eeb00 20 0 5127100 3.217g 107648 R 100.0 1.3 0:41.55 julia
52196 m1eeb00 20 0 6184624 4.447g 107648 R 99.7 1.8 0:50.75 julia
52196 m1eeb00 20 0 6714876 4.560g 107644 S 0.0 1.8 1:11.68 julia
Here you can see what the problem is more clearly. More or less after every iteration the memory usage just goes up. Note that the last and first row are values before I start running iterations(10)
Here's the output from my laptop. On my laptop, you see a somewhat steady increase in %MEM and RES. But the issue is a lot smaller than on my work server, that's for sure
Julia 1.5.0 on laptop with DataFrames 0.2.2 (outdated)
First, Julia 1.5.0 on my personal laptop:
julia> iterate(20)
1
3.499147 seconds (19.90 M allocations: 4.394 GiB, 13.01% gc time)
2
3.009857 seconds (19.90 M allocations: 4.301 GiB, 12.54% gc time)
... # Allocations are stable on my end to
And selected output from top
:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
273124 eirik 20 0 3759508 2.4g 100156 R 99.7 15.5 0:11.39 julia
273124 eirik 20 0 3934612 2.6g 100156 R 99.7 17.2 0:17.40 julia
273124 eirik 20 0 3954068 2.3g 100156 R 99.7 14.8 0:29.40 julia
273124 eirik 20 0 5212536 3.1g 100156 R 100.0 20.5 0:35.42 julia
273124 eirik 20 0 5179864 2.8g 100156 R 99.7 18.3 0:44.44 julia
273124 eirik 20 0 4901984 2.9g 100156 R 100.0 18.9 0:47.45 julia
273124 eirik 20 0 4576724 2.9g 100156 R 99.7 18.7 0:53.45 julia
273124 eirik 20 0 5193956 3.2g 100156 S 0.0 20.8 1:09.93 julia
Note that the last row in this output is after the function has finished running, so the memory isn't cleared.
Julia 1.5.0 on laptop with DataFrames 1.2.2 (outdated)
First, Julia 1.5.0 on my personal laptop:
julia> iterate(20)
1
3.499147 seconds (19.90 M allocations: 4.394 GiB, 13.01% gc time)
2
3.009857 seconds (19.90 M allocations: 4.301 GiB, 12.54% gc time)
... # Allocations are stable on my end to
And selected output from top
:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
303795 eirik 20 0 1904480 923856 102388 R 97.4 5.7 0:05.78 julia
303795 eirik 20 0 3575000 2.1g 102388 R 100.0 13.9 0:11.76 julia
303795 eirik 20 0 5196368 3.1g 102388 R 100.0 19.9 0:14.77 julia
303795 eirik 20 0 5225916 2.9g 102388 R 100.3 18.8 0:23.80 julia
303795 eirik 20 0 4660636 3.0g 102388 R 100.3 19.4 0:29.81 julia
303795 eirik 20 0 4728656 2.8g 102388 R 99.7 18.0 0:35.82 julia
303795 eirik 20 0 5291736 3.2g 102388 R 100.0 21.1 0:56.88 julia
303795 eirik 20 0 5262516 3.2g 102388 S 0.0 20.9 1:09.70 julia
Note that the last row in this output is after the function has finished running, so the memory isn't cleared.
Julia 1.6.1 on laptop with DataFrames 1.2.2:
1
4.342368 seconds (23.58 M allocations: 4.004 GiB, 16.49% gc time)
2
3.280944 seconds (22.70 M allocations: 4.437 GiB, 12.37% gc time)
3
3.384510 seconds (22.70 M allocations: 3.960 GiB, 14.74% gc time)
4
3.275392 seconds (22.70 M allocations: 3.963 GiB, 14.59% gc time)
5
3.231010 seconds (22.70 M allocations: 3.960 GiB, 14.70% gc time)
and from top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
344413 eirik 20 0 816872 276524 92776 R 15.9 1.7 0:01.79 julia
344413 eirik 20 0 908124 355792 101124 R 43.2 2.2 0:03.72 julia
344413 eirik 20 0 4523948 2.3g 101124 R 99.7 15.1 0:09.72 julia
344413 eirik 20 0 3917056 2.6g 101124 R 100.0 17.1 0:24.75 julia
344413 eirik 20 0 3672040 2.3g 101124 R 99.7 14.8 0:27.75 julia
344413 eirik 20 0 5212180 3.0g 101124 R 99.3 19.6 0:36.75 julia
344413 eirik 20 0 5020940 2.8g 101124 R 100.0 18.0 0:42.76 julia
344413 eirik 20 0 4481016 2.7g 101124 R 99.7 17.7 0:48.77 julia
344413 eirik 20 0 5154240 3.3g 101816 S 0.0 21.6 1:10.82 julia
I have checked Julia 1.0.5, Julia 1.6.2, Julia 1.6.3, and Julia 1.8 both on Win10 and Linux and single and multi-threaded and I cannot reproduce your problems. Sometimes indeed GC does not clean up things immediately but if you run several times GC.gc()
memory is reclaimed (note that because of the design of GC sometimes you need to run it several times to force full sweep).
Also I constantly have 16.40 M allocations
(except Julia 1.0.5 which has more), so it is very strange that you have 22.70 M allocations
for the same code.
Hi @bkamins, thanks for looking into this. I just went back to check my results
This is what I did:
- Take a clean Julia 1.5.3 environment on my work server
] add Dataframes
using DataFrames
- I then copy-pasted the code from my first post, and I always get (22.70 M allocations: 5.495 GiB ... )`
I tried adding about 30 GC.gc()
statements in the final loop, and it did drop memory usage somewhat, but still memory keeps climbing until it reaches some kind of limit.
# Without GC.gc() statements at the end of final loop
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
121770 m1eeb00 20 0 745368 352432 102212 R 90.8 0.1 0:03.36 julia
121770 m1eeb00 20 0 3176688 1.953g 102212 R 100.0 0.8 0:09.50 julia
121770 m1eeb00 20 0 3903148 2.509g 102212 R 99.7 1.0 0:15.64 julia
121770 m1eeb00 20 0 6318932 3.169g 102212 R 100.0 1.3 0:40.18 julia
121770 m1eeb00 20 0 6285304 4.345g 102212 S 0.0 1.7 0:56.83 julia
# With GC.gc() statements at the end of final loop
71459 m1eeb00 20 0 927784 437704 102016 R 59.2 0.2 0:03.96 julia
71459 m1eeb00 20 0 2221440 1.244g 102016 R 100.0 0.5 0:07.03 julia
71459 m1eeb00 20 0 3835136 1.684g 102016 R 100.0 0.7 0:16.23 julia
71459 m1eeb00 20 0 4007332 2.417g 102016 R 100.0 1.0 0:31.59 julia
71459 m1eeb00 20 0 3929452 2.369g 102016 R 100.0 0.9 0:43.86 julia
71459 m1eeb00 20 0 5152936 3.415g 102016 R 99.7 1.4 0:59.21 julia
71459 m1eeb00 20 0 3933288 1.995g 102016 S 0.0 0.8 1:08.62 julia
I'll try to verify again on my own laptop and see if I can get somebody else to run the code on the work server to verify there as well.
@nalimilan - could you please have a look at this on your machine? Thank you!
Here with Julia 1.6.2 on 64-bit Fedora 34 memory usage goes up to 3.3GB after running the loop. Then calling GC.gc()
reduces it to 1.1GB, but it never goes back to the original level of 300MB. Not sure whether that qualifies as problematic or not. Anyway discussion in the Julia issue will probably be more productive.
@eirikbrandsaas - what @nalimilan means (as it might not be clear from what he has written) - is that we both feel that if there is some memory leakage - it is most likely outside of DataFrames.jl and the issue should be reported in https://github.com/JuliaLang/julia. (if you would be willing to do this could you please ping us there so that we can monitor the issue - thank you!)
https://github.com/JuliaLang/julia/issues/42566 is already about this problem, right?
JuliaLang/julia#42566 is already about this problem, right?
@bkamins and @nalimilan Yes, that is the same issue. (I originally posted this in the #helpdesk channel on Slack, didn't get a response, posted this here, and a few days later someone opened the issue on the Julia github page).
Thank you for tracing this!