DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Memory issue when repeatedly creating large DataFrames

Open eirikbrandsaas opened this issue 2 years ago • 11 comments

Description

Creating a large panel, and then "overwriting" it in a loop causes the process to just eat up more and more memory.

MWE

Note that this will example will use significant memory on your computer

using DataFrames
function inner_df(Nrow,Ncol)

    df = DataFrame(rand(Nrow,Ncol),:auto)
end

function outer_df(N)
    Nrow=76
    Ncol=21
    df=DataFrame(rand(0,Ncol),:auto)

    for i = 1:N
        append!(df,inner_df(Nrow,Ncol))
    end
    return df
end

function iterate(Niter)
    for i =1:Niter
        println(i)
        @time pan = outer_df(1000*10*10)
    end
end

iterate(10)

Expected Behavior

I don't know anything about the internal workins of this, but I assumed that in the second iteration of iterate that memory usage would not increase. On my system, virtual, residential, and %RAM all increase (outputs from top) on every iteration. Moreover, the memory is not cleared fully after the function terminates. Running it one more increases memory usage even more. Running garbage collection GC.gc() doesn't work either.

Versions:

[a93c6f00] DataFrames v1.2.2
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_DEPOT_PATH = /sftwr/user-pkg/m1eeb00/julia
  JULIA_PKG_SERVER = pkg.julialang.org
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 6
  JULIA_BINDIR = /apps/julia/current/bin/frb

eirikbrandsaas avatar Oct 08 '21 16:10 eirikbrandsaas

I cannot reproduce it under Julia 1.6. Can you please check on Julia 1.6?

1
  5.847684 seconds (17.12 M allocations: 3.550 GiB, 16.62% gc time, 9.96% compilation time)
2
  5.648060 seconds (16.40 M allocations: 4.264 GiB, 13.83% gc time)
3
  4.971991 seconds (16.40 M allocations: 3.532 GiB, 13.15% gc time)
4
  5.633438 seconds (16.40 M allocations: 4.264 GiB, 16.16% gc time)
5
  4.766959 seconds (16.40 M allocations: 3.509 GiB, 14.61% gc time)
6
  4.036575 seconds (16.40 M allocations: 4.264 GiB, 17.34% gc time)
7
  3.996567 seconds (16.40 M allocations: 3.509 GiB, 14.60% gc time)
8
  4.200314 seconds (16.40 M allocations: 4.264 GiB, 16.51% gc time)
9
  3.918262 seconds (16.40 M allocations: 3.527 GiB, 12.85% gc time)
10
  4.112989 seconds (16.40 M allocations: 4.264 GiB, 16.20% gc time)

(also on system level all looks good)

I will keep this open, but your issue seems to be related to Julia not DataFrames.jl (so probably we should re-report it there if someone will be able to reproduce your problem on a current version of Julia)

bkamins avatar Oct 08 '21 16:10 bkamins

Sure. I only have 1.5.3 on my work server, but here are the results again, this time with the results from top as well to illustrate. Note that the output from @time is the same (ignoring the first run);

1
  6.585599 seconds (23.57 M allocations: 4.369 GiB, 16.76% gc time)
2
  5.296380 seconds (22.70 M allocations: 4.326 GiB, 12.28% gc time)
3
  5.581431 seconds (22.70 M allocations: 4.715 GiB, 13.31% gc time)

and from top

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ 
 52196 m1eeb00   20   0 1252232 309544 107352 S   0.0  0.1   0:07.99 julia     
 52196 m1eeb00   20   0 1271548 331668 107584 R  93.8  0.1   0:10.87 julia 
 52196 m1eeb00   20   0 2352188 1.011g 107648 R 100.0  0.4   0:20.09 julia  
 52196 m1eeb00   20   0 3325976 1.487g 107648 R 100.0  0.6   0:23.16 julia 
 52196 m1eeb00   20   0 5876004 3.496g 107648 R 100.0  1.4   0:29.28 julia  
 52196 m1eeb00   20   0 5234084 3.093g 107648 R 100.0  1.2   0:32.35 julia  
 52196 m1eeb00   20   0 5127100 3.217g 107648 R 100.0  1.3   0:41.55 julia 
 52196 m1eeb00   20   0 6184624 4.447g 107648 R  99.7  1.8   0:50.75 julia 
 52196 m1eeb00   20   0 6714876 4.560g 107644 S   0.0  1.8   1:11.68 julia  

Here you can see what the problem is more clearly. More or less after every iteration the memory usage just goes up. Note that the last and first row are values before I start running iterations(10)

eirikbrandsaas avatar Oct 08 '21 17:10 eirikbrandsaas

Here's the output from my laptop. On my laptop, you see a somewhat steady increase in %MEM and RES. But the issue is a lot smaller than on my work server, that's for sure

Julia 1.5.0 on laptop with DataFrames 0.2.2 (outdated)

First, Julia 1.5.0 on my personal laptop:

julia> iterate(20)
1
  3.499147 seconds (19.90 M allocations: 4.394 GiB, 13.01% gc time)
2
  3.009857 seconds (19.90 M allocations: 4.301 GiB, 12.54% gc time)
... # Allocations are stable on my end to

And selected output from top:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND   
 273124 eirik     20   0 3759508   2.4g 100156 R  99.7  15.5   0:11.39 julia     
 273124 eirik     20   0 3934612   2.6g 100156 R  99.7  17.2   0:17.40 julia      
 273124 eirik     20   0 3954068   2.3g 100156 R  99.7  14.8   0:29.40 julia         
 273124 eirik     20   0 5212536   3.1g 100156 R 100.0  20.5   0:35.42 julia         
 273124 eirik     20   0 5179864   2.8g 100156 R  99.7  18.3   0:44.44 julia        
 273124 eirik     20   0 4901984   2.9g 100156 R 100.0  18.9   0:47.45 julia     
 273124 eirik     20   0 4576724   2.9g 100156 R  99.7  18.7   0:53.45 julia             
 273124 eirik     20   0 5193956   3.2g 100156 S   0.0  20.8   1:09.93 julia      

Note that the last row in this output is after the function has finished running, so the memory isn't cleared.

Julia 1.5.0 on laptop with DataFrames 1.2.2 (outdated)

First, Julia 1.5.0 on my personal laptop:

julia> iterate(20)
1
  3.499147 seconds (19.90 M allocations: 4.394 GiB, 13.01% gc time)
2
  3.009857 seconds (19.90 M allocations: 4.301 GiB, 12.54% gc time)
... # Allocations are stable on my end to

And selected output from top:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND   
 303795 eirik     20   0 1904480 923856 102388 R  97.4   5.7   0:05.78 julia 
 303795 eirik     20   0 3575000   2.1g 102388 R 100.0  13.9   0:11.76 julia  
 303795 eirik     20   0 5196368   3.1g 102388 R 100.0  19.9   0:14.77 julia   
 303795 eirik     20   0 5225916   2.9g 102388 R 100.3  18.8   0:23.80 julia     
 303795 eirik     20   0 4660636   3.0g 102388 R 100.3  19.4   0:29.81 julia   
 303795 eirik     20   0 4728656   2.8g 102388 R  99.7  18.0   0:35.82 julia   
 303795 eirik     20   0 5291736   3.2g 102388 R 100.0  21.1   0:56.88 julia           
 303795 eirik     20   0 5262516   3.2g 102388 S   0.0  20.9   1:09.70 julia        

Note that the last row in this output is after the function has finished running, so the memory isn't cleared.

Julia 1.6.1 on laptop with DataFrames 1.2.2:


1
  4.342368 seconds (23.58 M allocations: 4.004 GiB, 16.49% gc time)
2
  3.280944 seconds (22.70 M allocations: 4.437 GiB, 12.37% gc time)
3
  3.384510 seconds (22.70 M allocations: 3.960 GiB, 14.74% gc time)
4
  3.275392 seconds (22.70 M allocations: 3.963 GiB, 14.59% gc time)
5
  3.231010 seconds (22.70 M allocations: 3.960 GiB, 14.70% gc time)

and from top

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND   
 344413 eirik     20   0  816872 276524  92776 R  15.9   1.7   0:01.79 julia 
 344413 eirik     20   0  908124 355792 101124 R  43.2   2.2   0:03.72 julia
 344413 eirik     20   0 4523948   2.3g 101124 R  99.7  15.1   0:09.72 julia  
 344413 eirik     20   0 3917056   2.6g 101124 R 100.0  17.1   0:24.75 julia 
 344413 eirik     20   0 3672040   2.3g 101124 R  99.7  14.8   0:27.75 julia  
 344413 eirik     20   0 5212180   3.0g 101124 R  99.3  19.6   0:36.75 julia   
 344413 eirik     20   0 5020940   2.8g 101124 R 100.0  18.0   0:42.76 julia    
 344413 eirik     20   0 4481016   2.7g 101124 R  99.7  17.7   0:48.77 julia   
 344413 eirik     20   0 5154240   3.3g 101816 S   0.0  21.6   1:10.82 julia     

eirikbrandsaas avatar Oct 08 '21 18:10 eirikbrandsaas

I have checked Julia 1.0.5, Julia 1.6.2, Julia 1.6.3, and Julia 1.8 both on Win10 and Linux and single and multi-threaded and I cannot reproduce your problems. Sometimes indeed GC does not clean up things immediately but if you run several times GC.gc() memory is reclaimed (note that because of the design of GC sometimes you need to run it several times to force full sweep).

Also I constantly have 16.40 M allocations (except Julia 1.0.5 which has more), so it is very strange that you have 22.70 M allocations for the same code.

bkamins avatar Oct 08 '21 20:10 bkamins

Hi @bkamins, thanks for looking into this. I just went back to check my results

This is what I did:

  1. Take a clean Julia 1.5.3 environment on my work server
] add Dataframes
using DataFrames
  1. I then copy-pasted the code from my first post, and I always get (22.70 M allocations: 5.495 GiB ... )`

I tried adding about 30 GC.gc() statements in the final loop, and it did drop memory usage somewhat, but still memory keeps climbing until it reaches some kind of limit.

# Without GC.gc() statements at the end of final loop
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
121770 m1eeb00   20   0  745368 352432 102212 R  90.8  0.1   0:03.36 julia  
121770 m1eeb00   20   0 3176688 1.953g 102212 R 100.0  0.8   0:09.50 julia 
121770 m1eeb00   20   0 3903148 2.509g 102212 R  99.7  1.0   0:15.64 julia 
121770 m1eeb00   20   0 6318932 3.169g 102212 R 100.0  1.3   0:40.18 julia 
121770 m1eeb00   20   0 6285304 4.345g 102212 S   0.0  1.7   0:56.83 julia 

# With GC.gc() statements at the end of final loop
 71459 m1eeb00   20   0  927784 437704 102016 R  59.2  0.2   0:03.96 julia 
 71459 m1eeb00   20   0 2221440 1.244g 102016 R 100.0  0.5   0:07.03 julia    
 71459 m1eeb00   20   0 3835136 1.684g 102016 R 100.0  0.7   0:16.23 julia  
 71459 m1eeb00   20   0 4007332 2.417g 102016 R 100.0  1.0   0:31.59 julia
 71459 m1eeb00   20   0 3929452 2.369g 102016 R 100.0  0.9   0:43.86 julia   
 71459 m1eeb00   20   0 5152936 3.415g 102016 R  99.7  1.4   0:59.21 julia  
 71459 m1eeb00   20   0 3933288 1.995g 102016 S   0.0  0.8   1:08.62 julia   

I'll try to verify again on my own laptop and see if I can get somebody else to run the code on the work server to verify there as well.

eirikbrandsaas avatar Oct 12 '21 13:10 eirikbrandsaas

@nalimilan - could you please have a look at this on your machine? Thank you!

bkamins avatar Oct 12 '21 17:10 bkamins

Here with Julia 1.6.2 on 64-bit Fedora 34 memory usage goes up to 3.3GB after running the loop. Then calling GC.gc() reduces it to 1.1GB, but it never goes back to the original level of 300MB. Not sure whether that qualifies as problematic or not. Anyway discussion in the Julia issue will probably be more productive.

nalimilan avatar Oct 12 '21 20:10 nalimilan

@eirikbrandsaas - what @nalimilan means (as it might not be clear from what he has written) - is that we both feel that if there is some memory leakage - it is most likely outside of DataFrames.jl and the issue should be reported in https://github.com/JuliaLang/julia. (if you would be willing to do this could you please ping us there so that we can monitor the issue - thank you!)

bkamins avatar Oct 12 '21 21:10 bkamins

https://github.com/JuliaLang/julia/issues/42566 is already about this problem, right?

nalimilan avatar Oct 13 '21 07:10 nalimilan

JuliaLang/julia#42566 is already about this problem, right?

@bkamins and @nalimilan Yes, that is the same issue. (I originally posted this in the #helpdesk channel on Slack, didn't get a response, posted this here, and a few days later someone opened the issue on the Julia github page).

eirikbrandsaas avatar Oct 13 '21 12:10 eirikbrandsaas

Thank you for tracing this!

bkamins avatar Oct 13 '21 13:10 bkamins