Suspected memory leak in highspy (Python 3.10)
While solving a large MIP model over several datasets in a Jupyter notebook under WSL on Windows, the memory usage kept increasing. Eventually the task fails because it uses up all the memory (8 GB). But if I run the notebooks one at a time, each solving just one dataset, everything works fine.
I suspect the memory is not completely released when a model finishes solving.
If you need more information, let me know how to collect it and I will provide it.
What version of HiGHS are you using?
highspy 1.8.1
Would you be able to share your notebook and data? That way I can try to reproduce the behaviour locally on my Windows machine with WSL.
Sorry for the late reply; I have been seriously ill recently. I have done another experiment: in one notebook I repeatedly ran the following function, which solves an n-queens model, 40 times, and the memory usage grew by about 1 GB. The memory used by the function should be freed after each call.
import highspy
import numpy as np

def nqueens(N):
    h = highspy.Highs()
    h.silent()

    x = h.addBinaries(N, N)
    h.addConstrs(x.sum(axis=0) == 1)  # each row has exactly one queen
    h.addConstrs(x.sum(axis=1) == 1)  # each col has exactly one queen

    y = np.fliplr(x)
    h.addConstrs(x.diagonal(k).sum() <= 1 for k in range(-N + 1, N))  # each diagonal has at most one queen
    h.addConstrs(y.diagonal(k).sum() <= 1 for k in range(-N + 1, N))  # each 'reverse' diagonal has at most one queen

    h.solve()
    sol = h.vals(x)
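A driver loop for this experiment might look like the following (a minimal sketch reusing the nqueens function above; the board size of 30 and the free -m call are illustrative additions, not taken from the original report):

import subprocess

for i in range(40):
    print(i)
    nqueens(30)                     # the local Highs object should be released on return
    subprocess.run(["free", "-m"])  # watch system memory after each call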
Hi @yokawhhh, @galabovaa,
I can reproduce this issue with the following test script on Windows WSL and on a physical Linux machine.
Windows WSL (Ubuntu)
total used free shared buff/cache available
Mem: 31Gi 869Mi 30Gi 3.6Mi 243Mi 30Gi
Swap: 8.0Gi 0B 8.0Gi
[...]
total used free shared buff/cache available
Mem: 31Gi 1.6Gi 29Gi 3.6Mi 244Mi 29Gi
Swap: 8.0Gi 0B 8.0Gi
Physical Linux machine (Ubuntu)
total used free shared buff/cache available
Mem: 31Gi 17Gi 10Gi 116Mi 3.0Gi 13Gi
Swap: 2.0Gi 6.1Mi 2.0Gi
[...]
total used free shared buff/cache available
Mem: 31Gi 18Gi 10Gi 116Mi 3.0Gi 12Gi
Swap: 2.0Gi 6.1Mi 2.0Gi
(In case it's significant, note that I only see the decrease in the available column, not the free column)
I also seem to get the same behavior even if I move h = highspy.Highs() and h.silent() outside the loop and call h.clearModel() at the beginning of each pass of the loop instead.
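Roughly, that variant looks like this (a sketch only, reusing the model construction from the function above; N and the iteration count are illustrative):

import highspy
import numpy as np

N = 30  # illustrative board size
h = highspy.Highs()
h.silent()
for i in range(40):
    h.clearModel()  # drop the previous model before rebuilding it
    x = h.addBinaries(N, N)
    h.addConstrs(x.sum(axis=0) == 1)
    h.addConstrs(x.sum(axis=1) == 1)
    y = np.fliplr(x)
    h.addConstrs(x.diagonal(k).sum() <= 1 for k in range(-N + 1, N))
    h.addConstrs(y.diagonal(k).sum() <= 1 for k in range(-N + 1, N))
    h.solve()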
I have a new MRE for this (using nqueens300x300.mps.txt)
import highspy
import subprocess

def nqueens300x300():
    h = highspy.Highs()
    h.silent()
    h.readModel("nqueens300x300.mps")
    h.run()

for i in range(40):
    print(i)
    nqueens300x300()
    subprocess.run(["free", "-m"])
Adds about 300 MB to memory usage by the end. Less than half of the previous example but still noticeable.
The following C++ does not seem to have this effect. (As we should hope!)
#include <cstdlib>
#include <iostream>

#include "Highs.h"

int main() {
  Highs highs;
  for (int i = 0; i < 40; ++i) {
    std::cout << i << std::endl;
    highs.setOptionValue("log_to_console", "false");
    highs.readModel("nqueens300x300.mps");
    highs.run();
    std::system("free -m");
  }
}
Any ideas @mathgeekcoder, even just for where to dig?
Thanks @BenChampion for the heads up. I had a quick look and can somewhat reproduce too. I'm not seeing it with the mps file via Python, but I do see it with the highspy construction of nqueens. I've not tried C++ yet. I've tested with WSL on Windows.
I think there's multiple issues at play here.
- I found a bug in highspy (my fault!), which causes a cyclic reference to the Highs object that might prevent the garbage collector from cleaning everything up. However, fixing this doesn't make any difference to the memory leak.
- The "leak" is much less (practically zero) if you don't actually solve the problem.
- The "leak" is also practically zero if you solve the LP relaxation instead of the IP.
- The "leak" occurs regardless of whether I use the "pythonic" wrappers or the raw C++ bindings via Python.
- gc.collect() doesn't free everything; I also needed to call malloc_trim(0) to see the memory drop.
I'm using psutil to report the memory usage of my process (so it isolates this process's usage from the rest of the system):
def memory():
    import os
    import psutil

    # Get the current process
    process = psutil.Process(os.getpid())

    # Retrieve memory usage in MB
    memory_usage_mb = process.memory_info().rss / (1024 * 1024)
    print(f"Memory Usage: {memory_usage_mb:.2f} MB")
I'll continue to debug too. It's an interesting one!
Thanks for looking @mathgeekcoder!
With your snippet, calling memory() at the end of the for-loop body, I see similar behavior with the .mps file as I did when calling free -m directly. The memory usage gets to 1.1 GB on my WSL system on Windows and only 0.6 GB on a physical Linux box, both much larger than the 300 MB I reported previously. (There may be other confounding variables too; I haven't ensured matching Python versions etc.)
One drawback of my MRE is that I don't check the return value of readModel. If it doesn't find the .mps file it happily continues and of course doesn't manifest the increasing memory usage (and terminates quite quickly).
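For completeness, the missing check would be something like the following sketch (assuming readModel returns a HighsStatus; the choice of exception is mine):

import highspy

h = highspy.Highs()
h.silent()
status = h.readModel("nqueens300x300.mps")  # check the status instead of ignoring it
if status != highspy.HighsStatus.kOk:
    raise FileNotFoundError("could not read nqueens300x300.mps")
h.run()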
I think there's multiple issues at play here.
That sounds likely to me. On both test machines I did notice the memory usage plateauing rather than uniformly increasing after each pass through the loop.
Ah!! You're correct @BenChampion, it wasn't finding the .mps file. My silly mistake. Fixing that helped replicate the issue with mps.
That said, I believe I can also replicate this in C++, and I think threading, garbage collection and glibc are the "cause".
The main issue is threading (gc and glibc just made it harder to see). I still need to investigate why the threading on Linux is keeping hold of the memory (valgrind doesn't report a leak for me).
Numbers below are in MB. Manually forcing Python's garbage collection (gc) cleans up Python objects, while calling malloc_trim(0) releases memory that glibc is holding onto for performance reasons.
| iteration | original | original gc/malloc | 1 thread | 1 thread gc | 1 thread gc/malloc |
|---|---|---|---|---|---|
| 0 | 189 | 49 | 135 | 132 | 31 |
| 39 | 889 | 357 | 351 | 140 | 30 |
import highspy
import gc
import ctypes
import os
import psutil

malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim

def memory():
    process = psutil.Process(os.getpid())
    memory_usage_mb = process.memory_info().rss / (1024 * 1024)
    print(f"{memory_usage_mb:.2f} MB")

def nqueens300x300():
    h = highspy.Highs()
    h.silent()
    h.setOptionValue("threads", 1)
    h.readModel("nqueens300x300.mps")
    h.run()
    #highspy._Highs.resetGlobalScheduler(True) # doesn't seem to help

for i in range(40):
    print(i, end='\t')
    nqueens300x300()
    gc.collect()
    malloc_trim(0)
    memory()
Okay, so I think I've worked out the threading issue. Though now I'm not sure whether this is the same problem as in the original ticket.
@BenChampion can you try running export MALLOC_ARENA_MAX=1 before running your Python script? This is not a fix, but it might help determine what's going on.
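If it's easier, the same thing can be done by launching the test in a child process with the variable set (a sketch; "leak_test.py" is a hypothetical name for the MRE, and glibc reads the variable at start-up, so it can't usefully be changed inside an already-running process):

import os
import subprocess

# Run the MRE with glibc restricted to a single malloc arena.
env = dict(os.environ, MALLOC_ARENA_MAX="1")
subprocess.run(["python3", "leak_test.py"], env=env)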
When I do this, I get the following:
| iteration | original | original gc/malloc | 1 thread | 1 thread gc | 1 thread gc/malloc |
|---|---|---|---|---|---|
| 0 | 164 | 31 | 159 | 156 | 30 |
| 39 | 453 | 30 | 456 | 165 | 30 |
That is, memory doesn't increase even with multiple threads (after we force garbage collection and release the glibc cache).
My silly mistake. Fixing that helped replicate the issue with mps
My bad for laziness around error handling!
After export MALLOC_ARENA_MAX=1, adding an "original gc" column, and running on a physical Linux machine (not WSL)
| iteration | original | original gc | original gc/malloc | 1 thread | 1 thread gc | 1 thread gc/malloc |
|---|---|---|---|---|---|---|
| 0 | 132 | 132 | 27 | 117 | 117 | 26 |
| 39 | 454 | 160 | 28 | 433 | 112 | 26 |
(That is, similar results.)
Did you manage to reproduce this in C++?
And just to make sure, is the following summary/interpretation of our findings so far correct?
- Although it seems to make no difference in our tests, there's a cyclic dependency highspy is creating that might stop Python from freeing memory in some cases. (For my own interest, could you point me to the relevant line(s)?)
- Otherwise, it looks like most of the symptoms in our tests are coming from Python and glibc management of memory that is "in theory" available.
(That is, similar results.)
Thanks for confirming @BenChampion!
Did you manage to reproduce this in C++?
Yes.
- Although it seems to make no difference in our tests, there's a cyclic dependency highspy is creating that might stop Python from freeing memory in some cases. (For my own interest, could you point me to the relevant line(s)?)
Yes: HighsCallback.highs. Instead of pointing directly to the relevant highs object, it probably should use weakref.ref(highs). There's also HighspyArray.highs, though only the callback has the cyclic dependency. That said, this cyclic dependency issue could be avoided if the user calls clearCallbacks etc. once they're done - but that's not particularly nice.
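For illustration, the weakref pattern being suggested looks roughly like this minimal sketch (toy classes, not the actual highspy implementation):

import weakref

class OwnerSketch:
    """Toy stand-in for the Highs object."""
    pass

class CallbackSketch:
    """Toy stand-in for HighsCallback, holding its owner weakly."""
    def __init__(self, owner):
        # weakref.ref breaks the owner -> callback -> owner cycle, so the owner
        # can be freed by reference counting instead of the cycle collector.
        self._owner_ref = weakref.ref(owner)

    @property
    def owner(self):
        return self._owner_ref()  # None once the owner has been collected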
- Otherwise, it looks like most of the symptoms in our tests are coming from Python and glibc management of memory that is "in theory" available.
That's my understanding too. It's not really a bug, but it's using more memory than people might expect.
This might also be the original issue, but I'd imagine the Python garbage collector and glibc clean-up would kick in before you run out of memory. It's not a memory leak; it's a side-effect of our threading model and the glibc allocator.
There are a few things we could do if we wanted to avoid this behaviour: limit the number of glibc arenas programmatically, reconsider how we do work stealing across our threads, or use a different malloc allocator; there are potential performance benefits to the latter too (see #2476).
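As a rough illustration of the first option, glibc can be limited to a single arena programmatically with mallopt; here is a Python sketch using ctypes (glibc-specific, M_ARENA_MAX is taken from glibc's malloc.h, and the call needs to happen before worker threads start allocating):

import ctypes

libc = ctypes.CDLL("libc.so.6")
M_ARENA_MAX = -8  # mallopt parameter defined in glibc's malloc.h

# Equivalent in effect to running with MALLOC_ARENA_MAX=1.
libc.mallopt(M_ARENA_MAX, 1)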
FYI: I've tried a few of them via my C++ test, i.e., injecting the different allocators via:
LD_PRELOAD=lib*malloc.so ./highs
| iterations | glibc | glibc ARENA=1 | mimalloc | jemalloc | tcmalloc |
|---|---|---|---|---|---|
| 0 | 163 MB | 138 MB | 209 MB | 142 MB | 218 MB |
| 39 | 626 MB | 148 MB | 254 MB | 203 MB | 241 MB |
One challenge is that the arena concept helps speed up allocations across multiple threads (so we probably want more than one), but we also have work stealing that can fragment allocations across the different arena heaps. The other allocators have similar concepts, and we could tune whichever one we chose to fit our needs best.
Great to see @BenChampion and @mathgeekcoder sparking off each other to investigate this! 🤩
Yes: HighsCallback.highs. Instead of pointing directly to the relevant highs object, it probably should use weakref.ref(highs). There's also HighspyArray.highs, though only the callback has the cyclic dependency. That said, this cyclic dependency issue could be avoided if the user calls clearCallbacks etc. once they're done - but that's not particularly nice.
I can create a new issue for this. (I can also try making the required changes.)
This might also be the original issue, but I'd imagine the python garbage collector and glibc clean-up kicking in before you run out of memory.
I would have thought that too, but it does seem like glibc isn't always that smart and can still exhaust memory.
An example of this happening in practice in another project (avoiding the backlink, I hope, by inserting www.!)
In any case, I propose we close this issue for now since there doesn't appear to be an actual memory leak and the above investigations provide several potential workarounds for affected users.
I can create a new issue for this. (I can also try making the required changes.)
Great! Should be fairly straightforward. I'm happy to review the PR.
I would have thought that too, but it does seem like glibc isn't always that smart and can still exhaust memory.
In any case, I propose we close this issue for now since there doesn't appear to be an actual memory leak and the above investigations provide several potential workarounds for affected users.
Yeah, I agree. That said, I think it'll be worth revisiting memory allocation and threading in the future for better performance and reduced memory overhead/fragmentation. Sounds rather fascinating, so I'll add that to my list of investigation TODOs :)