Possible memory leak
Please read all comments to get the full picture of the issue, as the problem appears to be more general than first described here.
Describe the bug
We've had reports of Mantid crashing due to memory use when users are auto-processing INTER experiment data while time slicing. I've done a bit of investigating and it looks to me like there may be a memory leak somewhere. This is most noticeable when we time slice in the GUI.
To Reproduce
The easiest way to replicate is as follows:
- Go to Interfaces->Reflectometry->ISIS Reflectometry.
- On the Runs tab, in the first row of the table on the right-hand side, enter run `77064` and angle `0.8`.
- On the Event Handling tab, under the "Uniform slicing" section, select the `(sec) slices` radio button and enter `5` to slice the data into 5 second slices.
- On the Experiment Settings tab, enter the following into the first row of the Settings Lookup table:
  - ROI: `3-258`
  - ROI Detector IDs: `3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256`
- Back on the Runs tab, click the process button. This should complete successfully and create a set of workspace groups. Take a note of the memory usage at this stage.
- Clear the workspace list using the button in Mantid Workbench. The memory use won't go down, but I don't think this in itself necessarily indicates a memory leak; it may just be that the memory hasn't been returned to the system yet. To demonstrate the leak, click to process the row again in the Reflectometry GUI. The memory usage should continue to build up. If you keep doing this in a loop (usually 5 or 6 times, clearing the workspace list in between), the memory completely fills up and Mantid eventually crashes. (See the logging sketch after this list for one way to record the usage at each stage.)
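To make the build-up easier to record than eyeballing the memory bar, a minimal logging sketch is below. It assumes the `psutil` package is importable in the Python environment, which is not something Mantid itself guarantees:

```python
# Minimal memory-logging helper. The psutil dependency is an assumption;
# it is not bundled as part of Mantid.
import psutil

def log_memory(label):
    vm = psutil.virtual_memory()
    print(f"{label}: {vm.used / 2**30:.1f} GiB used of {vm.total / 2**30:.1f} GiB")

log_memory("after 1st process")  # call again after each clear/re-process
```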
If I do the above without the GUI (i.e. calling ReflectometryISISLoadAndProcess from a script while time-slicing and clearing workspaces between each run) then the memory doesn't seem to build up very much at all. If I run the algorithm directly from the algorithm dialog then the memory does build up more quickly and eventually causes a crash.
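For reference, a minimal sketch of what I mean by the scripted equivalent, using the settings from the GUI steps above; the `SliceWorkspace`/`TimeInterval` slicing options are taken from the algorithm's time-slicing support, so treat the exact call as illustrative rather than definitive:

```python
from mantid.simpleapi import ReflectometryISISLoadAndProcess
from mantid.api import AnalysisDataService

# Repeatedly process the same run with 5 second time slicing, clearing
# the workspaces between iterations, to mimic the GUI loop above.
for i in range(6):
    ReflectometryISISLoadAndProcess(
        InputRunList="77064",
        ThetaIn=0.8,
        SliceWorkspace=True,  # assumption: slicing enabled via this flag
        TimeInterval=5,
        ProcessingInstructions="3-258",
        ROIDetectorIDs="3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256",
    )
    AnalysisDataService.clear()
```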
Expected behavior
Repeated processing of the data should not cause the memory to fill up and cause a crash if the workspace list is being cleared in between.
Platform/Version (please complete the following information):
- IDAaaS, can't replicate on Windows, need to test on Linux and potentially Mac
- Mantid Version - seems to be a problem in the Nightly, can't replicate it in version 6.9.1 or earlier.
Additional Information
Another way I've tested is to run this script repeatedly, clearing the workspaces in between; it does not perform the time slicing:
```python
from mantid.simpleapi import *

num = 77064
for i in range(50):
    num = num + 1
    ReflectometryISISLoadAndProcess(InputRunList=str(num),
                                    ThetaIn=0.8,
                                    ROIDetectorIDs="3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256",
                                    ProcessingInstructions="3-258")
```
I haven't been able to crash Mantid using this script, though, even though the memory fills up to over 90% when it is run repeatedly. The memory builds up pretty slowly when tested this way.
@cailafinn would you be happy to double check this on IDAaaS and check if you see the same behaviour on your Linux machine (and maybe Mac)? I would recommend using the first set of testing instructions (i.e. the time slicing example). While the script I've provided for the second way of causing the memory to fill up gets to over 90%, I can't seem to get it to the point where it crashes Mantid, so I think the first set of instructions is the most reliable way to demonstrate the problem.
Reproduced on Ubuntu 22.04 (Nightly conda package). Took two repeats of clicking Process to crash on a system with 32GiB of RAM. When the ADS was cleared, memory usage dropped somewhat, but not significantly. See below.
Breakdowns:
Nightly:
- Start:
  - Mantid: 0.5GiB
  - System: 5.2GiB
- 1st Process:
  - Mantid: 15.8GiB
  - System: 20.4GiB
- ADS Clear:
  - Mantid: 13.3GiB
  - System: 18.1GiB
- 2nd Process:
  - Usage rose over 5-7 seconds until it hit the limits of the system; workbench froze and was killed by the OOM Killer.
v6.9.1 and earlier:
- Start:
  - Mantid: 0.4GiB
  - System: 1.1GiB
- 1st Process:
  - Mantid: 15.8GiB
  - System: 16.4GiB
- ADS Clear:
  - Mantid: 13.3GiB
  - System: 14.0GiB
- 2nd Process:
  - Mantid: 15.8GiB
  - System: 16.5GiB
- 3rd Process:
  - Mantid: 15.7GiB
  - System: 16.5GiB
This would appear to be a regression.
As Caila has suggested, we should see if this PR to address issues with jemalloc has any impact on this issue. We've not made any changes to the Reflectometry code since 6.9.1 that seem like they should be causing this, and it isn't happening on Windows. I'll test on IDAaaS when that PR has gone in and see where we are then.
This doesn't appear to have been helped by the jemalloc change. We're still waiting on results from a Valgrind scan. In the meantime I've done some more testing on IDAaaS and have found this alternative way to replicate the behaviour:
- Load data using `LoadEventNexus` with the following settings:
  - Filename = `INTER77064`
  - LoadMonitors = `True`
  - MonitorsLoadOnly = `Histogram`
- From the algorithm dialog, run `ReflectometrySliceEventWorkspace` with the following settings:
  - InputWorkspace = name of loaded workspace
  - MonitorWorkspace = name of loaded monitor workspace
  - UseNewFilterAlgorithm = `True` (note the memory builds up regardless of how this is set)
  - TimeInterval = `5`
It should show the same behaviour as the Reflectometry GUI. If `ReflectometrySliceEventWorkspace` is called repeatedly via a script, though, the memory doesn't seem to build up in the same way; it seems to be necessary to call it from the GUI. A rough scripted version of what I tried is below.
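This is a sketch of the scripted repeat, not the exact code; the `events_monitors` name assumes the usual `<OutputWorkspace>_monitors` convention for the monitor output of `LoadEventNexus`:

```python
from mantid.simpleapi import LoadEventNexus, ReflectometrySliceEventWorkspace

# Load the event data once, with the monitors loaded as a histogram workspace.
LoadEventNexus(Filename="INTER77064", OutputWorkspace="events",
               LoadMonitors=True, MonitorsLoadOnly="Histogram")

# Repeatedly slice into 5 second intervals. Called this way from a script,
# the memory does not appear to build up the way it does from the dialog.
for i in range(10):
    ReflectometrySliceEventWorkspace(InputWorkspace="events",
                                     MonitorWorkspace="events_monitors",
                                     UseNewFilterAlgorithm=True,
                                     TimeInterval=5,
                                     OutputWorkspace="sliced")
```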
Depending on the results of the Valgrind scan, if further investigation is needed it will need to be done by someone with access to a Linux machine as I still can't replicate on Windows.
I've done a bit more testing on IDAaaS and this is starting to look like a more general problem. I can replicate it using the LoadISISNexus algorithm to load a WISH dataset.
I've found that if you run the following script then it doesn't build up the memory:
```python
from mantid.simpleapi import *
from mantid.api import AnalysisDataService

for i in range(3):
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
    AnalysisDataService.clear()
```
However if you do the above in any of the following ways then the memory does build up:
- Run `LoadISISNexus` via the algorithm dialog and clear the ADS using the Clear button.
- Run `LoadISISNexus` via the algorithm dialog and clear the ADS by running `AnalysisDataService.clear()` in the script editor.
- Run `LoadISISNexus` via the script editor and clear the ADS using the Clear button.
This doesn't seem to happen in version 6.9.1 of Mantid, and again doesn't happen on Windows. When the memory fills up with this test it seems to eventually cause Mantid to freeze and become unresponsive on IDAaaS rather than crash.
The steps were followed on IDAaaS using the Valgrind memory analysis tool to check for any memory leaks. The main observation was a discrepancy between the amount of memory reported as leaked by Valgrind and the memory usage reported by Mantid Workbench, as shown below. As can be seen, there has not been a significant memory leak in this case.
- Valgrind summary -> [screenshot]
- System memory usage reported by Mantid at the end of test step 6 -> [screenshot]

Valgrind log file for reference: 37401_valgrind.zip
One billion permutations later, here are my results. Testing on macOS to follow.
Testing on Ubuntu 22.04. 32GiB RAM.
Run Script Once: Loop Loading Call
```python
for i in range(15):
    AnalysisDataService.clear()
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
```
- Does not crash the program, even if the same load and clear function is performed multiple times within the script.
Run Script Multiple Times: Single Loading Call
```python
# Repeatedly ran 3 times.
AnalysisDataService.clear()
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
```
- With each re-run, the memory usage was seen to drop by around 2.3GiB when the ADS was cleared.
- Memory usage then rose again. This could only be repeated 3 times before Mantid took up too much memory and was killed by the OOM Killer.
All values in the tables below are memory usage in GiB.

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---|---|---|---|---|---|---|---|---|---|---|
| Nightly | 0.3 | 14.8 | 13.1 | 26.2 | 23.9 | CRASH | | | | |
| 6.9.1 | 0.3 | 14.8 | 13.4 | 26.2 | 23.8 | CRASH | | | | |
| 6.7 | 0.4 | 14.6 | 12.3 | 14.8 | 12.6 | 26.3 | 23.8 | 26.0 | 24.1 | CRASH |
- Behaviour appears to be the same.
Run Script Multiple Times: Swap Clear Order
```python
# Repeatedly ran 4 times.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
AnalysisDataService.clear()
```
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---|---|---|---|---|---|---|---|---|---|---|
| Nightly | 0.5 | 14.8 | 12.3 | 26.2 | 23.6 | 25.8 | 23.8 | CRASH | N/A | N/A |
| 6.9.1 | 0.4 | 14.8 | 12.3 | 25.7 | 23.7 | 25.9 | 23.9 | 25.8 | 24.1 | CRASH |
- Behaviour is similar. This way around was a bit more variable for some reason, but I don't think this one is specifically a regression.
No Script
```python
# Click Load and load WISH57086 from the algorithm dialog.
# Clear the ADS using the clear button.
```
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---|---|---|---|---|---|---|---|---|---|---|
| Nightly | 0.4 | 14.8 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| 6.9.1 | 0.4 | 14.8 | 12.3 | 14.5 | 12.5 | 14.6 | 12.7 | 14.7 | 12.9 | 14.7 |
| 6.7 | 0.4 | 14.8 | 12.3 | 14.8 | 12.4 | 14.8 | 12.6 | 14.7 | 12.7 | 14.7 |
- Okay, beginning to see a difference here.
Load with Algorithm Dialog, Clear From Script
```python
# Click Load and load WISH57086 from the algorithm dialog.
# Run the below from a script.
AnalysisDataService.clear()
```
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---|---|---|---|---|---|---|---|---|---|---|
| Nightly | 0.5 | 14.8 | 12.3 | 26.2 | 24.9 | CRASH | | | | |
| 6.9.1 | 0.5 | 14.7 | 12.3 | 14.8 | 12.5 | 14.7 | 13.1 | 14.5 | 13.2 | 14.7 |
- Same behaviour as with no scripting involved.
Load with a Script, Clear from the Button
```python
# Run the below script to load WISH57086.
# Then clear using the button on the workbench GUI.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
```
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---|---|---|---|---|---|---|---|---|---|---|
| Nightly | 0.5 | 14.7 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| 6.9.1 | 0.4 | 14.6 | 13.4 | 26.2 | 23.7 | CRASH | | | | |
| 6.7 | 0.4 | 13.8 | 12.3 | 26.2 | 23.7 | 26.2 | 23.9 | CRASH | | |
- Okay, so the sum of all of this seems to be that something has changed when loading from the algorithm dialog. Crashes occur on both versions when loading from scripts, regardless of how the ADS clear is done. The only thing that stops the crashes from occurring in 6.9.1 is using the algorithm dialog to load the data.
Testing on macOS 14.4.1 (Intel). 32GB RAM.
No Script
```python
# Click Load and load WISH57086 from the algorithm dialog.
# Clear the ADS using the clear button.
```
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---|---|---|---|---|---|---|---|---|---|---|
| Nightly | 0.4 | 5.5 | 2.1 | 7.4 | 2.9 | 7.5 | 6.2 | 10.0 | 3.3 | 12.2 |
| 6.7 | 0.5 | 12.5 | 2.5 | 12.5 | 3.1 | 12.5 | 3.3 | 12.6 | etc. | |
- Despite the workspace appearing in the "Workspaces" list as 20GB, it didn't appear to take up all of that as RAM. I wonder if there's something different about how macOS handles the memory for large files?
- Regardless, there doesn't seem to be an issue on macOS; it appears to be specific to Linux.
@cailafinn Are you launching with the workbench entry point, or using the mantidworkbench script?
The workbench entry point on both OSes, always using the version from conda.
We've noticed that the version of the Nightly from 24th May (which is the first to include the jemalloc pin) is looking much better on IDAaaS now, when started with the launcher from the Applications menu (FYI @sf1919). Tom has explained that launching the Nightly using the IDAaaS launcher sets up jemalloc, whereas using the workbench entry point does not use the script that sets up jemalloc.
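A quick way to check which case a running session is in (a sketch that just inspects the environment the process inherited; the `mantidworkbench` script typically injects jemalloc via `LD_PRELOAD`, while the bare `workbench` entry point leaves it unset):

```python
import os

# An empty/unset LD_PRELOAD suggests jemalloc was not preloaded into this process.
print(os.environ.get("LD_PRELOAD", "<LD_PRELOAD not set>"))
```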
Re-tested launching using mantidworkbench:
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
|---|---|---|---|---|---|---|---|---|---|---|
| Nightly | 0.5 | 14.7 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| Nightly (mantidworkbench) | 0.5 | 15.9 | 4.3 | 16.2 | 4.2 | 16.2 | 4.2 | 16.2 | 4.5 | 16.1 |
| 6.9.1 | 0.4 | 14.6 | 13.4 | 26.2 | 23.7 | CRASH | | | | |
| 6.7 | 0.4 | 13.8 | 12.3 | 26.2 | 23.7 | 26.2 | 23.9 | CRASH | | |
Seems to fix the issue. Do we need to alter the launch instructions for Linux? We currently recommend starting it using the workbench entry in the Downloads docs.
Some of my findings when loading and clearing all via the GUI (all nightly versions here are 6.9.20240524.1804):

On Ubuntu:
- v6.9.1 is fine with and without jemalloc (both `mantidworkbench` and `workbench`)
- the nightly version fills up memory and freezes with and without jemalloc
On IDAaaS:
- the nightly version via `mantidworkbench` (with jemalloc) appears to fill up the memory, but is actually OK because it gets released just in time
- the nightly version via `workbench` freezes after a few iterations and eventually crashes
- 6.9.1 seems to be fine in both cases. With jemalloc it seems to give the memory back much quicker than the nightly does. Maybe because we downgraded from 5.3.0 to 5.2.0.
With the further changes to IDAaaS last week how are things looking @thomashampson ?