
Possible memory leak

Open rbauststfc opened this issue 1 year ago • 15 comments

Please read all the comments to get the full picture of the issue, as the problem appears to be more general than first described here.

Describe the bug

We've had reports of Mantid crashing due to running out of memory when users are auto-processing INTER experiment data with time slicing. I've done a bit of investigating and it looks to me like there may be a memory leak somewhere. It is most noticeable when we time slice in the GUI.

To Reproduce

The easiest way to replicate is as follows:

  1. Go to Interfaces -> Reflectometry -> ISIS Reflectometry.
  2. On the Runs tab, in the first row of the table on the right-hand side, enter run 77064 and angle 0.8.
  3. On the Event Handling tab, under the "Uniform slicing" section, select the (sec) slices radio button and enter 5 to slice the data into 5-second slices.
  4. On the Experiment Settings tab enter the following into the first row of the Settings Lookup table:
    • ROI: 3-258
    • ROI Detector IDs: 3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256
  5. Back on the Runs tab, click the Process button. This should complete successfully and create a set of workspace groups. Take a note of the memory usage at this stage.
  6. Clear the workspace list using the button in Mantid Workbench. The memory use won't go down, but I don't think this in itself necessarily indicates a memory leak, as it may just be that the system hasn't reclaimed the memory yet. To demonstrate the leak, click to process the row again in the Reflectometry GUI. The memory usage should continue to build up. If you keep doing this in a loop (usually 5 or 6 times, clearing the workspace list in between), the memory completely fills up and Mantid eventually crashes.

If I do the above without the GUI (i.e. calling ReflectometryISISLoadAndProcess from a script while time-slicing and clearing workspaces between each run) then the memory doesn't seem to build up very much at all. If I run the algorithm directly from the algorithm dialog then the memory does build up more quickly and eventually causes a crash.
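For reference, below is a minimal sketch of the script-based, time-sliced variant I mean here. The slicing properties (SliceWorkspace, TimeInterval) are my reading of the algorithm's uniform-slicing options, so treat them as assumptions and adjust as needed:

from mantid.simpleapi import ReflectometryISISLoadAndProcess
from mantid.api import AnalysisDataService

# Repeatedly process the same run with 5-second uniform slicing, clearing the
# workspaces between iterations. SliceWorkspace/TimeInterval are assumed names
# for the algorithm's slicing options.
for i in range(10):
    ReflectometryISISLoadAndProcess(
        InputRunList="77064",
        ThetaIn=0.8,
        ProcessingInstructions="3-258",
        ROIDetectorIDs="3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256",
        SliceWorkspace=True,
        TimeInterval=5,
    )
    AnalysisDataService.clear()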

Expected behavior

Repeated processing of the data should not cause the memory to fill up and cause a crash if the workspace list is being cleared in between.

Platform/Version (please complete the following information):

  • IDAaaS, can't replicate on Windows, need to test on Linux and potentially Mac
  • Mantid Version - seems to be a problem in the Nightly, can't replicate it in version 6.9.1 or earlier.

Additional Information

Another way I've tested is to run this script repeatedly (which does not perform the time slicing), clearing the workspaces in between:

from mantid.simpleapi import *

# Process a series of consecutive INTER runs without time slicing; the workspace
# list was cleared manually between repeats of this script.
num = 77064
for i in range(50):
    num = num + 1
    ReflectometryISISLoadAndProcess(InputRunList=str(num), ThetaIn=0.8, ROIDetectorIDs="3001-3256,4001-4256,5001-5256,6001-6256,7001-7256,8001-8256,9001-9256,10001-10256", ProcessingInstructions="3-258")

I haven't been able to crash Mantid using this script, however, even though the memory fills up to over 90% when it is run repeatedly. The memory builds up fairly slowly when tested this way.

rbauststfc avatar May 20 '24 11:05 rbauststfc

@cailafinn would you be happy to double check this on IDAaaS and check if you see the same behaviour on your Linux machine (and maybe Mac)? I would recommend using the first set of testing instructions (i.e. the time slicing example). While the script I've provided for the second way of causing the memory to fill up gets to over 90%, I can't seem to get it to the point where it crashes Mantid, so I think the first set of instructions is the most reliable way to demonstrate the problem.

rbauststfc avatar May 20 '24 15:05 rbauststfc

Reproduced on Ubuntu 22.04 (Nightly conda package). Took two repeats of clicking Process to crash on a system with 32GiB of RAM. When the ADS was cleared, memory usage dropped somewhat, but not significantly. See below.

Breakdowns:

Nightly:

  • Start:
    • Mantid: 0.5GiB
    • System: 5.2GiB
  • 1st Process:
    • Mantid: 15.8GiB
    • System: 20.4GiB
  • ADS Clear:
    • Mantid: 13.3GiB
    • System: 18.1GiB
  • 2nd Process:
    • Usage rose over 5-7 seconds until it reached the limits of the system; Workbench froze and was then killed by the OOM killer.

v6.9.1 and earlier:

  • Start:
    • Mantid: 0.4GiB
    • System: 1.1GiB
  • 1st Process:
    • Mantid: 15.8GiB
    • System: 16.4GiB
  • ADS Clear:
    • Mantid: 13.3GiB
    • System: 14.0GiB
  • 2nd Process:
    • Mantid: 15.8GiB
    • System: 16.5GiB
  • 3rd Process:
    • Mantid: 15.7GiB
    • System: 16.5GiB

This would appear to be a regression.

cailafinn avatar May 21 '24 13:05 cailafinn

As Caila has suggested, we should see if this PR to address issues with jemalloc has any impact on this issue. We've not made any changes to the Reflectometry code since 6.9.1 that seem like they should be causing this, and it isn't happening on Windows. I'll test on IDAaaS when that PR has gone in and see where we are then.

rbauststfc avatar May 21 '24 15:05 rbauststfc

This doesn't appear to have been helped by the jemalloc change. We're still waiting on results from a Valgrind scan. In the meantime I've done some more testing on IDAaaS and have found this alternative way to replicate the behaviour:

  1. Load data using LoadEventNexus with the following settings:
    • Filename = INTER77064
    • LoadMonitors = True
    • MonitorsLoadOnly = Histogram
  2. From the algorithm dialog, run ReflectometrySliceEventWorkspace with the following settings:
    • InputWorkspace = name of loaded workspace
    • MonitorWorkspace = name of loaded monitor workspace
    • UseNewFilterAlgorithm = True (note the memory builds up regardless of how this is set)
    • TimeInterval = 5

It should show the same behaviour as the Reflectometry GUI. If ReflectometrySliceEventWorkspace is called repeatedly via a script, though, then the memory doesn't seem to build up in the same way; it seems to be necessary to call it from a GUI.
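For comparison, here is a rough script equivalent of the dialog steps above. The workspace names and the "_monitors" suffix for the monitor workspace are assumptions about the naming conventions, so adjust them to whatever appears in the workspace list:

from mantid.simpleapi import LoadEventNexus, ReflectometrySliceEventWorkspace

# Step 1: load the event data with the monitors loaded as a histogram workspace
# (assumed to appear as "events" and "events_monitors").
LoadEventNexus(Filename="INTER77064", OutputWorkspace="events",
               LoadMonitors=True, MonitorsLoadOnly="Histogram")

# Step 2: slice the events into 5-second intervals.
ReflectometrySliceEventWorkspace(InputWorkspace="events",
                                 MonitorWorkspace="events_monitors",
                                 UseNewFilterAlgorithm=True,
                                 TimeInterval=5,
                                 OutputWorkspace="events_sliced")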

rbauststfc avatar May 23 '24 15:05 rbauststfc

Depending on the results of the Valgrind scan, if further investigation is needed it will need to be done by someone with access to a Linux machine as I still can't replicate on Windows.

rbauststfc avatar May 23 '24 15:05 rbauststfc

I've done a bit more testing on IDAaaS and this is starting to look like a more general problem. I can replicate it using the LoadISISNexus algorithm to load a WISH dataset.

I've found that if you run the following script then it doesn't build up the memory:

from mantid.simpleapi import LoadISISNexus
from mantid.api import AnalysisDataService

for i in range(3):
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
    AnalysisDataService.clear()

However if you do the above in any of the following ways then the memory does build up:

  • Run LoadISISNexus via the algorithm dialog and clear the ADS using the Clear button.
  • Run LoadISISNexus via the algorithm dialog and clear the ADS by running AnalysisDataService.clear() in the script editor.
  • Run LoadISISNexus via the script editor and clear the ADS using the Clear button.

This doesn't seem to happen in version 6.9.1 of Mantid, and again doesn't happen on Windows. When the memory fills up with this test it seems to eventually cause Mantid to freeze and become unresponsive on IDAaaS rather than crash.
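In case it helps with reproducing this, here is a small sketch for logging the process memory around each load/clear cycle. It assumes psutil is available in the Python environment (it isn't something Mantid itself provides), so this is just one way of getting numbers rather than the tool used for the figures above:

import os
import psutil  # assumption: psutil is installed in the environment running Workbench
from mantid.simpleapi import LoadISISNexus
from mantid.api import AnalysisDataService

proc = psutil.Process(os.getpid())

def rss_gib():
    # Resident set size of the current (Workbench) process, in GiB.
    return proc.memory_info().rss / 1024**3

for i in range(5):
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
    print(f"iteration {i}: after load  {rss_gib():.1f} GiB")
    AnalysisDataService.clear()
    print(f"iteration {i}: after clear {rss_gib():.1f} GiB")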

rbauststfc avatar May 24 '24 08:05 rbauststfc

The steps were followed on IDAaaS using the Valgrind memory analysis tool to check for any memory leaks. The main observation was a discrepancy between the amount of memory reported as leaked by Valgrind and the memory usage reported by Mantid Workbench, as shown below. As can be seen, Valgrind does not report a significant memory leak in this case.

  • Valgrind summary -> [screenshot: leaks]
  • System memory usage reported by Mantid at the end of test step 6 -> [screenshot: memory_usage]

Valgrind log file for reference 37401_valgrind.zip

warunawickramasingha avatar May 24 '24 16:05 warunawickramasingha

One billion permutations later, here are my results. Testing on macOS to follow.

Testing on Ubuntu 22.04. 32GiB RAM.

Run Script Once: Loop Loading Call

for i in range(15):
    AnalysisDataService.clear()
    LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
  • Does not crash the program, even if the same load and clear function is performed multiple times within the script.

Run Script Multiple Times: Single Loading Call

# Repeatedly ran 3 times.
AnalysisDataService.clear()
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
  • With each re-run, the memory usage was seen to drop by around 2.3GiB when the ADS was cleared.
  • Memory usage then rose again. This could only be repeated 3 times before Mantid took up too much memory and was killed by the OoM Killer.
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
| ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- |
| Nightly | 0.3 | 14.8 | 13.1 | 26.2 | 23.9 | CRASH | | | | |
| 6.9.1 | 0.3 | 14.8 | 13.4 | 26.2 | 23.8 | CRASH | | | | |
| 6.7 | 0.4 | 14.6 | 12.3 | 14.8 | 12.6 | 26.3 | 23.8 | 26.0 | 24.1 | CRASH |
  • Behaviour appears to be the same.

Run Script Multiple Times: Swap Clear Order

# Repeatedly ran 4 times.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
AnalysisDataService.clear()
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
| ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- |
| Nightly | 0.5 | 14.8 | 12.3 | 26.2 | 23.6 | 25.8 | 23.8 | CRASH | N/A | N/A |
| 6.9.1 | 0.4 | 14.8 | 12.3 | 25.7 | 23.7 | 25.9 | 23.9 | 25.8 | 24.1 | CRASH |
  • Behaviour is similar. This way around was a bit more variable for some reason, but I don't think this one is specifically a regression.

No Script

# Click Load and load WISH57086 from the algorithm dialog
# Clear the ADS using the clear button. 
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
| ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- |
| Nightly | 0.4 | 14.8 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| 6.9.1 | 0.4 | 14.8 | 12.3 | 14.5 | 12.5 | 14.6 | 12.7 | 14.7 | 12.9 | 14.7 |
| 6.7 | 0.4 | 14.8 | 12.3 | 14.8 | 12.4 | 14.8 | 12.6 | 14.7 | 12.7 | 14.7 |
  • Okay, beginning to see a difference here.

Load with Algorithm Dialog, Clear From Script

# Click Load and load WISH57086 from the algorithm dialog
# Run below from a script
AnalysisDataService.clear()
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
| ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- |
| Nightly | 0.5 | 14.8 | 12.3 | 26.2 | 24.9 | CRASH | | | | |
| 6.9.1 | 0.5 | 14.7 | 12.3 | 14.8 | 12.5 | 14.7 | 13.1 | 14.5 | 13.2 | 14.7 |
  • Same behaviour as with no scripting involved.

Load with a script, Clear from the Button

# Run the below script to load WISH57086
# Then clear using the button on the workbench GUI.
LoadISISNexus(Filename="WISH57086", LoadMonitors="Include", OutputWorkspace="test")
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
| ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- |
| Nightly | 0.5 | 14.7 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| 6.9.1 | 0.4 | 14.6 | 13.4 | 26.2 | 23.7 | CRASH | | | | |
| 6.7 | 0.4 | 13.8 | 12.3 | 26.2 | 23.7 | 26.2 | 23.9 | CRASH | | |
  • Okay, so the total of all of this seems to tell us that something has changed when loading from the algorithm dialog. Crashes occur on both versions when loading from scripts, regardless of what is done with the ADS clear. The only thing that stops the crashes from occurring in 6.9.1 is using the algorithm dialog to load the data.

cailafinn avatar May 28 '24 11:05 cailafinn

Testing on macOS 14.4.1 (Intel). 32GB RAM.

No Script

# Click Load and load WISH57086 from the algorithm dialog
# Clear the ADS using the clear button. 
| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
| ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- |
| Nightly | 0.4 | 5.5 | 2.1 | 7.4 | 2.9 | 7.5 | 6.2 | 10.0 | 3.3 | 12.2 |
| 6.7 | 0.5 | 12.5 | 2.5 | 12.5 | 3.1 | 12.5 | 3.3 | 12.6 | etc. | |
  • Despite the workspace appearing in the "Workspaces" list as 20GB, it didn't appear to take up all of that as RAM. I wonder if there's something different about how macOS handles the memory for large files?
  • Regardless, there doesn't seem to be an issue on macOS; it would appear to be specific to Linux.

cailafinn avatar May 28 '24 12:05 cailafinn

@cailafinn Are you launching with the workbench entry point, or using the mantidworkbench script?

thomashampson avatar May 28 '24 13:05 thomashampson

The workbench entry point on both OSes, always using the version from conda.

cailafinn avatar May 28 '24 13:05 cailafinn

We've noticed that the version of the Nightly from 24th May (which is the first to include the jemalloc pin) is looking much better on IDAaaS now, when started with the launcher from the Applications menu (FYI @sf1919). Tom has explained that launching the Nightly using the IDAaaS launcher sets up jemalloc, whereas the workbench entry point bypasses the script that sets it up.
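For anyone comparing the two launch methods, a quick way to check from within a running session whether jemalloc was preloaded is sketched below. It assumes the launch script works by setting LD_PRELOAD, which is my understanding of the mechanism rather than something verified here:

import os

# The mantidworkbench launch script is expected to preload jemalloc via LD_PRELOAD
# (an assumption about the mechanism); the plain "workbench" entry point does not.
print("LD_PRELOAD =", os.environ.get("LD_PRELOAD", "<not set>"))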

rbauststfc avatar May 28 '24 13:05 rbauststfc

Re-tested launching using mantidworkbench:

| Version | Start | 1st Run | Clear | 2nd Run | Clear | 3rd Run | Clear | 4th Run | Clear | 5th Run |
| ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- | ----- | ------- |
| Nightly | 0.5 | 14.7 | 12.3 | 26.2 | 23.7 | CRASH | | | | |
| Nightly (mantidworkbench) | 0.5 | 15.9 | 4.3 | 16.2 | 4.2 | 16.2 | 4.2 | 16.2 | 4.5 | 16.1 |
| 6.9.1 | 0.4 | 14.6 | 13.4 | 26.2 | 23.7 | CRASH | | | | |
| 6.7 | 0.4 | 13.8 | 12.3 | 26.2 | 23.7 | 26.2 | 23.9 | CRASH | | |

Seems to fix the issue. Do we need to alter the launch instructions for Linux? We currently recommend starting it using the workbench entry point in the Downloads docs.

cailafinn avatar May 28 '24 14:05 cailafinn

Some of my findings when loading and clearing all via the GUI (all nightly versions here are 6.9.20240524.1804):

On Ubuntu:

  • v6.9.1 is fine with and without jemalloc (mantidworkbench and workbench)
  • nightly version fills up memory and freezes with and without jemalloc

On IDAaaS:

  • nightly version via mantidworkbench (with jemalloc) appears to fill up the memory, but is actually OK because it gets released just in time
  • nightly version via workbench freezes after a few iterations and eventually crashes
  • 6.9.1 seems to be fine in both cases. With jemalloc it seems to give the memory back much quicker than the nightly does, maybe because we downgraded jemalloc from 5.3.0 to 5.2.0.

thomashampson avatar May 28 '24 15:05 thomashampson

With the further changes to IDAaaS last week how are things looking @thomashampson ?

sf1919 avatar Jun 12 '24 09:06 sf1919