
Mantid Crashes when trying to use Instrument view OpenGL 3D view on Rocky IDAaaS

MialLewis opened this issue 1 year ago • 15 comments

Describe the bug Mantid crashes when trying to use the Instrument View OpenGL 3D view on Rocky IDAaaS.

To Reproduce

  1. Create an IDAaaS Rocky 8 Instance with no GPU: https://dev.analysis.stfc.ac.uk/
  2. Load some data (I used MAR11060).
  3. Open Instrument View.
  4. Under Display Settings, select Use OpenGL.
  5. Using the top drop down, change from Cylindrical Y to Full 3D.
  6. Observe crash.

Expected behavior Mantid should not crash.

Platform/Version (please complete the following information): IDAaaS Rocky8

MialLewis avatar Oct 05 '23 15:10 MialLewis

I think this issue is related: https://github.com/mantidproject/mantid/issues/36486

SilkeSchomann avatar Dec 18 '23 09:12 SilkeSchomann

@SilkeSchomann has this now been fixed? I know #36486 has been closed.

sf1919 avatar Jan 22 '24 12:01 sf1919

@sf1919 On a MAPS>Excitations Powder instance I can still reproduce it, but not on a POLREF>Reflectometry Rocky 8 testing instance. Both seem to use Rocky Linux 8.9 (Green Obsidian), though.

SilkeSchomann avatar Jan 22 '24 14:01 SilkeSchomann

I will update the milestone

sf1919 avatar Jan 22 '24 15:01 sf1919

Update: If I disable loading libjemalloc, then it seems to work in all IDAaaS workspaces for me. For example, in workspaces where it crashes (not all workspaces cause a crash, but "Excitations Powder" definitely does), running:

/opt/mambaforge/bin/conda run --prefix /opt/mantidworkbench6.9 /opt/mantidworkbench6.9/bin/python /opt/mantidworkbench6.9/bin/workbench

in a terminal opens up a Mantid where I can view the 3D instrument view without crashing.

But if I set the LD_PRELOAD environment variable as the mantidworkbench script does, e.g. run:

LD_PRELOAD=/opt/mantidworkbench6.9/lib/libjemalloc.so.2 /opt/mambaforge/bin/conda run --prefix /opt/mantidworkbench6.9 /opt/mantidworkbench6.9/bin/python /opt/mantidworkbench6.9/bin/workbench

then opening the Instrument View with the "Use OpenGL in Instrument View" option checked crashes it.
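The two launch variants above can be wrapped in a small helper so the jemalloc preload can be toggled with one flag while bisecting the crash. This is only a sketch: the paths are the IDAaaS 6.9 install quoted above, and the `USE_JEMALLOC` flag and `launch_cmd` name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the workbench launch command with or without the jemalloc
# preload, based on the two commands quoted above.
PREFIX=/opt/mantidworkbench6.9

launch_cmd() {
    # Print (rather than exec) the command line, so it can be inspected first.
    local cmd="/opt/mambaforge/bin/conda run --prefix ${PREFIX} ${PREFIX}/bin/python ${PREFIX}/bin/workbench"
    if [ "${USE_JEMALLOC:-0}" = "1" ]; then
        # Prepend the preload, matching the crashing invocation above.
        echo "LD_PRELOAD=${PREFIX}/lib/libjemalloc.so.2 ${cmd}"
    else
        echo "${cmd}"
    fi
}
```

Running the printed command with `USE_JEMALLOC=1` reproduces the crashing case; without it, the working one.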


Looking through the history, we use jemalloc on Linux in order to free memory better when it is no longer needed (PR #29232) and (maybe) to better avoid memory leaks. However, in the code the version is pinned to 5.2.0, from 2019, while a newer release, 5.3.0, has been available since 2022. The pin is because of hangs on older Linux (Ubuntu 2018 and RedHat 7 according to the code). We don't use these versions at ISIS anymore, but the analysis cluster at SNS still uses RedHat 7.9.


Perhaps as a test, we could bump the jemalloc version to 5.3.0 to see if it still causes the OpenGL crash?

mducle avatar May 13 '24 10:05 mducle

@mducle I've set off a build for lib_jemalloc=5.3.0 https://builds.mantidproject.org/job/build_packages_from_branch/803/

If it builds, it will be uploaded to our conda channel under the label jemalloc_update so we can test it easily on IDAaaS.

jhaigh0 avatar May 14 '24 13:05 jhaigh0

You can now run conda install -c mantid/label/jemalloc_update mantidworkbench to test. I've tried it very briefly on ALF>Excitations Powder: loaded some ALF data, opened Instrument View, turned OpenGL on and switched to Full 3D with no crash. Obviously we need more testing though.

jhaigh0 avatar May 14 '24 15:05 jhaigh0

@jhaigh0 Do you think it's worth testing it still behaves on a GPU instance as well (e.g. WISH > Single-crystal (GPU)) - if so I can give it a go?

RichardWaiteSTFC avatar May 15 '24 08:05 RichardWaiteSTFC

@RichardWaiteSTFC Yes that's a good idea

jhaigh0 avatar May 15 '24 12:05 jhaigh0

It would also be good to see if it helps with issue #37370

sf1919 avatar May 15 '24 15:05 sf1919

Works OK on WISH GPU session (as expected) - followed instructions here, #37370 and #37252

RichardWaiteSTFC avatar May 16 '24 08:05 RichardWaiteSTFC

Does this behavior still happen when you use the wrapper scripts that are installed? They are in ..../bin/mantidworkbench. They are bash and currently do the following (in order):

  1. LD_PRELOAD libjemalloc
  2. detect thinlinc and launch with vglrun (or local variant)
  3. add ${CONDA_PREFIX}/bin:${CONDA_PREFIX}/lib:${CONDA_PREFIX}/plugins to the PYTHON_PATH <- probably unnecessary for installs
  4. have an option to start through gdb (useful for debugging user issues)
  5. start (with a bunch of environment variables set) python ${CONDA_PREFIX}/bin/workbench $@
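Step 2 above (ThinLinc detection) can be sketched roughly as follows. This is an assumption-laden sketch, not the real wrapper: `TLSESSIONDATA` is the environment variable ThinLinc sessions export (assumed here), and `choose_launcher` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch of the ThinLinc-detection step of the wrapper described above:
# if we are inside a ThinLinc session and vglrun is available, GL apps
# should be launched through VirtualGL; otherwise launch plainly.
choose_launcher() {
    if [ -n "${TLSESSIONDATA:-}" ] && command -v vglrun >/dev/null 2>&1; then
        echo "vglrun"
    else
        echo ""
    fi
}
```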

peterfpeterson avatar May 16 '24 14:05 peterfpeterson

There is an additional issue hiding in here. Currently we get a segfault when starting mantidworkbench on rhel9 (it is fine on rhel7). Commenting out the following lines from the launch script

# Define parameters for jemalloc
LOCAL_PRELOAD=/opt/anaconda/envs/mantid-dev/lib/libjemalloc.so.2
if [ -n "${LD_PRELOAD}" ]; then
    LOCAL_PRELOAD=${LOCAL_PRELOAD}:${LD_PRELOAD}
fi

fixes it. Both rhel7 and rhel9 are using the same version of libjemalloc (v5.3.0, build hcb278e6_0).
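The preload-chaining logic in the commented-out snippet above can be isolated as a small function for clarity (the `build_preload` name is hypothetical): prepend libjemalloc to whatever LD_PRELOAD the caller already has set, keeping the caller's entries.

```shell
#!/usr/bin/env bash
# Reproduce the chaining behaviour of the snippet above: libjemalloc goes
# first, followed by any pre-existing LD_PRELOAD entries, colon-separated.
build_preload() {
    local local_preload="$1"   # path to libjemalloc.so.2
    local existing="${2:-}"    # caller's existing LD_PRELOAD, may be empty
    if [ -n "${existing}" ]; then
        echo "${local_preload}:${existing}"
    else
        echo "${local_preload}"
    fi
}
```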

With a little more investigation, it appears to be an issue using /usr/lib64/ld-linux-x86-64.so.2 which is outside of the conda tree. On rhel, ld is provided by glibc which is v2.17 on rhel7 and v2.34 on rhel9.

peterfpeterson avatar May 20 '24 12:05 peterfpeterson

Also note that conda info says (truncated) on rhel7:

       virtual packages : __archspec=1=skylake_avx512
                          __cuda=12.2=0
                          __glibc=2.17=0
                          __linux=3.10.0=0
                          __unix=0=0

and on rhel9

       virtual packages : __archspec=1=cascadelake
                          __cuda=12.4=0
                          __glibc=2.34=0
                          __linux=5.14.0=0
                          __unix=0=0
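Those two __glibc virtual-package values (2.17 vs 2.34) can be compared programmatically. A small hypothetical helper to extract the version from a conda info virtual-packages line:

```shell
#!/usr/bin/env bash
# Extract the glibc version from a "conda info" virtual-packages line
# such as "__glibc=2.34=0" (leading whitespace tolerated).
glibc_from_virtual() {
    echo "$1" | sed -nE 's/^ *__glibc=([0-9.]+)=.*/\1/p'
}
```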

peterfpeterson avatar May 20 '24 13:05 peterfpeterson

Using the new package with libjemalloc=5.2.0 still causes a crash when using InstrumentView with 3D enabled on IDAaaS virtual machines without a physical graphics card, running Rocky 8 with software OpenGL. Running:

LD_PRELOAD=/envs/mantje32/lib/libjemalloc.so.2 python -m workbench.app.main --single-process

under gdb gives the following stack trace:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007fff8d931809 in execute_list () from /usr/lib64/dri/swrast_dri.so
#2  0x00007fff8d981e39 in _mesa_CallList () from /usr/lib64/dri/swrast_dri.so
#3  0x00007fff75f83391 in MantidQt::MantidWidgets::InstrumentRenderer::renderInstrument(std::vector<bool, std::allocator<bool> > const&, bool, bool) ()
   from /envs/mantje32/plugins/qt5/../../lib/libMantidQtWidgetsInstrumentViewQt5.so

In addition, the following OpenGL errors were reported just before the crash:

OpenGL-[Error] OpenGL error detected in GLDisplay::setRenderingOptions(): invalid operation
OpenGL-[Error] OpenGL error detected in GL3DWidget::draw3D()[begin]: invalid operation

I'm not sure if these are red herrings though. I'm trying to hunt through the mesa code but it's looking pretty complicated...

mducle avatar May 22 '24 15:05 mducle

Closed by #37437.

github-actions[bot] avatar Jun 05 '24 13:06 github-actions[bot]