Mantid Crashes when trying to use Instrument view OpenGL 3D view on Rocky IDAaaS
**Describe the bug** Mantid crashes when trying to use the Instrument View OpenGL 3D view on Rocky IDAaaS.
**To Reproduce**
- Create an IDAaaS Rocky 8 instance with no GPU: https://dev.analysis.stfc.ac.uk/
- Load some data (I used MAR11060).
- Open Instrument View.
- Under Display Settings, select `Use OpenGL`.
- Using the top drop-down, change from `Cylindrical Y` to `Full 3D`.
- Observe crash.
**Expected behavior** Mantid should not crash.
**Platform/Version (please complete the following information):** IDAaaS Rocky 8
I think this issue is related: https://github.com/mantidproject/mantid/issues/36486
@SilkeSchomann has this now been fixed? I know #36486 has been closed.
@sf1919 On a `MAPS > Excitations Powder` instance I can still reproduce it, but not on a `POLREF > Reflectometry Rocky 8 testing` instance. Both seem to use Rocky Linux 8.9 (Green Obsidian), though.
I will update the milestone.
Update: If I disable loading `libjemalloc`, then it seems to work in all IDAaaS workspaces for me. E.g. in workspaces where it crashes (not all workspaces cause a crash, but "Excitations Powder" definitely does), running:

```sh
/opt/mambaforge/bin/conda run --prefix /opt/mantidworkbench6.9 /opt/mantidworkbench6.9/bin/python /opt/mantidworkbench6.9/bin/workbench
```

in a terminal opens up a Mantid where I can view the 3D instrument view without crashing.
But if I set the `LD_PRELOAD` environment variable as the `mantidworkbench` script does, e.g. run:

```sh
LD_PRELOAD=/opt/mantidworkbench6.9/lib/libjemalloc.so.2 /opt/mambaforge/bin/conda run --prefix /opt/mantidworkbench6.9 /opt/mantidworkbench6.9/bin/python /opt/mantidworkbench6.9/bin/workbench
```

then Mantid crashes when I open the instrument view with the "Use OpenGL in Instrument View" option checked.
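(Side note: to double-check which case a given session is in, you can verify whether `libjemalloc` actually ended up mapped into the running process; a small sketch, assuming a single workbench process:)

```sh
# Check whether libjemalloc is mapped into the running workbench process.
# Assumes exactly one matching process; adjust the pgrep pattern otherwise.
pid=$(pgrep -f 'bin/workbench' | head -n 1)
if grep -q jemalloc "/proc/${pid}/maps"; then
    echo "jemalloc is loaded into PID ${pid}"
else
    echo "jemalloc is NOT loaded into PID ${pid}"
fi
```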
Looking through the history, we use `jemalloc` on Linux in order to better free memory when it is no longer needed (PR #29232) and (maybe) to better avoid memory leaks. However, in the code the version is pinned to 5.2.0 from 2019, whilst there is a newer release, 5.3.0, from 2022. The pin is because of hangs on older Linux (Ubuntu 18.04 and RedHat 7 according to the code). We don't use these versions at ISIS anymore, but the analysis cluster at SNS still uses RedHat 7.9.
Perhaps as a test, we could bump the `jemalloc` version to 5.3.0 to see if it still causes the OpenGL crash?
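(If anyone wants to try this locally before a proper build, one rough approach, a sketch with illustrative paths that assumes the conda-forge `jemalloc` package provides `libjemalloc.so.2`, is to preload a 5.3.0 build from a scratch environment:)

```sh
# Rough local test: preload jemalloc 5.3.0 from a throwaway env instead of the bundled 5.2.0.
# Paths are illustrative; adjust the prefix to wherever conda creates the env.
conda create -y -p /tmp/jemalloc530 -c conda-forge 'jemalloc=5.3.0'
LD_PRELOAD=/tmp/jemalloc530/lib/libjemalloc.so.2 \
    /opt/mambaforge/bin/conda run --prefix /opt/mantidworkbench6.9 \
    /opt/mantidworkbench6.9/bin/python /opt/mantidworkbench6.9/bin/workbench
```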
@mducle I've set off a build for `lib_jemalloc=5.3.0`: https://builds.mantidproject.org/job/build_packages_from_branch/803/
If it builds, it will be uploaded to our conda channel under the label `jemalloc_update` so we can test it easily on IDAaaS.
You can now run `conda install -c mantid/label/jemalloc_update mantidworkbench` to test. I've tried it very briefly on `ALF > Excitations Powder`: loaded some ALF data, opened instrument view, turned OpenGL on and switched to Full 3D with no crash. Obviously we need more testing though.
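(For a cleaner test it may be worth installing into a fresh environment rather than an existing one; a sketch, with an arbitrary environment name:)

```sh
# Install the relabelled package into a throwaway environment for testing.
conda create -n jemalloc-update-test -c mantid/label/jemalloc_update mantidworkbench
conda activate jemalloc-update-test
mantidworkbench
```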
@jhaigh0 Do you think it's worth testing that it still behaves on a GPU instance as well (e.g. `WISH > Single-crystal (GPU)`)? If so, I can give it a go.
@RichardWaiteSTFC Yes that's a good idea
It would also be good to see if it helps with issue #37370
Works OK on a WISH GPU session (as expected); followed the instructions here, in #37370 and #37252.
Does this behavior still happen when you use the wrapper scripts that are installed? They are in `..../bin/mantidworkbench`. They are bash and currently do the following (in order):

- `LD_PRELOAD` `libjemalloc`
- detect ThinLinc and launch with `vglrun` (or local variant)
- add `${CONDA_PREFIX}/bin:${CONDA_PREFIX}/lib:${CONDA_PREFIX}/plugins` to the `PYTHON_PATH` <- probably unnecessary for installs
- have an option to start through `gdb` (useful for debugging user issues)
- start (with a bunch of environment variables set): `python ${CONDA_PREFIX}/bin/workbench $@`
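For reference, a stripped-down sketch of what such a launcher amounts to (illustrative only; the real script has more branching and error handling):

```sh
#!/bin/bash
# Illustrative skeleton of the installed launcher, not the real script.
LOCAL_PRELOAD=${CONDA_PREFIX}/lib/libjemalloc.so.2
if [ -n "${LD_PRELOAD}" ]; then
    LOCAL_PRELOAD=${LOCAL_PRELOAD}:${LD_PRELOAD}
fi

# Under a ThinLinc session, route OpenGL through VirtualGL if it is available.
VGL=""
if [ -n "${TLSESSIONDATA}" ] && command -v vglrun >/dev/null 2>&1; then
    VGL=vglrun
fi

LD_PRELOAD=${LOCAL_PRELOAD} ${VGL} python "${CONDA_PREFIX}/bin/workbench" "$@"
```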
There is an additional issue hiding in here. Currently we get a segfault when starting mantidworkbench on rhel9 (fine on rhel7). Commenting out these lines from the launch script:

```sh
# Define parameters for jemalloc
LOCAL_PRELOAD=/opt/anaconda/envs/mantid-dev/lib/libjemalloc.so.2
if [ -n "${LD_PRELOAD}" ]; then
    LOCAL_PRELOAD=${LOCAL_PRELOAD}:${LD_PRELOAD}
fi
```

fixes it. Both rhel7 and rhel9 are using the same version of libjemalloc (v5.3.0, build hcb278e6_0).
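(If the preload turns out to be harmful only on some platforms, a possible low-risk mitigation, hypothetical and not currently in the script, would be an opt-out switch around those lines:)

```sh
# Hypothetical opt-out: skip the jemalloc preload when MANTID_NO_JEMALLOC is set.
if [ -z "${MANTID_NO_JEMALLOC}" ]; then
    LOCAL_PRELOAD=/opt/anaconda/envs/mantid-dev/lib/libjemalloc.so.2
    if [ -n "${LD_PRELOAD}" ]; then
        LOCAL_PRELOAD=${LOCAL_PRELOAD}:${LD_PRELOAD}
    fi
    export LD_PRELOAD=${LOCAL_PRELOAD}
fi
```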
With a little more investigation, it appears to be an issue using `/usr/lib64/ld-linux-x86-64.so.2`, which is outside of the conda tree. On rhel, `ld` is provided by `glibc`, which is v2.17 on rhel7 and v2.34 on rhel9.
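(Two standard checks for which loader and glibc a binary resolves against:)

```sh
# Show the dynamic loader baked into the ELF header of the conda python binary
# (path is illustrative; point it at whichever binary is being launched).
readelf -l /opt/mantidworkbench6.9/bin/python | grep interpreter
# The glibc loader reports its own version when executed directly.
/usr/lib64/ld-linux-x86-64.so.2 --version
```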
Also note that `conda info` says (truncated) on rhel7:

```
virtual packages : __archspec=1=skylake_avx512
                   __cuda=12.2=0
                   __glibc=2.17=0
                   __linux=3.10.0=0
                   __unix=0=0
```

and on rhel9:

```
virtual packages : __archspec=1=cascadelake
                   __cuda=12.4=0
                   __glibc=2.34=0
                   __linux=5.14.0=0
                   __unix=0=0
```
Using the new package with `libjemalloc=5.2.0` still causes a crash when using InstrumentView with 3D enabled on IDAaaS virtual machines running Rocky 8 without a physical graphics card (i.e. using software OpenGL). Running:

```sh
LD_PRELOAD=/envs/mantje32/lib/libjemalloc.so.2 python -m workbench.app.main --single-process
```

under `gdb` gives the following stack trace:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007fff8d931809 in execute_list () from /usr/lib64/dri/swrast_dri.so
#2 0x00007fff8d981e39 in _mesa_CallList () from /usr/lib64/dri/swrast_dri.so
#3 0x00007fff75f83391 in MantidQt::MantidWidgets::InstrumentRenderer::renderInstrument(std::vector<bool, std::allocator<bool> > const&, bool, bool) ()
from /envs/mantje32/plugins/qt5/../../lib/libMantidQtWidgetsInstrumentViewQt5.so
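(For anyone reproducing this, the backtrace can be captured non-interactively with standard gdb options:)

```sh
# Run under gdb in batch mode, preloading jemalloc only into the inferior,
# and print a backtrace when the process stops (e.g. on the SIGSEGV).
gdb -batch \
    -ex 'set environment LD_PRELOAD /envs/mantje32/lib/libjemalloc.so.2' \
    -ex run -ex bt \
    --args python -m workbench.app.main --single-process
```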
In addition, the following OpenGL errors were reported just before the crash:

```
OpenGL-[Error] OpenGL error detected in GLDisplay::setRenderingOptions(): invalid operation
OpenGL-[Error] OpenGL error detected in GL3DWidget::draw3D()[begin]: invalid operation
```
I'm not sure if these are red herrings though. I'm trying to hunt through the Mesa code but it's looking pretty complicated...
Closed by #37437.