ROCm-OpenCL-Driver

clBuildProgram segv

Open pszi1ard opened this issue 6 years ago • 33 comments

The following change, which only refactors the GROMACS OpenCL kernels, causes the OpenCL compiler to crash: https://gerrit.gromacs.org/#/c/7810/19/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel_utils.clh

The culprit has been isolated to lines 675-677 of the linked change: the local memory stores that were moved from the caller into the reduction function in question. If these three lines are commented out, compilation succeeds.
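For readers without the Gerrit link handy, the pattern in question looks roughly like the following hedged sketch (invented identifiers and sizes, not the actual GROMACS kernel source): a reduction helper into which the caller's local-memory stores were moved.

```c
/* Hypothetical OpenCL sketch of the pattern described above; the names
 * (f_buf, CL_SIZE, FBUF_STRIDE, ...) are illustrative, not the real code. */
void reduce_force_generic(__local float *f_buf, float3 f,
                          int tidxi, int tidxj,
                          __global float *fout, int aidx)
{
    /* These three local-memory stores previously lived in the caller;
     * moving stores like these into the reduction helper is what
     * exposed the compiler crash. */
    f_buf[                  tidxj * CL_SIZE + tidxi] = f.x;
    f_buf[    FBUF_STRIDE + tidxj * CL_SIZE + tidxi] = f.y;
    f_buf[2 * FBUF_STRIDE + tidxj * CL_SIZE + tidxi] = f.z;

    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... per-component tree reduction over f_buf follows ... */
}
```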

pszi1ard avatar Sep 13 '18 18:09 pszi1ard

I am asking the team to look into this.

G

gstoner avatar Sep 13 '18 18:09 gstoner

Where can I get the exact kernel source, dependencies, and compiler options needed to reproduce the problem?

b-sumner avatar Sep 13 '18 18:09 b-sumner

The source code is here: https://gerrit.gromacs.org/changes/7810/revisions/f199e29cc958c00bd1481e710d9abdd0d36ae0f9/archive?format=tbz2

Warning: the review site serves a tar.bz2 which is a tarbomb (no root directory).

Extract, and from the build directory run:

cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON && \
make mdrun-test && \
bin/mdrun-test

The mdrun-test unit test will segv as soon as it hits the compilation of the kernel(s) in src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel.clh.

pszi1ard avatar Sep 13 '18 23:09 pszi1ard

As the code in question is about to pass review and be merged, which will prevent me from testing with ROCm, I'd be thankful if you could suggest an easy workaround that I can use until the compiler issue is fixed.

pszi1ard avatar Sep 13 '18 23:09 pszi1ard

I used ccmake to point precisely at the OpenCL_INCLUDE_DIR and OpenCL_LIBRARY that I want to use, and it tells me "OpenCL is not supported. OpenCL version 1.2 or newer is required."

Is it true that the INCLUDE_DIR I point to should contain a directory named CL containing cl.h... and that LIBRARY should be a file named libOpenCL.so? If so, what else does it want?

b-sumner avatar Sep 14 '18 00:09 b-sumner

Is it true that the INCLUDE_DIR I point to should contain a directory named CL containing cl.h... and that LIBRARY should be a file named libOpenCL.so?

That should be enough. It seems that to get rid of the "sticky" error you need to start over with a clean cache (pass -DOpenCL_INCLUDE_DIR and -DOpenCL_LIBRARY to cmake).
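Putting that together, a clean reconfigure would look roughly like this (the ROCm paths below are typical install locations, assumed for illustration; adjust to your setup):

```shell
# Start from a clean build tree so CMake's cached OpenCL detection results
# are discarded, and point FindOpenCL at the ROCm headers/library explicitly.
rm -rf build && mkdir build && cd build
cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON \
      -DOpenCL_INCLUDE_DIR=/opt/rocm/opencl/include \
      -DOpenCL_LIBRARY=/opt/rocm/opencl/lib/x86_64/libOpenCL.so
```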

pszi1ard avatar Sep 14 '18 09:09 pszi1ard

Thanks. I am able to build and run.

What kind of device are you seeing the problem on? I am running using a debug build of the tip compiler on gfx803 and it looks like it's going to pass.

What version of ROCm are you running?

b-sumner avatar Sep 14 '18 15:09 b-sumner

... [ RUN ] MdrunCanWrite/NptTrajectories.WithDifferentPcoupl/2

NOTE 1 [file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp, line 13]: /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp did not specify a value for the .mdp option "cutoff-scheme". Probably it was first intended for use with GROMACS before 4.6. In 4.6, the Verlet scheme was introduced, but the group scheme was still the default. The default is now the Verlet scheme, so you will observe different behaviour.

Setting the LD random seed to 965332988
Generated 279 of the 1225 non-bonded parameter combinations
Excluding 2 bonded neighbours molecule type 'Methanol'
Excluding 2 bonded neighbours molecule type 'SOL'
Removing all charge groups because cutoff-scheme=Verlet
Number of degrees of freedom in T-Coupling group System is 12.00
Determining Verlet buffer for a tolerance of 0.005 kJ/mol/ps at 298 K
Calculated rlist for 1x1 atom pair-list as 1.025 nm, buffer size 0.025 nm
Set rlist, assuming 4x4 atom pair-list, to 1.022 nm, buffer size 0.022 nm
Note that mdrun will redetermine rlist based on the actual pair-list setup

NOTE 2 [file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp]: You are using a plain Coulomb cut-off, which might produce artifacts. You might want to consider using PME electrostatics.

This run will generate roughly 0 Mb of data

There were 2 notes

Reading file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2.tpr, VERSION 2019-dev (single precision)
Changing nstlist from 10 to 100, rlist from 1.022 to 1.373

Using 1 MPI thread
Using 1 OpenMP thread

1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0

NOTE: Thread affinity was not set.

starting mdrun 'spc-and-methanol'
2 steps, 0.0 ps.

Writing final coordinates.

               Core t (s)   Wall t (s)        (%)
       Time:        0.068        0.068      100.0
                 (ns/day)    (hour/ns)
Performance:        3.786        6.339

[       OK ] MdrunCanWrite/NptTrajectories.WithDifferentPcoupl/2 (28402 ms)
[----------] 3 tests from MdrunCanWrite/NptTrajectories (85248 ms total)

[----------] Global test environment tear-down
[==========] 27 tests from 11 test cases ran. (817537 ms total)
[  PASSED  ] 27 tests.

YOU HAVE 45 DISABLED TESTS

b-sumner avatar Sep 14 '18 15:09 b-sumner

What kind of device are you seeing the problem on?

gfx803 and gfx900.

I am running using a debug build of the tip compiler on gfx803 and it looks like it's going to pass.

OK, but I'm not sure what that tells us. Wouldn't it still be useful to know whether you can reproduce the issue? Is there a 1.8 patch release planned? Otherwise, what's the ETA for 1.9?

What versiion of ROCm are you running?

$ dpkg -l | grep "rocm-"
ii  rocm-clang-ocl                         0.3.0-7997136                              amd64        OpenCL compilation with clang compiler.
ii  rocm-opencl                            1.2.0-2018082755                           amd64        OpenCL/ROCm
ii  rocm-opencl-dev                        1.2.0-2018082755                           amd64        OpenCL/ROCm
ii  rocm-smi                               1.0.0-46-g81ef66f                          amd64        System Management Interface for ROCm
ii  rocm-utils                             1.8.199                                    amd64        Radeon Open Compute (ROCm) Runtime software stack

pszi1ard avatar Sep 14 '18 15:09 pszi1ard

https://github.com/RadeonOpenCompute/ROCm/issues/404#issuecomment-421399027 says 1.9 will be releasing very soon. Since the problem is not showing up with the tip compiler, your issue was fixed sometime after 1.8 was released. Hopefully the fix was picked up in 1.9.

b-sumner avatar Sep 14 '18 16:09 b-sumner

OK, looking forward to seeing the 1.9 not crash, but admittedly I'd be more relieved if somebody confirmed that the release branch is in fact fixed.

(Unrelated, but I'm hoping that 1.8 debs won't get pulled so I can down- and upgrade freely.)

pszi1ard avatar Sep 14 '18 16:09 pszi1ard

OK, looking forward to seeing the 1.9 not crash, but admittedly I'd be more relieved if somebody confirmed that the release branch is in fact fixed.

Though if 1.9 is indeed dropping today, it won't be a long wait.

pszi1ard avatar Sep 14 '18 16:09 pszi1ard

After updating the toolchain to ROCm 1.9, I am still getting a clBuildProgram() segfault, so unfortunately this seems to have fallen through the cracks.

How long until the next patch release?

pszi1ard avatar Sep 17 '18 18:09 pszi1ard

FWIW, I don't have a spare machine where I can fully install 1.9, but I pointed my LD_LIBRARY_PATH at a release build of the 1.9 OpenCL, HSA, and thunk shared objects and mdrun-test passed for me on gfx803. It says "1 GPU auto-selected for this run." so I assume it is running on the GPU.

b-sumner avatar Sep 18 '18 15:09 b-sumner

I still see it fail on my ROCm 1.9 system with a Vega (gfx900) and Fiji (gfx803) installed. OpenCL driver version 2679.0, so I believe this is a full 1.9 install.

I'm currently unable to build a debug version of the OpenCL runtime to get symbols, or I'd point at where the issue is coming up for me.

jlgreathouse avatar Sep 18 '18 17:09 jlgreathouse

Do you know if it is trying to build programs for both devices? Maybe the build is faulting when trying to build for Vega?

b-sumner avatar Sep 18 '18 19:09 b-sumner

Just putting my commands down here so I don't need to arrow-up every time I want to run this test. :)

mkdir -p ~/gromacs_test/
cd ~/gromacs_test/
wget https://gerrit.gromacs.org/changes/7810/revisions/f199e29cc958c00bd1481e710d9abdd0d36ae0f9/archive?format=tbz2
mv archive\?format\=tbz2 gromacs.tar.bz2
tar -xf gromacs.tar.bz2
SOURCE_DIR=$(pwd)
mkdir build
cd build
cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DGMX_BUILD_OWN_FFTW=ON
make -j `nproc` mdrun-test
cd bin
./mdrun-test

Re-tested on ROCm 1.9 on a system with only Polaris 10 (gfx803):

$ rocm_agent_enumerator
gfx000
gfx803
$ dkms status
amdgpu, 1.9-211, 4.15.0-34-generic, x86_64: installed
$ clinfo | grep Driver
  Driver version:                                2679.0 (HSA1.1,LC)
./mdrun-test
...
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0

NOTE: Thread affinity was not set.
Segmentation fault (core dumped)

gdb backtrace (no symbols in libamdocl64.so at the moment)

Thread 1 "mdrun-test" received signal SIGSEGV, Segmentation fault.
0x00007fffeead21ed in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
(gdb) bt
#0  0x00007fffeead21ed in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#1  0x00007fffeead4b1a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#2  0x00007fffeebbb1cd in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#3  0x00007fffeead3f35 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#4  0x00007fffeead61f8 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#5  0x00007fffeead8a1b in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#6  0x00007fffeead9096 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#7  0x00007fffeec7b227 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#8  0x00007fffef5a506a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9  0x00007fffef2a29df in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007fffef5a5a7e in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007fffed4fa465 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#12 0x00007fffed4fc94d in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#13 0x00007fffed4f11ac in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#14 0x00007fffed7f7f3e in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#15 0x00007fffed8051dd in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#16 0x00007fffed4e07da in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#17 0x00007fffed4e9522 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#18 0x00007fffed4e9938 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#19 0x00007fffed3b5b0d in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#20 0x00007fffed3e0051 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#21 0x00007fffed3b3ccf in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#22 0x00007fffed3a5b3e in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#23 0x00007fffed37bcb9 in clBuildProgram () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#24 0x00007ffff6fb5714 in gmx::ocl::compileProgram(_IO_FILE*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, _cl_context*, _cl_device_id*, ocl_vendor_id_t) () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#25 0x00007ffff6f5f2bb in nbnxn_gpu_compile_kernels(gmx_nbnxn_ocl_t*) () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#26 0x00007ffff6f5cb8a in nbnxn_gpu_init(gmx_nbnxn_ocl_t**, gmx_device_info_t const*, interaction_const_t const*, NbnxnListParameters const*, nbnxn_atomdata_t const*, int, bool) ()
   from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#27 0x00007ffff6e9695c in init_forcerec(_IO_FILE*, gmx::MDLogger const&, t_forcerec*, t_fcdata*, t_inputrec const*, gmx_mtop_t const*, t_commrec const*, float (*) [3], char const*, char const*, gmx::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, gmx_hw_info_t const&, gmx_device_info_t const*, bool, float) () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#28 0x00007ffff6feade7 in gmx::Mdrunner::mdrunner() () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#29 0x00005555555cb9e8 in gmx::Mdrunner::mainFunction(int, char**) ()
#30 0x00005555555cc8ef in gmx_mdrun(int, char**) ()
#31 0x00005555555be6ff in gmx::test::SimulationRunner::callMdrun(gmx::test::CommandLine const&) ()
#32 0x0000555555586a03 in gmx::test::ImdTest_ImdCanRun_Test::TestBody() ()
#33 0x000055555564bdba in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#34 0x000055555563e761 in testing::Test::Run() [clone .part.533] ()
#35 0x000055555563f1f5 in testing::TestInfo::Run() [clone .part.534] ()
#36 0x000055555563f535 in testing::TestCase::Run() [clone .part.535] ()
#37 0x0000555555641d35 in testing::internal::UnitTestImpl::RunAllTests() [clone .part.549] ()
#38 0x0000555555642192 in testing::UnitTest::Run() ()
#39 0x0000555555573089 in main ()

jlgreathouse avatar Sep 18 '18 19:09 jlgreathouse

Sigh. I really do not like rpath. My build of libgromacs.so.4.0.0 has an rpath pointing to the directory containing my internal opencl bits. I see the segv now after hitting it with "chrpath -d".

b-sumner avatar Sep 18 '18 21:09 b-sumner

So the good news and bad news for @pszi1ard:

  • Bad news: This bug isn't fixed in the public OpenCL release as of ROCm 1.9.
  • Good news: as our confusion may have demonstrated, we definitely have this bug fixed internally. So all hope is not lost. :)

I don't think we will be able to give you a solid timeline for when this will make it into an external release, as we are still working out which patches may enter into any 1.9.x bugfix point release.

jlgreathouse avatar Sep 19 '18 01:09 jlgreathouse

Thanks for the feedback @b-sumner and @jlgreathouse.

I appreciate that the planning for 1.9.x is ongoing. Is there a (reasonably straightforward!) way for me to build and replace the rocm-opencl packages so I can keep using ROCm?

Do you happen to have a suggestion for a non-invasive transformation on the code to tickle the compiler and avoid the segv?

Otherwise, the side-effect of putting aside any testing/dev with ROCm right now is that no GROMACS development and testing can be done on ROCm for the upcoming 2019 release (change freeze in about a month) until fixes land. This poses the risk that, at worst, our next release will not work at all on ROCm and we'll have to keep warning users against using it, or, at best, it will ship with untested/un-tuned code if a fix only arrives before the final release later this year.

Side-note: @b-sumner Indeed, gmx (and libgromacs.so) are produced as relocatable binaries with

  RPATH                $ORIGIN/../lib

but not much more. You can avoid this by passing -DCMAKE_SKIP_RPATH.

pszi1ard avatar Sep 20 '18 14:09 pszi1ard

The 1.9 compiler was branched on June 21. The failure being hit is in a "register coalescer" which was subsequently updated after the branch. I'm not really sure how to perturb the code to affect something that deep. One possibility might be to reduce each component separately instead of all 3 at once.

b-sumner avatar Sep 20 '18 17:09 b-sumner

One possibility might be to reduce each component separately instead of all 3 at once.

Thanks for the tip. Unfortunately it did not work. The strange thing is that if I completely remove the 2nd, conditional atomic reduction (also a concurrent reduction of three values), the crash is gone. However, if I issue both the in-loop and outside-of-loop atomic ops sequentially, I still get the crash. Any other tips? :)
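To make the attempted perturbation concrete, it was along these lines (a hypothetical OpenCL fragment with invented identifiers; atomicAdd_g_f stands in for an assumed float-atomic helper, since OpenCL 1.2 has no native float atomic add):

```c
/* Original form: all three force components reduced back-to-back. */
atomicAdd_g_f(&fout[aidx * 3 + 0], f.x);
atomicAdd_g_f(&fout[aidx * 3 + 1], f.y);
atomicAdd_g_f(&fout[aidx * 3 + 2], f.z);

/* Suggested perturbation: reduce one component per loop iteration,
 * hoping to change the register pressure the coalescer sees. */
float fcomp[3] = {f.x, f.y, f.z};
for (int d = 0; d < 3; d++)
{
    atomicAdd_g_f(&fout[aidx * 3 + d], fcomp[d]);
}
```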

On a different note: Is it a reasonably straightforward thing (and workable idea) to build a rocm-opencl deb package from source and replace the current one? Where would I start -- is the bug fixed in the 1.9 release branch?

pszi1ard avatar Sep 24 '18 23:09 pszi1ard

Hi @pszi1ard

IMO, it's pretty easy to build a custom install of the OpenCL runtime. See my post in this other issue, which includes a shell script that will do basically everything for you. You might have to play around with it to build a .deb package, e.g. to properly cpack the results. You might also want to change the build type from RelWithDebInfo to Release.

You could pull the pre- and post-install files out of the existing ROCm OpenCL .deb file if you want to make things slightly easier.

jlgreathouse avatar Sep 24 '18 23:09 jlgreathouse

I'll have to defer to @b-sumner about whether this bug is fixed in the 1.9.0 source code release branch of the open source OpenCL runtime. I believe our original intent was that the source code release would match the source we used to build the .deb releases.

That said, I see that the roc-1.9.x branch of LLVM that the OpenCL build directions pull from had some patches regarding register allocation brought into it last week. I don't know whether they were meant to fix the issue being raised here.

jlgreathouse avatar Sep 24 '18 23:09 jlgreathouse

It seems like this missed 1.9.1 too, right? Any plans to release a new rocm-opencl?

pszi1ard avatar Nov 06 '18 12:11 pszi1ard

There likely won't be a major update to rocm-opencl until ROCm 2.0.

jlgreathouse avatar Nov 06 '18 17:11 jlgreathouse

That's very unfortunate. Is there an approximate ETA for ROCm 2.0? (I know it has just been announced, but announcement != release, and we need to advise our users whether or not to stay clear of ROCm.)

pszi1ard avatar Nov 06 '18 17:11 pszi1ard

I apologize for the long delay on this. I believe the bugfix is part of a larger series of changes in the compiler. We didn't want to bring new functionality into a 1.9.x point release, but it also would have been difficult to cherry pick this individual fix back into the 1.9.x code base.

I believe that our target is for 2.0 to be out by the end of the year, but I'm not sure whether AMD has made a public announcement of an official exact date.

jlgreathouse avatar Nov 06 '18 19:11 jlgreathouse

@jlgreathouse OK, I understand that the cost of backporting a fix is too high to address this issue.

End of the year would be great, I hope I can get a confirmation soon.

pszi1ard avatar Nov 06 '18 22:11 pszi1ard

We are getting close to our final release and at the moment 1.9 is still not working. I've seen some links to 2.0 beta rpms floating around on the Internet, but I'm not aware of a deb repo. Has your internal testing covered my bug report? Do you have 2.0 beta debs available?

With the latest 1.9.x, in addition to the previous 100% reproducible segv, I also see a clBuildProgram segv in ~0.1-0.2% of compilations, mostly (only?) when building clFFT. Do you know of such issues?

pszi1ard avatar Dec 12 '18 23:12 pszi1ard