librascal icon indicating copy to clipboard operation
librascal copied to clipboard

Memory corruption related to march=native

Open Luthaf opened this issue 3 years ago • 3 comments

The use of -march=native by default is not a good idea: https://github.com/cosmo-epfl/librascal/blob/1ae879cc1b1b970fc691cdba9b80cd3fcf4fadee/CMakeLists.txt#L94-L100

It can corrupt memory when using librascal as an external project. For example, compiling and running examples/spherical_invariants_example.cc externally with

clang++ -std=c++14 -O3 spherical_invariants_example.cc -o spherical_invariants_example -I../src -I/usr/local/include/eigen3 -I../build/external/wigxjpf/inc ../build/src/librascal.a ../build/external/wigxjpf/lib/libwigxjpf.a
./spherical_invariants_example ../reference_data/inputs/molecular_crystal.json

Results in

spherical_invariants_example(79937,0x1146bc5c0) malloc: *** error for object 0x7ff1ba702200: pointer being freed was not allocated
spherical_invariants_example(79937,0x1146bc5c0) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      ./spherical_invariants_example ../reference_data/inputs/molecular_crystal.jso

Adding -march=native to the above command line fixes the issue and everything works.

The origin of the problem is that some code in librascal.a/librascal.so contains code assuming one instruction set but the main binary assumes a separate instruction set. Different instructions set can have different ABI, and this results in caller/calle code not agreeing on where arguments should be passed. This is partially exacerbated by the fact that librascal is mostly a header library, meaning the end user will compile most of the code themselves, meaning they MUST match the exact compiler flags used to build librascal.


Use of -march=native by default can also be a problem on some HPC cluster, where the head and compute nodes use different instruction sets (e.g. Broadwell vs Haswell for Intel), leading to binaries that run on the head node and crash on the compute nodes.


Overall, I think we should remove this default of -march=native, and evaluate the performance difference from doing so. If there is a large performance difference, then we can add suggestions in the documentation that people use if when building librascal & their own code.

Luthaf avatar Apr 08 '21 10:04 Luthaf

I am not quite sure how to label this issue but it certainly is not a bug, merely something that might hinder portability. I would keep this flag by default unless it is clear that it does not hit performances too much, especially on CPUs that have avx-512 instructions.

Use of -march=native by default can also be a problem on some HPC cluster, where the head and compute nodes use different instruction sets (e.g. Broadwell vs Haswell for Intel), leading to binaries that run on the head node and crash on the compute nodes.

I would recommend any HPC user to compile on the target architecture in any case. So one might want to compile using the debug queue for instance.

felixmusil avatar Apr 08 '21 12:04 felixmusil

I fully disagree. There is more here than portability, i.e. possible memory corruption bugs as demonstrated above.

Also this creates a very bad user experience, where the code breaks for relatively obscure and hard to debug reasons. The memory corruption ending in the pointer being freed was not allocated error above is impossible to track through gdb/lldb (which are already advanced tools for scientific programmers). If we want people from outside COSMO to be able to use librascal, we need to be serious about addressing pain points for them.

I would keep this flag by default unless it is clear that it does not hit performances too much, especially on CPUs that have avx-512 instructions.

From small tests I ran locally, there is not much of the a change in SOAP power spectrum performance; but I agree this would need better benchmarks. Although we should keep in mind that AVX-512 is not a magic wand that makes everything faster: there is a high cost from switching from SSE2 code to AVX-512 code (lot of register to save/restore); and Intel CPU tends to throttle when running AVX-512 code since it outputs a lot of heat.

Luthaf avatar Apr 08 '21 13:04 Luthaf

I fully disagree. There is more here than portability, i.e. possible memory corruption bugs as demonstrated above.

Also this creates a very bad user experience, where the code breaks for relatively obscure and hard to debug reasons. The memory corruption ending in the pointer being freed was not allocated error above is impossible to track through gdb/lldb (which are already advanced tools for scientific programmers). If we want people from outside COSMO to be able to use librascal, we need to be serious about addressing pain points for them.

Exactly, and that's why this is a bug (added the shiny new "compilation" label to make it clearer what type of bug this is). It's a potentially catastrophic failure caused by compilation with the default settings - if we want anyone to seriously consider using librascal in their projects in the future, we need to treat such issues seriously.

(Also removed the "low-priority" label -- it's reserved for issues that aren't causing big problems or holding up PRs. But as you can see from #329, it is holding up a pretty important PR, so it's pretty high priority)

max-veit avatar Apr 08 '21 20:04 max-veit