tests.unit.python.execution_tree.eval test fails on POWER8/Clang
The Release build call stack is massive (318 functions deep) and the test fails this way:
[khuck@centaur phylanx-Release]$ gdb --args /usr/local/packages/python3/3.6.3/bin/python3 "/home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tests/unit/python/execution_tree/eval.py"
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /storage/packages/python3/3.6.3/bin/python3.6...done.
(gdb) run
Starting program: /usr/local/packages/python3/3.6.3/bin/python3 /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tests/unit/python/execution_tree/eval.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Detaching after fork from child process 148026.
Detaching after fork from child process 148031.
[New Thread 0x3fffa815e990 (LWP 148036)]
[New Thread 0x3fffa762e990 (LWP 148037)]
[New Thread 0x3fffa5d6e990 (LWP 148038)]
[New Thread 0x3fffa555e990 (LWP 148039)]
[New Thread 0x3fffa4d4e990 (LWP 148040)]
[New Thread 0x3fff8fffe990 (LWP 148041)]
[New Thread 0x3fff8f7ee990 (LWP 148042)]
[New Thread 0x3fff8efde990 (LWP 148043)]
[New Thread 0x3fff8e7ce990 (LWP 148044)]
[New Thread 0x3fff8dfbe990 (LWP 148045)]
[Thread 0x3fffa4d4e990 (LWP 148040) exited]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fff8dfbe990 (LWP 148045)]
0x00003fffb7fca7ac in _dl_update_slotinfo () from /lib64/ld64.so.2
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.ppc64le elfutils-libelf-0.170-4.el7.ppc64le elfutils-libs-0.170-4.el7.ppc64le glibc-2.17-222.el7.ppc64le keyutils-libs-1.5.8-3.el7.ppc64le krb5-libs-1.15.1-19.el7.ppc64le libattr-2.4.46-13.el7.ppc64le libcap-2.22-9.el7.ppc64le libcom_err-1.42.9-12.el7_5.ppc64le libffi-3.0.13-18.el7.ppc64le libicu-50.1.2-15.el7.ppc64le libselinux-2.5-12.el7.ppc64le libselinux-2.5-6.el7.ppc64le openssl-libs-1.0.2k-12.el7.ppc64le pcre-8.32-17.el7.ppc64le systemd-libs-219-57.el7.ppc64le xz-libs-5.2.2-1.el7.ppc64le zlib-1.2.7-17.el7.ppc64le
(gdb) bt
#0 0x00003fffb7fca7ac in _dl_update_slotinfo () from /lib64/ld64.so.2
#1 0x00003fffb7fb16f0 in update_get_addr () from /lib64/ld64.so.2
#2 0x00003fffaeee2350 in hpx::threads::coroutines::detail::coroutine_self::get_self() ()
from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#3 0x00003fffaf03ae18 in hpx::threads::get_self_ptr() ()
from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#4 0x00003fffb02561e8 in hpx::util::annotate_function::annotate_function(char const*) ()
from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#5 0x00003fffb02546f4 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#6 0x00003fffb021e270 in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#7 0x00003fffb00b1fe0 in hpx::actions::basic_action_impl<hpx::lcos::future<phylanx::execution_tree::primitive_argument_type> (phylanx::execution_tree::primitives::primitive_component::*)(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const, hpx::lcos::future<phylanx::execution_tree::primitive_argument_type> (phylanx::execution_tree::primitives::primitive_component::*)(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const, &(phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const), phylanx::execution_tree::primitives::primitive_component::eval_action>::invoke_helper<hpx::lcos::future<phylanx::execution_tree::primitive_argument_type>, std::vector<phylanx::---Type <return> to continue, or q <return> to quit---
The Debug build fails in a different location, but with an equally massive call stack (in the ~436 range). In the Debug build, it appears a boost "unused" type is passed as an attribute/context somewhere deep in boost:
(gdb) bt
#0 0x00003fffaf64dd44 in boost::fusion::vector<>::vector() (this=0x3fff96eb0180)
at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/fusion/container/vector/vector.hpp:288
#1 0x00003fffaf64da64 in boost::spirit::context<boost::fusion::cons<boost::spirit::unused_type&, boost::fusion::nil_>, boost::fusion::vector<> >::context(boost::spirit::unused_type&) (this=0x3fff96eb0160,
attribute=...)
at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/support/context.hpp:101
#2 0x00003fffaf65a498 in boost::spirit::qi::rule<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type>::parse<boost::spirit::unused_type const, boost::spirit::unused_type, boost::spirit::unused_type const> (this=0x3fff96ebfe70, first=110 'n',
last=0 '\000', skipper=..., attr_param=...)
at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/qi/nonterminal/rule.hpp:298
#3 0x00003fffaf65a3cc in boost::spirit::qi::reference<boost::spirit::qi::rule<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type> const>::parse<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type const, boost::spirit::unused_type, boost::spirit::unused_type const> (this=0x3fff96ebfc38, first=110 'n', last=0 '\000', context=..., skipper=..., attr_=...)
at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/qi/reference.hpp:43
#4 0x00003fffaf653b24 in boost::spirit::qi::skip_over<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::qi::reference<boost::spirit::qi::rule<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type> const> > (first=110 'n', last=0 '\000', skipper=...)
at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/qi/skip_over.hpp:27
#5 0x00003fffaf6e8778 in boost::spirit::qi::lexeme_directive<boost::spirit::qi::sequence<boost::fusion::cons<---Type <return> to continue, or q <return> to quit---
The Release and Debug errors seem to be unrelated. While the release error comes out of an actual action invocation, the Debug error happens in Spirit during parsing (presumably a PhySL expression). Both actually could be stack overflows :/
I think I eliminated the stack overflow issue by doubling the stack size (changing ulimit -s) and getting the same crash, in the same location.
@khuck I don't think ulimit -s has any bearing on the stack size HPX uses for its threads. I wouldn't rule out a stack overflow for this problem.
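For reference, a minimal Python sketch of what ulimit -s actually controls: the RLIMIT_STACK resource limit of the process. The helper names stack_limit_bytes and describe are made up for this illustration. As noted above, HPX threads run on their own stacks, so raising this limit would not be expected to change how much stack an HPX thread gets.

```python
import resource


def stack_limit_bytes():
    """Return the (soft, hard) RLIMIT_STACK of this process, in bytes.

    This is the limit that `ulimit -s` changes (the shell reports it in KiB).
    """
    return resource.getrlimit(resource.RLIMIT_STACK)


def describe(limit):
    """Format a limit value for printing (hypothetical helper)."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} KiB"


soft, hard = stack_limit_bytes()
print("process stack soft limit:", describe(soft))
print("process stack hard limit:", describe(hard))
```

Getting the same crash at the same location after doubling this limit is consistent with the overflow, if there is one, happening on an HPX thread stack rather than on the main process stack.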
OK, trying with HPX_WITH_STACKOVERFLOW_DETECTION_DEFAULT=On
Running with HPX_WITH_STACKOVERFLOW_DETECTION_DEFAULT=On didn't change anything - it still crashed in roughly the same location, but slightly different:
#0 0x00003fffaf14cdd0 in hpx::threads::thread_data::set_description(hpx::util::thread_description) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#1 0x00003fffaf149d4c in hpx::threads::set_thread_description(hpx::threads::thread_id_type const&, hpx::util::thread_description const&, hpx::error_code&) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#2 0x00003fffb027ce40 in hpx::util::annotate_function::annotate_function(char const*) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#3 0x00003fffb027b294 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#4 0x00003fffb0244a90 in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
Could it be that there is an operation that is just missing an annotation, or is getting mis-annotated in some way?
After compiling with Clang 6.0 on an x86_64 machine, I think I confirmed it's a POWER8-specific problem. Is there something specific about this particular primitive that does something unusual?
@hkaiser - another clue... as you pointed out, the crash is in:
#367 0x00003fffb0609664 in phylanx::bindings::expression_evaluator(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, phylanx::bindings::compiler_state&, pybind11::args)::{lambda()#1}::operator()() const (this=0x3fffffffcd88)
at /home/users/khuck/src/phylanx/python/src/bindings/binding_helpers.hpp:181
...but the expression it is parsing is not that crazy:
181 auto xexpr = phylanx::ast::generate_ast(xexpr_str);
(gdb) print xexpr_str
$1 = "\nblock(\n define(fib,n,\n if(n<2,n,\n fib(n-1)+fib(n-2))),\n fib)"
except for the fact that it is a recursive definition.
Also, this didn't crash when I built it on an x86_64 machine (HPX and Phylanx were built with Clang 5.0) that used the ubuntu boost package (built by GCC, I assume). Whereas the machine that is crashing was using a boost built by clang 5.0.
Also, I built the test that crashed with -fstack-protector-all -fstack-protector-strong and didn't see any difference.
@hkaiser The stack stuff might be a red herring. I have another clue. I tried running a RelWithDebInfo build. It crashes, but in a different way. 2 steps up the stack, the program is in the "eval" method of the primitive_component base class. When I dereference the "this" pointer, I get this back:
#2 0x00003fffb0259fa0 in phylanx::execution_tree::primitives::primitive_component::eval (this=0x10fc05a0,
params=..., mode=<optimized out>)
at /home/users/khuck/src/phylanx/src/execution_tree/primitives/primitive_component.cpp:123
123 return primitive_->do_eval(params, mode);
(gdb) print this
$10 = (const phylanx::execution_tree::primitives::primitive_component *) 0x10fc05a0
(gdb) print *this
$11 = {<hpx::components::component_base<phylanx::execution_tree::primitives::primitive_component>> = {<hpx::components::detail::base_component> = {<hpx::traits::detail::component_tag> = {<No data fields>}, gid_ = {
static credit_base_mask = 31, static credit_shift = 24, static credit_mask = 520093696,
static was_split_mask = 2147483648, static has_credits_mask = 1073741824,
static is_locked_mask = 536870912, static locality_id_mask = 18446744069414584320,
static locality_id_shift = 32, static virtual_memory_mask = 4194303,
static dont_cache_mask = 8388608, static is_migratable = 4194304, static dynamically_assigned = 1,
static component_type_base_mask = 1048575, static component_type_shift = 1,
static component_type_mask = 2097150, static credit_bits_mask = 3741319168,
static internal_bits_mask = 4290772992, static special_bits_mask = 18446744073707454462,
id_msb_ = 4294967376, id_lsb_ = 284951968}}, <No data fields>}, primitive_ = warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<phylanx::execution_tree::primitives::access_function, std::allocator<phylanx::execution_tree::primitives::access_function>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<phylanx::execution_tree::primitives::access_function, std::allocator<phylanx::execution_tree::primitives::access_function>, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr (count 1, weak 0) 0x10fb77e0}
...which seems OK, except for the RTTI warning. Then, stepping down the stack things get interesting:
(gdb) down
#1 0x00003fffb028adb4 in phylanx::execution_tree::primitives::primitive_component_base::do_eval (
this=0x10fb77e0, params=std::vector of length 1, capacity 1 = {...},
mode=(phylanx::execution_tree::eval_dont_wrap_functions | phylanx::execution_tree::eval_dont_evaluate_partials | phylanx::execution_tree::eval_dont_evaluate_lambdas))
at /home/users/khuck/src/phylanx/src/execution_tree/primitives/primitive_component_base.cpp:89
89 auto f = this->eval(params, mode);
...which also seems OK. But then, taking one more step into the concrete instance of the object:
(gdb) down
#0 0x00003fffb009d230 in phylanx::execution_tree::primitives::access_function::eval (this=0x0,
params=std::vector of length 1, capacity 1 = {...},
mode=(phylanx::execution_tree::eval_dont_wrap_functions | phylanx::execution_tree::eval_dont_evaluate_partials | phylanx::execution_tree::eval_dont_evaluate_lambdas))
at /home/users/khuck/src/phylanx/src/execution_tree/primitives/access_function.cpp:57
57 {
...you'll notice the "this" pointer is null! So for some reason, this object is either corrupted, or...? Is something missing from the implementation of phylanx::execution_tree::primitives::access_function so that it isn't getting handled like the other primitives?
@hkaiser any thoughts on the above? I thought maybe it was because access_function didn't inherit from public std::enable_shared_from_this<access_function> like the other primitives. Could that be the case?
@khuck: I don't think this causes the issue we're seeing. All primitives are kept alive by a shared_ptr in any case, most of them however additionally need to stay alive for 'delayed' operation (requiring the enable_shared_from_this), access_variable is not one of those, iirc.
@hkaiser ok. I started playing with the code in eval.py, and it's crashing on the definition of fib10 (compressed here):
fib10 = et.eval(" block( define(fib,n, if(n<2,n, fib(n-1)+fib(n-2))), fib) ", cs, 10)
BUT if I change it to fib9, it works:
fib9 = et.eval(" block( define(fib,n, if(n<2,n, fib(n-1)+fib(n-2))), fib) ", cs, 9)
...and the same is true of the fib() function defined later, if I call it with fib(9) it's OK, but fib(10) crashes. So it is stack related, but it's the stack of the AST that is the problem. Reminder, this is Clang 5.0 on POWER8, so different beast than GCC on x86_64.
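To put numbers on the fib(9) versus fib(10) observation, here is a small pure-Python model of the same naive recursion. It is only a sketch: it does not run PhySL or HPX, and fib_profile is a name invented for this illustration. It counts the total number of calls and the deepest chain of nested calls.

```python
def fib_profile(n):
    """Return (value, total_calls, max_depth) for the naive recursive fib."""
    calls = 0
    max_depth = 0

    def fib(k, depth):
        nonlocal calls, max_depth
        calls += 1
        max_depth = max(max_depth, depth)
        if k < 2:
            return k
        # Same shape as the PhySL definition: fib(n-1) + fib(n-2)
        return fib(k - 1, depth + 1) + fib(k - 2, depth + 1)

    return fib(n, 1), calls, max_depth


for n in (9, 10):
    value, calls, depth = fib_profile(n)
    print(f"fib({n}) = {value}: {calls} calls, max recursion depth {depth}")
```

Going from 9 to 10 adds only one level of logical nesting. If each PhySL recursion level costs a roughly fixed number of native frames (an assumption, though the backtraces above are hundreds of frames deep), that one extra level could be what pushes the stack past its limit on this platform.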
Yup, stack related. This issue will stay open, but a work-around for that platform has been committed. See pull request #601
This PR enables stack overflow prevention in HPX on Power platforms: https://github.com/STEllAR-GROUP/hpx/pull/3469. Please verify.
@khuck I believe the calculation of the remaining amount of stack space in my original patch was wrong. Could you try again, please?
@hkaiser nope, same error. I have asked for someone to send me the instructions for getting an account on our system if you want to test it yourself...
@khuck can you try a build with address sanitizer? It is usually very accurate at pinpointing issues like this.
@sithhell I did. I ran into so many linker issues I couldn't figure out how to fix them. I tried with valgrind, but after 3-4 hours building a suppression file, I was no closer to the cause of the problem.
@khuck for the linker errors, configure your HPX build with -DHPX_WITH_SANITIZERS=On. This should solve most of them.
@sithhell IIRC, building HPX wasn't the problem, but building Phylanx was. The address sanitizer library was supposed to be first in the link order, but it wasn't. Besides, I built Clang 5.0 myself for this machine, and it's possible I didn't configure/build the sanitizer libraries correctly.
@sithhell @khuck it would be great to have a docker image with the address sanitizer enabled and working correctly.
@sithhell yes it would - are you volunteering? :)
OK, I attempted to make a Phylanx docker image that uses sanitize. I fail at the Phylanx link step. Here's the Dockerfile I attempted to use
https://gist.github.com/stevenrbrandt/56cc36a9c9cb0375ae264c398d0e3431
Setting -lasan in CMAKE_EXE_LINKER for Phylanx seems to do nothing.
However, setting -lasan in CMAKE_CXX_FLAGS allows Phylanx to link, though it produces a bunch of spurious warnings about using a link flag while not linking.
Regardless, however, I can't run bin/physl because I get this error:
build]# bin/physl --doc
==15==Your application is linked against incompatible ASan runtimes.
Not sure how that comes about, since I only have the default Clang / libasan installed.
@sithhell Any idea what I'm doing wrong?
Address Sanitizer is really temperamental sometimes… Instead of adding -lasan, can you add the specific library? i.e. /path/to/compiler/lib/libasan.so instead of -lasan to make sure you get the right one.
@khuck I've discovered the -shared-libasan flag. I'm experimenting with that.
@khuck @sithhell Current Dockerfile: https://gist.github.com/stevenrbrandt/27e1d4eb5fd86a4b57697567c3964697
Ok, this uses -shared-libasan and everything compiles, but when I try to run Phylanx Hello World, I get this:
==27==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==27==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==27==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==27==See https://github.com/google/sanitizers/issues/856 for possible workarounds.
==27==Process memory map follows:
0x000000400000-0x0000007be000 /usr/bin/python3.6
0x0000009bd000-0x0000009be000 /usr/bin/python3.6
0x0000009be000-0x000000a5b000 /usr/bin/python3.6
0x000000a5b000-0x000000a8f000
Not sure what to do at this point.
@stevenrbrandt just curious - are you using the system allocator or tcmalloc/jemalloc?
@stevenrbrandt also, what happens if you run an example without python involved? like lra_csv or something like that?
@khuck I'm using the System Allocator, see the docker file I linked.
You can't even run "physl --doc" without problems:
# bin/physl --doc
terminate called after throwing an instance of 'std::runtime_error'
what(): Cannot instantiate more than one affinity data instance
Aborted
So, a small success (I think). The problem seems to have partly been the 80 core cluster I built it on...
Running on a smaller machine, I get this. You can try out stevenrbrandt/phylanx.sanitized from Docker yourself.
# ./bin/physl --doc
=================================================================
==27==ERROR: AddressSanitizer: odr-violation (0x7fb8b739c940):
[1] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
[2] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
These globals were registered at these points:
[1]:
#0 0x7fb8cd3385c8 (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
#1 0x7fb8b46582dd in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_solversd.so+0x39b2dd)
[2]:
#0 0x7fb8cd3385c8 (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
#1 0x7fb8b6e14f7d in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_arithmeticsd.so+0x2503f7d)
==27==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'hpx::util::detail::global_fixture' at /hpx/src/util/lightweight_test.cpp:56:13
==27==ABORTING
Did you try export ASAN_OPTIONS=detect_odr_violation=0 before running?
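If it is easier to script than to export by hand, here is a hedged Python sketch that merges detect_odr_violation=0 into whatever ASAN_OPTIONS is already set and then launches the binary. The helper names asan_env and run_physl_doc are invented for this example, and the ./bin/physl path is taken from the commands above; adjust it to your build tree.

```python
import os
import subprocess


def asan_env(extra_options, base_env=None):
    """Return a copy of the environment with extra ASAN_OPTIONS appended.

    ASan accepts several options in ASAN_OPTIONS separated by colons,
    so existing settings are kept and the new ones are added at the end.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("ASAN_OPTIONS", "")
    parts = [p for p in existing.split(":") if p] + list(extra_options)
    env["ASAN_OPTIONS"] = ":".join(parts)
    return env


def run_physl_doc(binary="./bin/physl"):
    """Run `physl --doc` with ODR-violation detection disabled (assumed path)."""
    env = asan_env(["detect_odr_violation=0"])
    return subprocess.run([binary, "--doc"], env=env)


# Show the value that would be passed to the child process.
print(asan_env(["detect_odr_violation=0"])["ASAN_OPTIONS"])
```

Calling run_physl_doc() from the build directory should then behave like running export ASAN_OPTIONS=detect_odr_violation=0 followed by bin/physl --doc.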