compute-runtime icon indicating copy to clipboard operation
compute-runtime copied to clipboard

neo_page_fault_manager_tests randomly fails

Open stef opened this issue 4 years ago • 22 comments

still trying to package this up for alpine linux, as reported also in #277 (after fixing it) i have neo_page_fault_manager_tests randomly failing, sometimes it succeeds, other times it just fails, see this sample execution:

while ./neo_page_fault_manager_tests ; do : ; done
set timeout to: 45   
Iteration: 1         
Iteration: 1. random_seed: 0
                                                   
=====================
==   ULTs PASSED   ==
=====================
Tests run:      16  
Tests passed:   16   
Tests skipped:  0 
Tests failed:   0
Tests disabled: 0           
 Time elapsed:  0 ms
=====================                                                                                                                                                                                        
set timeout to: 45                        
Iteration: 1   
Iteration: 1. random_seed: 0
                                                                                                      
=====================                                                                                                                                                                                        
==   ULTs PASSED   ==
=====================                                                                                                                                                                                        
Tests run:      16
Tests passed:   16
Tests skipped:  0           
Tests failed:   0
Tests disabled: 0                                                                                                                                                                                            
 Time elapsed:  1 ms                      
=====================
set timeout to: 45
Iteration: 1                                                                                          
Iteration: 1. random_seed: 0                                                                                                                                                                                 
                                                   
./compute-runtime-20.08.15750/core/unit_tests/page_fault_manager/linux/cpu_page_fault_manager_linux_tests.cpp:23: Failure                                       
Value of: pageFaultManager->handlerInvoked
  Actual: false
Expected: true              
[  FAILED  ][ SKL ][ 0 ] PageFaultManagerLinuxTest.whenPageFaultIsRaisedThenHandlerIsInvoked

i tried to reproduce this in gdb to figure out what happens, however these tests never fail in gdb, making this a challenging heisenbug.

do you dear people have any idea why this does not consistently fails or succeeds but exhibits this random behaviour? and if not what could i try to debug this?

stef avatar Mar 22 '20 16:03 stef

This is a really simple test which registers custom signal handler using sigaction, then raises SIGSEGV and checks if our handler is being called.

LukaszJobczyk avatar Mar 23 '20 07:03 LukaszJobczyk

yeah, that's what i figured too, i thought maybe you do some random magic deeper in your mock/unit testing framework.

stef avatar Mar 23 '20 11:03 stef

@stef does this tests starts to fail after some of your local changes OR it fails randomly on driver without any modifications ?

MichalMrozek avatar Mar 23 '20 11:03 MichalMrozek

without my patches i cannot compile on alpine linux, so yeah i have these patches applied: https://github.com/aports-ugly/aports/tree/master/ugly/compute-runtime

stef avatar Mar 23 '20 11:03 stef

i forgot to also note the modification that is mentioned in #277 supporting mmap without hugetlb, but i suspect that might be unrelated here.

stef avatar Mar 23 '20 11:03 stef

The test that fails is actually quite simple

TEST_F(PageFaultManagerLinuxTest, whenPageFaultIsRaisedThenHandlerIsInvoked) { auto pageFaultManager = std::make_unique<MockPageFaultManagerLinux>(); EXPECT_FALSE(pageFaultManager->handlerInvoked); std::raise(SIGSEGV); EXPECT_TRUE(pageFaultManager->handlerInvoked); }

If it fails it means that handler was not invoked. Can you try to debug this with some other techniques to see why it happens? (like using printfs)

We should call in here on std::raise bool verifyPageFault(void *ptr) override { handlerInvoked = true; if (allowCPUMemoryAccessOnPageFault) { this->allowCPUMemoryAccess(ptr, size); } return returnStatus; }

MichalMrozek avatar Mar 23 '20 11:03 MichalMrozek

yeah, i also tried to DISABLE_ prefix that testname and see if the other tests succeed consistently, but no. i'll try some printf tracing.

stef avatar Mar 23 '20 12:03 stef

so i put trace output in the constructor, destructor and PageFaultManagerLinux::pageFaultHandlerWrapper(). when everything goes right it gives this output:

set timeout to: 45
Iteration: 1
Iteration: 1. random_seed:
installed sighandler: 0
called sighandler
uninstalled sighandler
installed sighandler: 0
called sighandler
uninstalled sighandler
installed sighandler: 0
called sighandler
uninstalled sighandler
installed sighandler: 0
called sighandler
uninstalled sighandler
installed sighandler: 0
uninstalled sighandler
installed sighandler: 0
uninstalled sighandler

=====================
==   ULTs PASSED   ==
=====================
Tests run:      16
Tests passed:   16
Tests skipped:  0
Tests failed:   0
Tests disabled: 0
 Time elapsed:  0 ms
=====================

and when it fails it gives this output:

set timeout to: 45    
Iteration: 1           
Iteration: 1. random_seed: 0
installed sighandler: 0
                                                   
/home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/core/unit_tests/page_fault_manager/linux/cpu_page_fault_manager_linux_tests.cpp:23: Failure
Value of: pageFaultManager->handlerInvoked
  Actual: false        
Expected: true   
uninstalled sighandler
[  FAILED  ][ SKL ][ 0 ] PageFaultManagerLinuxTest.whenPageFaultIsRaisedThenHandlerIsInvoked
installed sighandler: 0

stef avatar Mar 23 '20 17:03 stef

Is this a sporadic problem ? Can you check if handler is properly intialized here:

PageFaultManagerLinux::PageFaultManagerLinux() { pageFaultHandler = [&](int signal, siginfo_t *info, void *context) { if (!this->verifyPageFault(info->si_addr)) { callPreviousHandler(signal, info, context); } };

struct sigaction pageFaultManagerHandler = {};
pageFaultManagerHandler.sa_flags = SA_SIGINFO;
pageFaultManagerHandler.sa_sigaction = pageFaultHandlerWrapper;
auto retVal = sigaction(SIGSEGV, &pageFaultManagerHandler, &previousHandler);
UNRECOVERABLE_IF(retVal != 0);

}

Can you put printf in lambda to see if it is properly called when exception is thrown ?

MichalMrozek avatar Mar 27 '20 13:03 MichalMrozek

Is this a sporadic problem ?

yeah it's non-deterministic. how non-deterministic? i just ran this:

fail=0; for a in $(seq 1000); do ./neo_page_fault_manager_tests || fail=$((fail+1)) ; done ; echo "fail" $fail

and it gave me this result: 483/1000

currently i have these fprintfs in the code:

PageFaultManagerLinux::PageFaultManagerLinux() {
    pageFaultHandler = [&](int signal, siginfo_t *info, void *context) {
        fprintf(stderr,"in lambda before prev handler\n");
        if (!this->verifyPageFault(info->si_addr)) {
            callPreviousHandler(signal, info, context);
        }
        fprintf(stderr,"in lambda after prev handler\n");
    };

    struct sigaction pageFaultManagerHandler = {};
    pageFaultManagerHandler.sa_flags = SA_SIGINFO;
    pageFaultManagerHandler.sa_sigaction = pageFaultHandlerWrapper;
    auto retVal = sigaction(SIGSEGV, &pageFaultManagerHandler, &previousHandler);
    fprintf(stderr, "installed sighandler: %d\n", retVal);
    UNRECOVERABLE_IF(retVal != 0);
}

PageFaultManagerLinux::~PageFaultManagerLinux() {
    if (!previousHandlerRestored) {
        auto retVal = sigaction(SIGSEGV, &previousHandler, nullptr);
        UNRECOVERABLE_IF(retVal != 0);
    }
    fprintf(stderr, "uninstalled sighandler\n");
}

void PageFaultManagerLinux::pageFaultHandlerWrapper(int signal, siginfo_t *info, void *context) {
    fprintf(stderr, "called sighandler\n");
    pageFaultHandler(signal, info, context);
}

when it fails this is the output:

set timeout to: 45
Iteration: 1
Iteration: 1. random_seed: 0
installed sighandler: 0

/home/s/tasks/aports/ugly/compute-runtime/src/compute-runtime-20.08.15750/core/unit_tests/page_fault_manager/linux/cpu_page_fault_manager_linux_tests.cpp:23: Failure
Value of: pageFaultManager->handlerInvoked
  Actual: false
Expected: true
uninstalled sighandler
[  FAILED  ][ SKL ][ 0 ] PageFaultManagerLinuxTest.whenPageFaultIsRaisedThenHandlerIsInvoked
installed sighandler: 0
zsh: segmentation fault (core dumped)  ./neo_page_fault_manager_tests

since it says: installed sighandler: 0 it means that sigaction returns without error.

stef avatar Mar 27 '20 14:03 stef

for reference: this is the test code that fails:

TEST(PageFaultManagerLinuxTest, whenPageFaultIsRaisedThenHandlerIsInvoked) {                                                                                                                                 
    auto pageFaultManager = std::make_unique<MockPageFaultManagerLinux>();                                                                                                                                   
    EXPECT_FALSE(pageFaultManager->handlerInvoked);                                                                                                                                                          
    std::raise(SIGSEGV);                                                                                                                                                                                     
    EXPECT_TRUE(pageFaultManager->handlerInvoked);                                                                                                                                                           
}                                                  

source: https://github.com/intel/compute-runtime/blob/03ce6681a052a6009ee10649617395a1e83138e5/shared/test/unit_test/page_fault_manager/linux/cpu_page_fault_manager_linux_tests.cpp#L26

it fails in the last line where it expects the handlerInvoked.

but if i disable this, then some later unit test fails in this same file.

stef avatar Mar 27 '20 14:03 stef

looking at strace log of a failing process:

32044 rt_sigaction(SIGSEGV, {sa_handler=0x558f235d5b20, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fb87357328f}, {sa_handler=0x558f23252bc0, sa_mask=~[KILL STOP RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fb87357328f}, 8) = 0
32044 writev(2, [{iov_base="installed sighandler: 0x7ffdc143"..., iov_len=39}, {iov_base=NULL, iov_len=0}], 2) = 39
32044 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [TRAP ABRT BUS USR1 SEGV PIPE ALRM TERM STKFLT CONT URG XFSZ SYS RTMIN RT_1 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], 8) = 0
32044 tkill(32044, SIGSEGV)             = 0
32044 rt_sigprocmask(SIG_SETMASK, [TRAP ABRT BUS USR1 SEGV PIPE ALRM TERM STKFLT CONT URG XFSZ SYS RTMIN RT_1 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], NULL, 8) = 0
32044 writev(1, [{iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 2) = 1
32044 writev(1, [{iov_base="", iov_len=0}, {iov_base=..."comput"..., iov_len=225}], 2) = 225
32044 writev(1, [{iov_base="Expected: true", iov_len=14}, {iov_base="\n", iov_len=1}], 2) = 15
32044 rt_sigaction(SIGSEGV, {sa_handler=0x558f23252bc0, sa_mask=~[KILL STOP RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fb87357328f}, NULL, 8) = 0
32044 writev(2, [{iov_base="uninstalled sighandler\n", iov_len=23}, {iov_base=NULL, iov_len=0}], 2) = 23

stef avatar Mar 27 '20 14:03 stef

for comparison here is the strace of a successful test:

365   rt_sigaction(SIGSEGV, {sa_handler=0x55ad0b126b20, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f6a6cc5228f}, {sa_handler=0x55ad0ada3bc0, sa_mask=~[KILL STOP RTMIN RT_1 RT_2], sa_flags= SA_RESTORER|SA_RESTART, sa_restorer=0x7f6a6cc5228f}, 8) = 0
365   writev(2, [{iov_base="installed sighandler: 0x7fff7d1370e0 0\n", iov_len=39}, {iov_base=NULL, iov_len=0}], 2) = 39
365   rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [BUS FPE USR1 ALRM TERM CHLD CONT TTIN XFSZ PROF WINCH IO PWR SYS RT_1 RT_2 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], 8) = 0
365   tkill(365, SIGSEGV)               = 0
365   rt_sigprocmask(SIG_SETMASK, [BUS FPE USR1 ALRM TERM CHLD CONT TTIN XFSZ PROF WINCH IO PWR SYS RT_1 RT_2 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], NULL, 8) = 0
365   --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=365, si_uid=1000} ---
365   writev(2, [{iov_base="called sighandler\n", iov_len=18}, {iov_base=NULL, iov_len=0}], 2) = 18
365   writev(2, [{iov_base="in lambda before prev handler\n", iov_len=30}, {iov_base=NULL, iov_len=0}], 2) = 30
365   writev(2, [{iov_base="in lambda after prev handler\n", iov_len=29}, {iov_base=NULL, iov_len=0}], 2) = 29
365   rt_sigreturn({mask=[BUS FPE USR1 ALRM TERM CHLD CONT TTIN XFSZ PROF WINCH IO PWR SYS RT_1 RT_2 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15]}) = 0
365   rt_sigaction(SIGSEGV, {sa_handler=0x55ad0ada3bc0, sa_mask=~[KILL STOP RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f6a6cc5228f}, NULL, 8) = 0
365   writev(2, [{iov_base="uninstalled sighandler\n", iov_len=23}, {iov_base=NULL, iov_len=0}], 2) = 23

stef avatar Mar 27 '20 14:03 stef

from comparing the two straces it seems that there is no signal delivered to the process after tkill

stef avatar Mar 27 '20 14:03 stef

the parameters to sigprocmask seem to be different around tkill.

stef avatar Mar 27 '20 20:03 stef

so the important detail from the strace logs is immediately before the quoted part, it comes after

Iteration: 1. random_seed: 0

is printed on the console, for the successful test it is this:

rt_sigprocmask(SIG_SETMASK, [BUS FPE KILL USR1 ALRM TERM CHLD CONT TTIN XFSZ PROF WINCH IO PWR SYS ...

for the failing test it is:

rt_sigprocmask(SIG_SETMASK, [TRAP ABRT BUS FPE KILL USR1 SEGV USR2 PIPE STKFLT CHLD XFSZ PROF WINCH

debugging UnitTestImpl::RunAllTests() now.

stef avatar Mar 28 '20 01:03 stef

i've been trying to find out where this rt_sigprocmask(SIG_SETMASK, [ is called from, so i plastered https://github.com/intel/compute-runtime/blob/03ce6681a052a6009ee10649617395a1e83138e5/third_party/gtest/gmock-gtest-all.cc#L3965 with trace output. it seems to come from

    internal::HandleExceptionsInMethodIfSupported(
        this, &Test::TestBody, "the test body");

what is interesting, this Test::Run() is being called 10 times before the testcase PageFaultManagerLinuxTest.whenPageFaultIsRaisedThenHandlerIsInvoked

this is how i modified Test::Run:

void Test::Run() {
  if (!HasSameFixtureClass()) return;

  fprintf(stderr,"1\n");
  internal::UnitTestImpl* const impl = internal::GetUnitTestImpl();
  fprintf(stderr,"2\n");
  impl->os_stack_trace_getter()->UponLeavingGTest();
  fprintf(stderr,"3\n");
  internal::HandleExceptionsInMethodIfSupported(this, &Test::SetUp, "SetUp()");
  fprintf(stderr,"4\n");
  // We will run the test only if SetUp() was successful and didn't call
  // GTEST_SKIP().
  if (!HasFatalFailure() && !IsSkipped()) {
    impl->os_stack_trace_getter()->UponLeavingGTest();
  fprintf(stderr,"5\n");
    internal::HandleExceptionsInMethodIfSupported(
        this, &Test::TestBody, "the test body");
  fprintf(stderr,"6\n");
  }

  // However, we want to clean up as much as possible.  Hence we will
  // always call TearDown(), even if SetUp() or the test body has
  // failed.
  fprintf(stderr,"7\n");
  impl->os_stack_trace_getter()->UponLeavingGTest();
  fprintf(stderr,"8\n");
  internal::HandleExceptionsInMethodIfSupported(
      this, &Test::TearDown, "TearDown()");
  fprintf(stderr,"9\n");
}

and this is the strace (annotated) output:

...
unlink("igdrcl.log")              = -1 ENOENT (No such file or directory)
brk(0x562ec7edb000)               = 0x562ec7edb000
writev(1, [{iov_base="Iteration: 1", iov_len=12}, {iov_base="\n", iov_len=1}], 2) = 13
writev(1, [{iov_base="Iteration: 1. random_seed: 0", iov_len=28}, {iov_base="\n", iov_len=1}], 2) = 29

here is the first interesting output from the testcase: Iteration: 1. random_seed: 0

brk(0x562ec7edc000)               = 0x562ec7edc000
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2

here is the first invocation of Test::Run()

brk(0x562ec7edc000)               = 0x562ec7edc000
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
brk(0x562ec7edd000)               = 0x562ec7edd000
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2

this is the 5th trace output, so now comes the invocation of internal::HandleExceptionsInMethodIfSupported, with a lot of rt_sigprocmask(SIG_SETMASK, calls, the last one being the one of interest:

rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd0c2c0000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd0c056000
mincore(0x7ffd4fe4ab1f, 1, 0x7ffd4fe4ab1f) = -1 EINVAL (Invalid argument)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
msync(0x7ffd4fe4b000, 4096, MS_ASYNC) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], ~[KILL STOP RTMIN RT_1 RT_2], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1 RT_2], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [TRAP ABRT BUS FPE KILL SEGV USR2 ALRM STKFLT STOP TTOU URG XCPU XFSZ VTALRM PROF WINCH SYS RT_1 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], NULL, 8) = 0

after this Test::Run() finishes, and is called 9 more times according to the trace output of Test::Run():

writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
brk(0x562ec7ede000)               = 0x562ec7ede000
brk(0x562ec7edf000)               = 0x562ec7edf000
brk(0x562ec7ef1000)               = 0x562ec7ef1000
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2

finally we get into the Test::Run() that runs PageFaultManagerLinuxTest.whenPageFaultIsRaisedThenHandlerIsInvoked:

writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
rt_sigaction(SIGSEGV, {sa_handler=0x562eb190db20, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fdd0c27528f}, {sa_handler=0x562eb158abc0, sa_mask=~[KILL STOP RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fdd0c27528f}, 8) = 0
writev(2, [{iov_base="installed sighandler: 0x7ffd4fe4bb10 0\n", iov_len=39}, {iov_base=NULL, iov_len=0}], 2) = 39
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [TRAP ABRT BUS FPE SEGV USR2 ALRM STKFLT TTOU URG XCPU XFSZ VTALRM PROF WINCH SYS RT_1 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], 8) = 0
tkill(7979, SIGSEGV)              = 0
rt_sigprocmask(SIG_SETMASK, [TRAP ABRT BUS FPE SEGV USR2 ALRM STKFLT TTOU URG XCPU XFSZ VTALRM PROF WINCH SYS RT_1 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], NULL, 8) = 0
writev(1, [{iov_base="", iov_len=0}, {iov_base="\n", iov_len=1}], 2) = 1
writev(1, [{iov_base="", iov_len=0}, {iov_base="compute-runtime-20.08.15750/core/unit_tests/page_fault_manager/linux/cpu_page_faul"..., iov_len=225}], 2) = 225
writev(1, [{iov_base="Expected: true", iov_len=14}, {iov_base="\n", iov_len=1}], 2) = 15
rt_sigaction(SIGSEGV, {sa_handler=0x562eb158abc0, sa_mask=~[KILL STOP RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fdd0c27528f}, NULL, 8) = 0
writev(2, [{iov_base="uninstalled sighandler\n", iov_len=23}, {iov_base=NULL, iov_len=0}], 2) = 23
writev(2, [{iov_base="6\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="7\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="8\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="9\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(1, [{iov_base="[  FAILED  ][ SKL ][ 0 ] PageFaultManagerLinuxTest.whenPageFaultIsRaisedThenHandlerIsInvoked", iov_len=92}, {iov_base="\n", iov_len=1}], 2) = 93
writev(2, [{iov_base="1\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="2\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="3\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="4\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
writev(2, [{iov_base="5\n", iov_len=2}, {iov_base=NULL, iov_len=0}], 2) = 2
rt_sigaction(SIGSEGV, {sa_handler=0x562eb190db20, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fdd0c27528f}, {sa_handler=0x562eb158abc0, sa_mask=~[KILL STOP RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fdd0c27528f}, 8) = 0
writev(2, [{iov_base="installed sighandler: 0x7ffd4fe4ba20 0\n", iov_len=39}, {iov_base=NULL, iov_len=0}], 2) = 39
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdd0c046000
mprotect(0x7fdd0c046000, 65536, PROT_NONE) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=7979, si_uid=1000} ---
+++ killed by SIGSEGV (core dumped) +++

and as seen there is no segfault delivered after the tkill invocation.

My understanding was, that the first test case is PageFaultManagerLinuxTest.whenPageFaultIsRaisedThenHandlerIsInvoked, now i wonder what the previous 10 calls are, and especially what the first case is that sets the sigprocmask? Also i wonder why the first call to internal::HandleExceptionsInMethodIfSupported does so much more than in the latter cases where it does nothing, and also why this function is non-deterministic?

stef avatar Mar 28 '20 22:03 stef

after some more investigating, the sigprocmask is set in https://github.com/intel/compute-runtime/blob/a1c6b0f4a7442513424dddc9b296ebe4d53b9ef7/shared/test/unit_test/page_fault_manager/cpu_page_fault_manager_tests.cpp#L58

this also explains where those 10 extra test cases are coming from, which i wondered about in the previous comment.

stef avatar Mar 28 '20 23:03 stef

it seems to boil down to this macro and its try/catch: https://github.com/intel/compute-runtime/blob/0e202aaf67e7bb765f1faeafff8bf372781ae707/third_party/gtest/gtest/gtest.h#L7750

stef avatar Mar 28 '20 23:03 stef

This macro is not used in this test so that is intresting. Does it ever fail if you only run this one test ? (gtest filter may be used to selectively only run one test)

MichalMrozek avatar Apr 01 '20 09:04 MichalMrozek

the problem is that this macro/test does this:

rt_sigprocmask(SIG_SETMASK, [TRAP ABRT BUS FPE KILL SEGV USR2 ALRM STKFLT STOP TTOU URG XCPU XFSZ VTALRM PROF WINCH SYS RT_1 RT_3 RT_4 RT_5 RT_6 RT_7 RT_8 RT_9 RT_10 RT_11 RT_12 RT_13 RT_14 RT_15], NULL, 8) = 0

the problem is it sets also SEGVand this sigprocmask is never reset later, and then in a later tests that does this:

TEST(PageFaultManagerLinuxTest, whenPageFaultIsRaisedThenHandlerIsInvoked) {                                                                                                                                 
    auto pageFaultManager = std::make_unique<MockPageFaultManagerLinux>();                                                                                                                                   
    EXPECT_FALSE(pageFaultManager->handlerInvoked);                                                                                                                                                          
    std::raise(SIGSEGV);                                                                                                                                                                                     
    EXPECT_TRUE(pageFaultManager->handlerInvoked);                                                                                                                                                           
}                                                 

the std::raise does not trigger the pagefault handler and thus the EXPECT_TRUE fails.

stef avatar Apr 01 '20 11:04 stef

but strangely this rt_sigprocmask is called half of the time with these params, and half of the time with other params, that do not set SEGV. Using my gdb i've descendet into the bowels of libstdc++ and am no looking at what happens between the throw and the catch, i have a suspicion that libunwind might be somehow involved.

stef avatar Apr 01 '20 11:04 stef

There are no plans for proactive enabling and distro adoption except for Ubuntu. We are open to accepting contributions (pull requests) that unblock support.

AdamCetnerowski avatar Nov 24 '22 08:11 AdamCetnerowski