FiberTaskingLib icon indicating copy to clipboard operation
FiberTaskingLib copied to clipboard

Improper Removal of Fibers from ftl::AtomicCounter's Waiting List

Open cwfitzgerald opened this issue 5 years ago • 9 comments

This is a tracking issue for the problems that I am encountering with #66. If I run the fibtex test suite, eventually there will be a fiber in the fiber waiting slot marked as taken after all fibers have released their fibtex locks. Seems to be a race, but it repeatable.

Note: AtomicCounter::m_freeSlots is the array with the problem.

cwfitzgerald avatar Nov 15 '18 18:11 cwfitzgerald

ee5bba4 on #70 doesn't help?

martty avatar Nov 15 '18 18:11 martty

Let me check, haven't tried it.

cwfitzgerald avatar Nov 15 '18 19:11 cwfitzgerald

Yeah doesn't seem to help. Still end up running out of fiber slots.

cwfitzgerald avatar Nov 15 '18 19:11 cwfitzgerald

This appears to be a race condition within the thread local storage of the fibers. There are a couple of race conditions TSan has caught, but some of them may not be an issue, just need to be marked as relaxed atomic. This particular race is suspicious:

WARNING: ThreadSanitizer: data race (pid=25101)
  Write of size 8 at 0x7b0400000160 by thread T1:
    #0 operator delete(void*) <null> (libtsan.so.0+0x7424d)
    #1 ftl::TaskScheduler::FiberStart(void*) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:138 (ftl-test+0x4b9282)
    #2 futex_main_task(ftl::TaskScheduler*, void*) /home/connor/Programming/fiber-tasking-library/tests/fibtex/locking_tests.cpp:136 (ftl-test+0x464c17)
    #3 ftl::TaskScheduler::ThreadStart(void*) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:63 (ftl-test+0x4b8e02)
    #4 <null> <null> (libtsan.so.0+0x2970d)

  Previous atomic write of size 1 at 0x7b0400000160 by main thread:
    #0 __tsan_atomic8_store <null> (libtsan.so.0+0x65fee)
    #1 std::__atomic_base<bool>::store(bool, std::memory_order) /usr/include/c++/8/bits/atomic_base.h:374 (ftl-test+0x4b8763)
    #2 std::atomic<bool>::store(bool, std::memory_order) /usr/include/c++/8/atomic:103 (ftl-test+0x4b8763)
    #3 ftl::TaskScheduler::CleanUpOldFiber() /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:515 (ftl-test+0x4bb154)
    #4 ftl::TaskScheduler::FiberStart(void*) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:114 (ftl-test+0x4b90e8)
    #5 make_fcontext <null> (ftl-test+0x4c0aae)
    #6 ftl::TaskScheduler::WaitForCounter(ftl::AtomicCounter*, unsigned int, bool) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:596 (ftl-test+0x4bb587)
    #7 ftl::Fibtex::lock(bool) /home/connor/Programming/fiber-tasking-library/source/../include/ftl/fibtex.h:70 (ftl-test+0x465686)
    #8 ftl::UniqueLock<ftl::Fibtex>::lock(bool) /home/connor/Programming/fiber-tasking-library/source/../include/ftl/fibtex.h:331 (ftl-test+0x465d59)
    #9 unique_lock_guard_test(ftl::TaskScheduler*, void*) /home/connor/Programming/fiber-tasking-library/tests/fibtex/locking_tests.cpp:85 (ftl-test+0x4646bb)
    #10 ftl::TaskScheduler::FiberStart(void*) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:175 (ftl-test+0x4b9545)
    #11 futex_main_task(ftl::TaskScheduler*, void*) /home/connor/Programming/fiber-tasking-library/tests/fibtex/locking_tests.cpp:136 (ftl-test+0x464c17)
    #12 ftl::TaskScheduler::WaitForCounter(ftl::AtomicCounter*, unsigned int, bool) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:596 (ftl-test+0x4bb587)
    #13 futex_main_task(ftl::TaskScheduler*, void*) /home/connor/Programming/fiber-tasking-library/tests/fibtex/locking_tests.cpp:136 (ftl-test+0x464c17)
    #14 ftl::TaskScheduler::MainFiberStart(void*) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:84 (ftl-test+0x4b8eb4)
    #15 make_fcontext <null> (ftl-test+0x4c0aae)
    #16 ftl::TaskScheduler::Run(unsigned int, void (*)(ftl::TaskScheduler*, void*), void*, unsigned int, ftl::EmptyQueueBehavior) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:320 (ftl-test+0x4ba453)
    #17 FunctionalTests_LockingTests_Test::TestBody() /home/connor/Programming/fiber-tasking-library/tests/fibtex/locking_tests.cpp:145 (ftl-test+0x464e97)
    #18 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) <null> (ftl-test+0x49f87e)
    #19 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2437 (ftl-test+0x496a85)
    #20 testing::Test::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2473 (ftl-test+0x471dcb)
    #21 testing::TestInfo::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2651 (ftl-test+0x47299f)
    #22 testing::TestCase::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2769 (ftl-test+0x4732f6)
    #23 testing::internal::UnitTestImpl::RunAllTests() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:4665 (ftl-test+0x47bd4f)
    #24 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) <null> (ftl-test+0x4a127e)
    #25 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2437 (ftl-test+0x498037)
    #26 testing::UnitTest::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:4277 (ftl-test+0x47a1e2)
    #27 RUN_ALL_TESTS() /home/connor/Programming/fiber-tasking-library/third_party/gtest/include/gtest/gtest.h:2231 (ftl-test+0x4b75be)
    #28 main /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest_main.cc:37 (ftl-test+0x4b74dc)

  Thread T1 (tid=25103, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x2bd4e)
    #1 ftl::CreateThread(unsigned int, void* (*)(void*), void*, unsigned long, unsigned long*) /home/connor/Programming/fiber-tasking-library/source/../include/ftl/thread_abstraction.h:274 (ftl-test+0x4bbbcd)
    #2 ftl::TaskScheduler::Run(unsigned int, void (*)(ftl::TaskScheduler*, void*), void*, unsigned int, ftl::EmptyQueueBehavior) /home/connor/Programming/fiber-tasking-library/source/task_scheduler.cpp:296 (ftl-test+0x4ba2db)
    #3 FunctionalTests_LockingTests_Test::TestBody() /home/connor/Programming/fiber-tasking-library/tests/fibtex/locking_tests.cpp:145 (ftl-test+0x464e97)
    #4 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) <null> (ftl-test+0x49f87e)
    #5 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2437 (ftl-test+0x496a85)
    #6 testing::Test::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2473 (ftl-test+0x471dcb)
    #7 testing::TestInfo::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2651 (ftl-test+0x47299f)
    #8 testing::TestCase::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2769 (ftl-test+0x4732f6)
    #9 testing::internal::UnitTestImpl::RunAllTests() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:4665 (ftl-test+0x47bd4f)
    #10 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) <null> (ftl-test+0x4a127e)
    #11 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:2437 (ftl-test+0x498037)
    #12 testing::UnitTest::Run() /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest.cc:4277 (ftl-test+0x47a1e2)
    #13 RUN_ALL_TESTS() /home/connor/Programming/fiber-tasking-library/third_party/gtest/include/gtest/gtest.h:2231 (ftl-test+0x4b75be)
    #14 main /home/connor/Programming/fiber-tasking-library/third_party/gtest/src/gtest_main.cc:37 (ftl-test+0x4b74dc)

SUMMARY: ThreadSanitizer: data race (/usr/lib/x86_64-linux-gnu/libtsan.so.0+0x7424d) in operator delete(void*)
==================
fish: “cmake-build-debug/tests/ftl-tes…” terminated by signal SIGSEGV (Address boundary error)

cwfitzgerald avatar Nov 15 '18 21:11 cwfitzgerald

Hmm.

Perhaps the read is being reordered with the delete

if (!iter->second->load(std::memory_order_relaxed)) {
    continue;
}

waitingFiberIndex = iter->first;
delete iter->second;
tls.ReadyFibers.erase(iter);
break;

iter->second->load(std::memory_order_seq_cst) should force prevent re-ordering. Or we can keep relaxed and just put in a memory barrier.

RichieSams avatar Nov 15 '18 22:11 RichieSams

I think we should be able to get away with just std::memory_order_acquire as that's a barrier for code from below rising above.

Note that this is with optimization off and x86 already guarantees acquire semantics on mov.

cwfitzgerald avatar Nov 16 '18 01:11 cwfitzgerald

Raising it all the way to std::memory_order_seq_cst didn't do anything, we still have the same problem.

cwfitzgerald avatar Nov 16 '18 03:11 cwfitzgerald

To better help me understand what is going on, what does that call force synchronization of that allows the delete and erase to fire?

cwfitzgerald avatar Nov 16 '18 03:11 cwfitzgerald

When AtomicCounter finds a waiting task that's ready to be resumed, we add it to that thread's ReadyFiberList in TLS: https://github.com/RichieSams/FiberTaskingLib/blob/master/source/task_scheduler.cpp#L517

However, when a thread looks through the ReadyFiberList, it can't just remove the first fiber in the list and switch to it. The fiber may still be in use.

Imagine this scenario:

  • Thread 1 calls WaitForCounter() and successfully adds its fiber to the AtomicCounter and starts switching fibers, but gets pre-empted before can actually do the switch.
  • Thread 2 decrements the AtomicCounter, and now it's "ready". It adds it to its ReadyFiberList
  • Thread 2 finishes the current task and searches it's ReadyFiberList and finds the ready fiber and tries to switch to it
  • Registers aren't set up yet for the "ready" fiber. IE. Register corruption
  • Thread 1 and 2 now try to run on the same stack

To prevent this, we have TLS.OldFiberStoredFlag. After switching fibers, all code paths will call CleanUpOldFiber() https://github.com/RichieSams/FiberTaskingLib/blob/master/source/task_scheduler.cpp#L451

This sets OldFiberStoredFlag to true. When we iterate over the ReadyFiberList, we have to check this flag to see if the Fiber is safely switched away from. If it's false, we skip the ready fiber.


When you introduce task thread pinning, a ready Task must be added to another thread's ReadyFiber list. therefore we synchronize access to the list with tls.ReadFibersLock. That's actually a typo in the name. It should be ReadyFibersLock.

RichieSams avatar Nov 16 '18 21:11 RichieSams