osgearth icon indicating copy to clipboard operation
osgearth copied to clipboard

Deadlock on JobArena

Open Davidbrcz opened this issue 4 years ago • 6 comments

Hello

We recently made the upgrade from osgEarth 2.8 to 3.2, and our application (built on MSYS2, G++ 8.2, Windwos 10, 64 bits) is hanging, while it's fine on Linux.

The following code is enough to reproduce the issue:

#include <osg/DisplaySettings>
#include <osg/PositionAttitudeTransform>
#include <osg/LOD>
#include <osgDB/Registry>

#include <osgEarthDrivers/cache_filesystem/FileSystemCache>
#include <osgViewer/Viewer>
#include <osgViewer/ViewerBase>
#include <osgEarth/Capabilities>
#include <osgEarth/Registry>
#include <osgEarth/Sky>
#include <osgEarth/TerrainEngineNode>
#include <osgEarth/Registry>
#include <osgEarth/Notify>
#include <osgEarth/MapNode>
#include <osgEarth/TerrainEngineNode>
#include <osgEarth/Sky>

// This header has to be last. It defines conditionally GLintptr
// and GLsizeiptr. If included first, it clashes with the typedef
// that Qt have in qopenglext.h
#include <osgEarth/GLUtils>

#include <osg/ref_ptr>
#include <osg/Referenced>

#include <iostream>

namespace {
struct OsgLogBridge : public osg::NotifyHandler
{
	void notify(osg::NotifySeverity severity, const char* message) override
	{
	    std::string simplifiedMessage = message;
	    std::cerr << simplifiedMessage;
	}
};


class GlobeTest : public osg::Referenced
{
public:
	explicit GlobeTest();
protected:
	osg::ref_ptr<osg::Group> _globeNode;
	osg::ref_ptr<osgEarth::Map> _map;
	osg::ref_ptr<osgEarth::MapNode> _mapNode;
};


void buildGlobeOptions(
		       osgEarth::MapNode::Options& mapNodeOptions,
		       osgEarth::TerrainOptions& terrainOptions,
		       osgEarth::Util::SkyOptions& skyOptions) {

	    terrainOptions.minTileRangeFactor() = 8;

	    mapNodeOptions.overlayResolutionRatio() = 20.0f;
	    mapNodeOptions.overlayTextureSize() = osgEarth::Registry::instance()->capabilities().getMaxTextureSize() / 8;
	    mapNodeOptions.enableLighting() = false;

	    skyOptions.ambient() = 0.8f;
	    skyOptions.hours() = osgEarth::DateTime().hours();
}

GlobeTest::GlobeTest() :
	_globeNode(new osg::Group())
{
	osgEarth::MapNode::Options mapNodeOptions;
	osgEarth::TerrainOptions terrainOptions;
	osgEarth::Util::SkyOptions skyOptions;

	buildGlobeOptions(mapNodeOptions, terrainOptions, skyOptions);

	_globeNode = new osg::Group();
	osgEarth::Map::Options mapOptions;

	// create the map from the options
	_map = new osgEarth::Map(mapOptions, nullptr);
	mapNodeOptions.terrain() = terrainOptions;

	// create the mapNode from the options
	_mapNode = new osgEarth::MapNode(_map, mapNodeOptions);
	_globeNode->addChild(_mapNode);

	if (!_mapNode->open())
	{
	    std::cerr << "Error opening the map node";
	}

}


} // end namespace


int main(void) {
	// allocate pager threads based on CPU configuration.
	int cores = 6;
	std::cerr << "Number of cores:" << cores;
	osg::DisplaySettings::instance()->setNumOfDatabaseThreadsHint(osg::clampAbove(cores, 2));
	osg::DisplaySettings::instance()->setNumOfHttpDatabaseThreadsHint(osg::clampAbove(cores / 2, 1));

	OsgLogBridge* osgLogBridge = new OsgLogBridge();
	osg::setNotifyHandler(osgLogBridge);
	osgEarth::setNotifyHandler(osgLogBridge);
	osg::setNotifyLevel(osg::INFO);
	osgEarth::setNotifyLevel(osg::INFO);

	// create a globe
	GlobeTest _globe;
}

There is a single thread and the stack trace looks like this:

Details

(gdb) bt bt #0 0x00007ff8f3c6d8c4 in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll #1 0x00007ff8f17bcb70 in WaitForMultipleObjectsEx () from C:\Windows\System32\KernelBase.dll #2 0x00007ff8f17bca6e in WaitForMultipleObjects () from C:\Windows\System32\KernelBase.dll #3 0x0000000064941e5d in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll #4 0x000000006494216b in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll #5 0x00000000649421e5 in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll #6 0x00000000015b9288 in std::_V2::condition_variable_any::~condition_variable_any (this=0x284bff08, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/condition_variable:234 #7 0x000000000119597b in osgEarth::Threading::JobArena::~JobArena (this=0x284bfe50, __in_chrg=) at C:/.conan/4c5d3f/1/source_subfolder/src/osgEarth/Threading.cpp:578 #8 0x000000000147232c in __gnu_cxx::new_allocatorosgEarth::Threading::JobArena::destroyosgEarth::Threading::JobArena (this=0x284bfe50, __p=0x284bfe50) at C:/msys64/mingw64/include/c++/8.2.0/ext/new_allocator.h:140 #9 0x0000000001593570 in std::allocator_traits<std::allocatorosgEarth::Threading::JobArena >::destroyosgEarth::Threading::JobArena (__a=..., __p=0x284bfe50) at C:/msys64/mingw64/include/c++/8.2.0/bits/alloc_traits.h:487 #10 0x00000000015b7479 in std::_Sp_counted_ptr_inplace<osgEarth::Threading::JobArena, std::allocatorosgEarth::Threading::JobArena, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x284bfe40) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:552 #11 0x0000000001589a6f in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x284bfe40) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:155 #12 0x0000000001583fe7 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x284c3e70, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:706 #13 0x000000000157ab1c in std::__shared_ptr<osgEarth::Threading::JobArena, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x284c3e68, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:1145 #14 0x0000000001551358 in std::shared_ptrosgEarth::Threading::JobArena::~shared_ptr (this=0x284c3e68, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr.h:103 #15 0x00000000015c478c in std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena >::~pair (this=0x284c3e48, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/bits/stl_pair.h:198 #16 0x0000000001474aac in __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena >, true> >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena > > ( this=0x19a5b40 osgEarth::Threading::JobArena::_arenas[abi:cxx11], __p=0x284c3e48) at C:/msys64/mingw64/include/c++/8.2.0/ext/new_allocator.h:140 #17 0x0000000001595110 in std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena >, true> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena > > (__a=..., __p=0x284c3e48) at C:/msys64/mingw64/include/c++/8.2.0/bits/alloc_traits.h:487 #18 0x00000000016998e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena >, true> > >::_M_deallocate_node (this=0x19a5b40 osgEarth::Threading::JobArena::_arenas[abi:cxx11], __n=0x284c3e40) at C:/msys64/mingw64/include/c++/8.2.0/bits/hashtable_policy.h:2100 #19 0x00000000016999ff in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena >, true> > >::_M_deallocate_nodes (this=0x19a5b40 osgEarth::Threading::JobArena::_arenas[abi:cxx11], __n=0x0) at C:/msys64/mingw64/include/c++/8.2.0/bits/hashtable_policy.h:2113 #20 0x0000000001541081 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear (this=0x19a5b40 osgEarth::Threading::JobArena::_arenas[abi:cxx11]) at C:/msys64/mingw64/include/c++/8.2.0/bits/hashtable.h:2047 #21 0x000000000157e508 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::shared_ptrosgEarth::Threading::JobArena, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::shared_ptrosgEarth::Threading::JobArena > > >::clear (this=0x19a5b40 osgEarth::Threading::JobArena::_arenas[abi:cxx11]) at C:/msys64/mingw64/include/c++/8.2.0/bits/unordered_map.h:846 #22 0x0000000001195a4d in osgEarth::Threading::JobArena::shutdownAll () at C:/.conan/4c5d3f/1/source_subfolder/src/osgEarth/Threading.cpp:599 #23 0x00000000013794d4 in _execute_onexit_table (table=0x199e000 <atexit_table>) from C:\msys64\home\d.come\tmp\map3d\tests\build\05e94a83e88b84b752457d5436a46ca909bdbbb2\bin\libosgEarthd.dll #24 0x0000000000ff116e in _CRT_INIT (hDllHandle=, dwReason=, lpreserved=0x1) at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtdll.c:140 #25 0x0000000000ff1249 in __DllMainCRTStartup (hDllHandle=0xff0000, dwReason=0, lpreserved=0x1) at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtdll.c:204 #26 0x00007ff8f3be9a1d in ntdll!RtlActivateActivationContextUnsafeFast () from C:\Windows\SYSTEM32\ntdll.dll #27 0x00007ff8f3c2db91 in ntdll!LdrShutdownProcess () from C:\Windows\SYSTEM32\ntdll.dll #28 0x00007ff8f3c2da2d in ntdll!RtlExitUserProcess () from C:\Windows\SYSTEM32\ntdll.dll #29 0x00007ff8f2c0e0ab in KERNEL32!FatalExit () from C:\Windows\System32\kernel32.dll #30 0x00007ff8f39ea155 in msvcrt!_exit () from C:\Windows\System32\msvcrt.dll #31 0x00007ff8f39ea7c5 in msvcrt!_initterm_e () from C:\Windows\System32\msvcrt.dll #32 0x00000000004014b5 in __tmainCRTStartup () at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:342 #33 0x000000000040150b in mainCRTStartup () at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:223

In the logs, I see many

[Thread 79716.0x137c8 exited with code 0]
[Thread 79716.0x137c4 exited with code 0]
// ...

I dig into the threading code, and I'll put the remarks I originally put in there for completeness.

I've been looking lately at Threading and Threading.cpp and I have several concerns I would like to discuss

  • I believe there is data-race on _done. Indeed, the variable is defined here, doesn't have an atomic type, is written to here and read by several other threads here without locking.

  • Even if _done were atomic, the variable is still used in _block condition_variable predicate here . For proper synchronization, that variable should be set while holding the same locked ( the last section of this page has an explanation)

Thanks.

Davidbrcz avatar Oct 12 '21 11:10 Davidbrcz

I was not able to reproduce the deadlock on Win64 VS2019.

A couple things you might try to test your synchronization theory:

  • in stopThreads() move the statement _done = true inside the _queueMutex lock. This would effectively synchronize _done to the queue mutex per your suggestion.
  • replace the _block.wait statement to remove the predicate, so it instead reads if(_queue.empty() && !_done) _block.wait(lock);

Let me know if either approach solves the problem.

gwaldron avatar Oct 12 '21 13:10 gwaldron

Hey,

I have two commits with different approaches.

One tries to implement minimal changes that fixes some of the issues found in the original message. But it still deadlocks and I don't know why =(

Another one reworks a bit more the threading code and introduces an abstraction for a thread safe working queue. With that code, the test (and our main application) doesn't deadlock \o/. Regardless of correctness, I think this one is easier to understand as several concepts gathered into a single class and used only over a few lines.

I didn't follow through on the predicateless calls on wait as it violates some static analyzer rules we have.

Davidbrcz avatar Oct 19 '21 11:10 Davidbrcz

I've updated slightly the commit that reworks the threading more deeply to address an issue someone found

Davidbrcz avatar Oct 22 '21 14:10 Davidbrcz

Hi,

Just adding my two cents. Currently get the same issue on mingw64 MSYS2 GCC 9.1.0 with osgEarth 3.1

Just the addition of a map (single geotiff) is enough to reproduce the lock when closing. ie osg::Node* globe = osgDB::readNodeFile("myglobe.earth");

After a version upgrade, I approved David's changes

Makamitsu avatar Nov 15 '21 09:11 Makamitsu

Unfortunately, the situation is not as good as I expected.

The original commit didn't have a call to notify_all in SynchronizedStopQueue::stop. It was an oversight and should have been a bug, but it works without it, and worse; with it it deadlocks in a similar fashion to the original deadlock !

I have been trying to narrow it down, but I'm kind of stuck. The following demo program (a simple queue with some threads passing integers around) with the fixed SynchronizedStopQueue works as expected (ie, works with the fixed version, doesn't when the line // A is commented).

#include <iostream>
#include <vector>
#include <atomic>
#include <functional>
#include <mutex>
#include <condition_variable>
#include <vector>
#include <unordered_map>
#include <queue>
#include <thread>
#include <future>
#include <type_traits>
#include <queue>
#include <map>
#include <sstream>

template <class U>
std::string mystr(U u){
        std::stringstream ss;
        ss << u;
        return ss.str();
}

using Mutex =std::mutex;
using QueuedJob = int;
    class SynchronizedPriorityQueuedJob {
    public:
        ~SynchronizedPriorityQueuedJob()
        {
          std::cerr<<"SynchronizedPriorityQueuedJob destructor" <<std::endl;
        }
        template<class ...Ts>
        void emplace(Ts... args) {
          std::lock_guard<Mutex> lock(_queueMutex);
          _queue.emplace(std::forward<Ts>(args)...);
          _block.notify_one();
        }

        bool interrupt_pop(QueuedJob& next) {
          std::unique_lock<Mutex> lk(_queueMutex);
          _block.wait(lk,[this]{
                                        auto message = std::string("In predicate ") +  mystr(std::this_thread::get_id()) + " done="+std::to_string(_done)+"\n";
                std::cerr << message   << std::flush;
                return !_queue.empty() || _done;
        });
        if(_done) {
                                        auto message = std::string("done in  ") +  mystr(std::this_thread::get_id()) + "\n";
                std::cerr << message   << std::flush;
            return false;
        }
        next = std::move(_queue.top());
        _queue.pop();
        return true;
        }

        bool try_pop(QueuedJob& next) {
          std::unique_lock<Mutex> lk(_queueMutex);
          if(_queue.empty()){
              return false;
          }

          next = std::move(_queue.top());
          _queue.pop();
          return true;
        }

        bool empty() const
        {
          std::lock_guard<Mutex> lk(_queueMutex);
          return _queue.empty();
        }
        int size() const
        {
          std::lock_guard<Mutex> lk(_queueMutex);
          return _queue.size();
        }

        void stop() {
          std::lock_guard<Mutex> lk(_queueMutex);
          std::cerr << "STOP ALL THREADS" << std::endl;
          _done = true;
          _block.notify_all(); // line A
        }
        bool running() const{
        std::lock_guard<Mutex> lk(_queueMutex);
        return !_done;
        }
    private:
        std::priority_queue<int> _queue;
        mutable Mutex _queueMutex;
        std::condition_variable_any _block;
        bool _done{false};
    };

struct JobArena {
    static std::map<std::string, std::shared_ptr<JobArena>> _arenas;
    std::vector<std::thread> _threads;
        SynchronizedPriorityQueuedJob _queue;
    JobArena()
    {
        for(int i = 0; i < 10; ++i) {
            _threads.push_back(
                std::thread([this]
                    {
                            auto message = std::string("Arena starting thread ") +  mystr(std::this_thread::get_id()) + "\n";
                            std::cerr << message  << std::endl;
                            runJobs();
                            // exit thread here
                            message = std::string("Thread ") + mystr(std::this_thread::get_id()) + " end loop \n";
                            std::cerr << message << std::endl;
                    }
                    ));
        }
    }
    void runJobs(){
            while (_queue.running())
            {

                    int next;
                    if (_queue.interrupt_pop(next))
                    {
                                                    auto message = std::string("executed") + std::to_string(next) + "\n";
                            std::cerr << message  << std::endl;
                    }
            }
    }
        void stopThread(){
                _queue.stop();

                // Clear out the queue
                while(!_queue.empty())
                {
                        int job;
                        if(_queue.try_pop(job))
                        {
                                std::cerr << "cleaned" << job << std::endl;
                                std::this_thread::sleep_for(std::chrono::milliseconds(job));
                        }
                }

                // wait for them to exit
                for (unsigned i = 0; i < _threads.size(); ++i)
                {
                        if (_threads[i].joinable())
                        {
                                _threads[i].join();
                        }
                }

                _threads.clear();
        }
    ~JobArena(){
        stopThread();
    }
        static void shutdownAll()
                {
                        std::cerr << "shutdown all" << std::endl;
                        _arenas.clear();
                }
};

std::map<std::string, std::shared_ptr<JobArena>> JobArena::_arenas;
int main(int argc, char *argv[])
{
    JobArena::_arenas["foo"] = std::shared_ptr<JobArena>(new JobArena());
    std::atexit(JobArena::shutdownAll);
    std::cerr << "gonna sleep" << std::endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    std::cerr << "slept" << std::endl;
    int base = 2666;
    for(int i = base; i < base+10; ++i){
             JobArena::_arenas["foo"]->_queue.emplace(i);
    }
    std::cerr << "last main" << std::endl;
    return 0;
}

There definitely something else going on with osgEarth, but I can't tell if Makamitsu and I are hitting an MSYS bug, if osgEarth's code is still not compliant with the language (UB for instance), if it's a well defined deadlock or a mixed bag of issues

I would love to be able to modify the example above to reproduce the issue and pinpoint it, but I've been unable so far.

Davidbrcz avatar Nov 17 '21 19:11 Davidbrcz

Ok, I've dug deep on this, lost quite a bit of hair but I think I have an explanation.

I think the core issues come from the current design of osgEarth in JobArena.

JobArena (as a class) owns several instances of JobArena in the static member _arenas that get destroyed when when the shared_ptrs are no longer referenced. In JobArena destructor, there are pieces of code to do thread synchronization.

The issue is that the code is in a DLL, and it interacts badly with static objects (roughly speaking, once main is finished, threads in the DLL are killed rather abruptly by Windows before JobArena destructor run, and mayhem ensures within the thread synchronization code). This page has been very helpful understanding the issue.

The issue has hit many people over time (LDD issue with this on MSYS , LLVM ThreadPool bug, DLIB issue, SO questions 1 2, MS documentation for DLL good practices ...)

I've been able to reproduce the issue with the following example testdll.zip on MSYS (not tested on MSVC, would gladly have someone test it). I'm not sure if MSVC is immune to this or if no one has hit it by sheer luck, because the SO questions have people using MSVC.

My solution, without doing a full redesign to remove the static objects, is on top of the MR I have open,

  • Remove the atexit registration in Threading
  • to make shutdownAll public and create a small RAII helper for it, and instantiate it in main in first thing first.
struct Cleaner {
	~Cleaner(){
		osgEarth::Threading::JobArena::shutdownAll();
	}
};
int main(int argc, char** argv)
{
	Cleaner c;

I'll update the PR asap.

Davidbrcz avatar Nov 25 '21 14:11 Davidbrcz