Trilinos Make Teuchos Memory Management Classes thread-safe

This story is to address the long-standing problem that the Teuchos Memory management Classes which use reference-counting are not thread safe.

CC: @MicheldeMessieres, @jwillenbring,

Next Action Status: See tasks ...

Tasks:

Initial development and testing for multi-thread correctness [Done]
Add configure time switch for thread safety: Define configure-time options Trilinos_ENABLE_THREAD_SAFE and Teuchos_ENABLE_THREAD_SAFE (latter is given the given the default of the former value).
Turn off for Trilinos_ENABLE_CXX11=OFF: That is, set Teuchos_ENABLE_THREAD_SAFE=OFF in this case. Run full Trilinos test suite with -DTrilinos_ENABLE_CXX11=OFF.
Update the Teuchos test suite:
- Inform CTest of number of threads for thread-safe tests: Figure this out at configure time and then set NUM_TOTAL_CORES_USED (see [TRIBITS_ADD_TEST())(https://tribits.org/doc/TribitsDevelopersGuide.html#formal-arguments-tribits-add-test))
- Make pre-push BASIC test suite fast: Make the longer running threading tests NIGHTLY.
Performance testing:
- For builds:
  - -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON (Trilinos_ENABLE_THREAD_SAFE on and off)
  - -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=OFF (Trilinos_ENABLE_THREAD_SAFE on and off)
- For compilers:
  - GCC version 4.8.x .
  - Intel version 15.x
  - Clang X
- Run Trilinos (nearly full) test suite with and without thread-safety turned on.
- Run Nalu, Albany, and Drekar test suites with thread safety on and off and see the performance impact with debug-mode checking turned on.
- Request report from Cedric about usage and performance.
- If performance okay, continue. Otherwise, decide what to do.
Disallow throwing exceptions from destructors: We just need to disallow exceptions and make Teuchos MM classes abort in destructors when errors occur. Update unit tests for the case of circular references and exceptions. Need to provide TEUCHOS_ABORT_IF(<condition>) that will print and then call abort.
Merge into develop branch with Trilinos_ENABLE_THREAD_SAFE=OFF by default:
- Update teuchos/ReleaseNotes.txt to discuss exception destructor difference.
- Announce time schedule for turning this on by default.
Update documentation / Code review:
- Update unit test documentation: With final tests in place, will create a uniformly formatted summary for each in code to describe it’s purpose.
- Update RCP documentation: Need to update RCP documents to reflect these changes
- Ross reviews code, tests, and updated documentation.
Turn on Trilinos_ENABLE_THREAD_SAFE=OFF by default:
- Update teuchos/ReleaseNotes.txt
- Send out announcement
Other considerations and improvements: (move to new stories?)
1. Review Array.h mutex implementation: This was new code I added after our last review to make Array.h thread safe - I have implemented suggested tests we discussed on Github.
2. Discuss plan for debug detection of dangling weak ptr. Debug builds have checks to validate weak ptrs but those checks can fail if another thread kills the data. I’ve got tests in place which detect and demonstrate this issue but need to discuss further how we would like to address this.
3. Consider additional changes for ArrayView, ArrayRCP, Tuple, Ptr: Implemented fairly limited sanity checks on these.
4. Weak to strong conversion: Have code in place which implements thread safe upgrade of a weak ptr to a strong ptr, along with a unit test, but the role of this is unclear at the moment.
5. Make tests have inverted case: Tests should demonstrate they can detect thread problems when the fix is not applied - the inverted case. I’ve got some #defines set up to do this but wanted to discuss how to best organize those. Many of the inverted tests will need separate main functions.

Mar 18 '16 23:03 bartlettroscoe

Developers at Tech-X have created a branch with updates to help make RCP thread-safe. It is on the branch 'atomic_rcp' in the Trilinos fork [email protected]:bddavid/Trilinos.git. The file RCP_Summary_02012016.pdf describes the changes and gives some timing studies.

@bddavid, I will pull this branch and then start looking it over and running my own tests. In particular, I need to create some unit tests in the Teuchos package that show the various threading use cases. I will put these in my branch also. I will create a pull-request to allow for a more detailed code review.

Mar 19 '16 02:03 bartlettroscoe

The early document ThreadSafeRCP.atomics.pdf gives an example main() function that demonstrates the failure of thread-safety of the RCP software.

Mar 19 '16 02:03 bartlettroscoe

I fixed the debug build of the code and it passed all of the TeuchosCore unit tests. Next I will add some unit tests to check various use cases for thread safety. I think that the code is not thread safe in a debug-mode build when creating new RCP<T> objects in different threads but the unit tests will show this.

Mar 19 '16 03:03 bartlettroscoe

I have outlined below in a little more detail what I have done so far and what needs to be done next. The next task it to write multi-threaded unit tests for various use cases.

Detailed Notes:

I built the TeuchosCore subpackage on the branch thread-safe-rcp-229 (see PR #230) as of commit:

commit 1b1326a0edd4694f78daa050e89c17f8afba865f
Author: Roscoe A. Bartlett <[email protected]>
Date:   Fri Mar 18 22:50:35 2016 -0400

    Fix debug build (Trilinos #229)

    The code did not build for -DTrilinos_ENABLE_DEBUG=ON.  With this, the
    TeuchosCore NIGHTLY test suite passes.

M       packages/teuchos/core/src/Teuchos_RCPNode.hpp

and run the NIGHTLY test suite run with:

$ ./checkin-test-fissile4.sh --st-extra-builds= --no-enable-fwd-packages \
  --enable-package=TeuchosCore --test-categories=NIGHTLY --local-do-all

and it passed all the tests with:

Build test results:
-------------------
0) MPI_DEBUG => passed: passed=52,notpassed=0 (2.48 min)
1) SERIAL_RELEASE => passed: passed=47,notpassed=0 (1.19 min)

This tests that single-thread behavior is maintained, but there are no automated multi-thread tests at all.

We need to write some multi-threaded unit tests for the Teuchos MM classes to test thread-safety of all behavior, especially in debug mode builds. We need to come up with a good strategy for writing and running these tests. Really, we should run all of the unit tests with multiple threads to make sure that all of this functionality works. But I am afraid that many of these unit tests will need to be specially designed in order to create strong tests for multiple threads. Basically, any memory that can be accessed by more than one thread needs to have tests where multiple threads read and write to that memory at the same time. There are several objects that are shared between threads that are stored in the base RCPNode object:

Reference count (strong and week)
Ownership flag
Extra Data (stored as a raw pointer to an std::map object, often use for special deallocation policies in older code)
Dealloc policy object (embedded the the RCPNodeImpl<T,Dealloc_T> subclass object)

There are several test cases that need to be carefully tested. Some of the obvious tests include:

Incrementing and deincrementing RC in multiple threads and make sure RC is correct at the end (this is the test case that Tech-X showed at the end of the document ThreadSafeRCP.atomics.pdf)
Allocate new independent RCP objects in multiple threads and deallocate the object in those threads. (This will test if debug-mode RCP node tracing is thread safe, it is likely not right now.)
Create singleton to create new shared RCP object to call from multiple threads and then remove reference in multiple threads and make sure the shared object gets deleted cleanly.
Create singleton to create new shared RCP object with specialized deallocation policy to call from multiple threads and then remove reference in multiple threads and make sure shared object gets deleted cleanly.
Create new object with raw pointer in main thread, give to singleton to create new owning shared RCP object to call from multiple threads and then remove reference in multiple threads but set ownership to false in one thread and make sure that the object does not get deleted after all the threads finish.
Create singleton to create new shared RCP object with extra data to delete a dependent object that gets called in destructor then have reference count go to zero in each of the threads.
Create a unit test that shows usage of a dangling WEAK RCP in one thread when the underlying object is deleted in other thread with a STRONG RCP object.

The issue is that many of these multi-threaded tests will likely fail given what is there right now. What is most likely to fail are the unit tests where different RCPNode objects are created and destroyed in multiple threads when node tracing is turned on.

And we will also need to add automated tests like these involving for Teuchos classes ArrayRCP, ArrayView, Ptr, and the Tuple. That could be a lot of work.

Mar 21 '16 16:03 bartlettroscoe

@bartlettroscoe I've built https://github.com/bartlettroscoe/Trilinos/tree/thread-safe-rcp-229 and could help with adding unit tests. For starters I can add the Tech-X test case you refer to above.

Would it make sense to create new tickets for the different unit tests you propose?

Mar 28 '16 15:03 d-meiser

For starters I can add the Tech-X test case you refer to above.

@d-meiser, that would be fantastic! I created a skeleton executable for doing this. See ???. If you look at the other XXX_UnitTests.cpp files in that directory, you should see how to add unit tests. Just grab threads as with the Kitware example. Try not to use more than 4 threads for now, or these tests will not run by default on the Trilinos test machines. I can help with the details of running the tests.

Would it make sense to create new tickets for the different unit tests you propose?

I think we can use this ticket. I don't think we don't need that level of granularity. We are not doing SCRUM so we don't need to have the story broken up artificially to fit into a "sprint". This is a big story but I think that all the Teuchos MM classes need to be thread-safe, including in debug-mode, in order to provide real value. We may break off some Task tickets if there is a lot of independent streams of dialog.

Mar 28 '16 18:03 bartlettroscoe

I pull --rebased your branch and see the test skeleton. I'll use that to create the unit tests.

Should I open pull requests to your branch so you can update this PR?

Mar 28 '16 18:03 d-meiser

Starting work on the suggested multithread tests and had a few questions:

Allocate new independent RCP objects in multiple threads and deallocate the object in those threads. (This will test if debug-mode RCP node tracing is thread safe, it is likely not right now.)

I setup this test and got errors on the node tracing as you indicated would happen - so only multithreaded debug caused the node tracing to be mangled. My first idea is to add a single static mutex on the static addNewRCPNode and removeRCPNode() functions - not sure yet if it would be better/necessary to make this more efficient and reorganize the internals. Then found there was a more subtle issue where node_->delete_obj() can be called and occasionally, before removeRCPNode() can execute, another thread can create a new node in the same memory space and incorrectly think it was already added to the trace.

To fix this I moved (for debug only) the node_->delete_obj() call into the removeRCPNode() so it is protected by the mutex. This means delete_obj() and removeRCPNode() must happen as a pair now. I’m not exactly happy about that organization of delete_obj() since it doest cleanly parallel release, so will think on this, but felt some feedback might be best before I continue further on that test.

Create singleton to create new shared RCP object to call from multiple threads and then remove reference in multiple threads and make sure the shared object gets deleted cleanly.

I think we are talking about having the singleton not ‘own’ the data so if all threads are done, memory should be deallocated automatically at that time, not when the program ends, but wasn’t sure. I started with a setup where I have the threads constantly making requests and deleting so the counter of the shared RCP frequently goes to 0 (deallocates) or goes from 0 to 1 (allocates fresh new). Most useful would be any tips on the expected state of that ptr when no thread is using it to be sure I understand your idea - my first thought was it should end the test as a single weak count (strong = 0, weak = 1). Related to issue 7, I’m not clear yet on exactly what a weak ptr should be allowed to do when all strong counts go away. Also not sure if that was your intention here anyways.

Create singleton to create new shared RCP object with specialized deallocation policy to call from multiple threads and then remove reference in multiple threads and make sure shared object gets deleted cleanly.

I saw some examples in the non-multithreaded tests for this so I expect once i have #3 done, this should follow smoothly.

Create new object with raw pointer in main thread, give to singleton to create new owning shared RCP object to call from multiple threads and then remove reference in multiple threads but set ownership to false in one thread and make sure that the object does not get deleted after all the threads finish.

If you can clarify what you mean by setting ownership to false that would be great. On first inspection it seemed the has_ownership flag was designed to be set on constructor only - at least from the RCP standpoint. So it was not clear that there was a ‘good' way to change ownership of RCP after construction - I will look into this further.

Also you indicated the data would be allocated in main, then passed to the singleton. In case 3 for example, I i was initially thinking the singleton would handle the allocation. There may be subtle reasons you said it this way which I am missing - or perhaps if the main idea is to make sure the data is preserved when one ptr gives up ownership, it may not matter for this one.

Create singleton to create new shared RCP object with extra data to delete a dependent object that gets called in destructor then have reference count go to zero in each of the threads.

I think the prior unit tests as examples and any insight you have on #3 will give me a fairly clear path on this one.

Create a unit test that shows usage of a dangling WEAK RCP in one thread when the underlying object is deleted in other thread with a STRONG RCP object.

I tried having the main thread create an RCP, send a weak ptr of that RCP to a second thread, and then the first strong ptr goes out of scope. Now the second thread can be upset as the ptr will become invalid at some unexpected time. Does this sound like the right direction for this test? Same as #3, just starting to get a feel for what the weak ptr can do (or should never do) when the strong count is 0.

Thanks!

Apr 27 '16 14:04 MicheldeMessieres

hi Ross - one additional thing I have been doing is building some parallel cases using std::shared_ptr and std::weak_ptr. So for example, I see I can generate an intermittent problem with threads for the #3 singleton case with RCP which does not occur for a parallel case using shared_ptr/weak_ptr - which I will be investigating now. Does it make sense to describe my goal here as: We would like RCP to work for the same situations as shared_ptr and weak_ptr. I am becoming aware of some distinctions (such as RCP holds strong/weak as a pair) but if that idea is generally correct it can help guide me. If there are things shared_ptr/weak_ptr can do in a multithreaded context that we don't want to worry about for RCP, that information would be valuable for me to define the scope of this work. Many thanks.

Apr 27 '16 20:04 MicheldeMessieres

hi Ross - one additional thing I have been doing is building some parallel cases using std::shared_ptr and std::weak_ptr. So for example, I see I can generate an intermittent problem with threads for the #3 singleton case with RCP which does not occur for a parallel case using shared_ptr/weak_ptr - which I will be investigating now. Does it make sense to describe my goal here as: We would like RCP to work for the same situations as shared_ptr and weak_ptr. I am becoming aware of some distinctions (such as RCP holds strong/weak as a pair) but if that idea is generally correct it can help guide me. If there are things shared_ptr/weak_ptr can do in a multithreaded context that we don't want to worry about for RCP, that information would be valuable for me to define the scope of this work. Many thanks.

Michel, I am responding to your larger comment but I will address this question first ...

I had not thought about how strong and weak RCPs would relate in behavior to std::shared_ptr and std::weak_ptr. I suspect that the way RCP is currently set up will need some tweaking so that it has reasonable behavior in a multi-threaded application. The key issue is that with std::weak_ptr, you can't directly access the shared object. You have to create a strong owning std::shared_ptr first. The problem with the current implementation of the Teuchos::RCP classes, is that if you access the shared object through a weak RCP, then it will do a debugt-mode (atomic) assert that the shared object has not been deleted yet, and then return a reference. The problem is that the object could then be deleted by the other thread before the client is finished using it. Therefore, I think in a multi-threaded env, it may not be safe to access the object directly through a weak RCP object (at least not without some major and expensive magic).

For now, don't worry about the behavior of weak and strong RCP objects in a multi-threaded program. After we get basic thread safely working with all strong RCP objects, then I can go back and think how to address weak and strong RCPs in a multi-threaded env. It may be that you can't allow the direct access of an object through a weak RCP in a multi-threaded program (if you try, then it will throw an exception).

Thanks for thinking about this ahead of time!

Apr 27 '16 20:04 bartlettroscoe

hi - That makes sense. One thing I am struggling with is to determine exactly the scope (what things we need to solve and what things we don't). I think your responses on the specific tests will be most helpful so I clearly understand the purpose of the test and as we move along I'll have a better intuition on what is needed.

(Edited to make more concise) I suppose looking ahead we would need the equivalent of the weak_ptr.lock(), which allows conversion to shared_ptr or null if not available. So that would need thread protection and would also hurt performance.

However it sounds like that is an aside from our main goal and for the initial tests I will definitely follow your lead. Thanks.

Apr 27 '16 21:04 MicheldeMessieres

Starting work on the suggested multithread tests

Good deal!

Allocate new independent RCP objects in multiple threads and deallocate the object in those threads. (This will test if debug-mode RCP node tracing is thread safe, it is likely not right now.)

I setup this test and got errors on the node tracing as you indicated would happen - so only multithreaded debug caused the node tracing to be mangled. My first idea is to add a single static mutex on the static addNewRCPNode and removeRCPNode() functions - not sure yet if it would be better/necessary to make this more efficient and reorganize the internals.

A mutex is fine in debug-mode only code. We don't care so much about the runtime performance of debug-mode code. We would have to be very sloppy to make the Teuchos MM classes run as slowly in debug-mode as using a tool like valgrind. And since these functions only get called when the underlying object is created or destroyed, the overhead of a mutex should not be so bad compared to the dynamic memory allocation/deallocation work.

Then found there was a more subtle issue where node_->delete_obj() can be called and occasionally, before removeRCPNode() can execute, another thread can create a new node in the same memory space and incorrectly think it was already added to the trace.

That is interesting. What use cases does that happen for? What unit tests fail?

To fix this I moved (for debug only) the node_->delete_obj() call into the removeRCPNode() so it is protected by the mutex. This means delete_obj() and removeRCPNode() must happen as a pair now. I’m not exactly happy about that organization of delete_obj() since it doest cleanly parallel release, so will think on this, but felt some feedback might be best before I continue further on that test.

We can look at how to refactor this more cleanly once we get all of the tests working. Initially, I don't mind if it is a little messy.

Create singleton to create new shared RCP object to call from multiple threads and then remove reference in multiple threads and make sure the shared object gets deleted cleanly.

I think we are talking about having the singleton not ‘own’ the data so if all threads are done, memory should be deallocated automatically at that time, not when the program ends, but wasn’t sure.

I think for this to be a good test, the singleton would need to give up ownerhship after all of the clients were given an RCP to the shared object. What we want this test to do is to make sure that a shared object gets deleted correctly when multiple threads all race to delete it at the same time (i.e. remove their reference counts, not actually call 'delete').

I started with a setup where I have the threads constantly making requests and deleting so the counter of the shared RCP frequently goes to 0 (deallocates) or goes from 0 to 1 (allocates fresh new). Most useful would be any tips on the expected state of that ptr when no thread is using it to be sure I understand your idea - my first thought was it should end the test as a single weak count (strong = 0, weak = 1).

Another way to accomplish this without a singleton might be to just create the object with a strong RCP in the main thread, and then give the RCP to a bunch of slave threads and then the main thread would set its copy to null then wait on the slave threads which will remove their references. I think that would create the race to remove references and a race to delete between threads. Does that make sense?

Related to issue 7, I’m not clear yet on exactly what a weak ptr should be allowed to do when all strong counts go away. Also not sure if that was your intention here anyways.

In normal code, if the strong count goes to 0, then the managed object would be deleted. The semantics of a weak RCPs are not well defined in threaded code (see my previous comment).

Create singleton to create new shared RCP object with specialized deallocation policy to call from multiple threads and then remove reference in multiple threads and make sure shared object gets deleted cleanly.

I saw some examples in the non-multithreaded tests for this so I expect once i have #3 done, this should follow smoothly.

Yes, this is really the same as case-3, just with a custom dealloation policy.

Create new object with raw pointer in main thread, give to singleton to create new owning shared RCP object to call from multiple threads and then remove reference in multiple threads but set ownership to false in one thread and make sure that the object does not get deleted after all the threads finish.

If you can clarify what you mean by setting ownership to false that would be great. On first inspection it seemed the has_ownership flag was designed to be set on constructor only - at least from the RCP standpoint. So it was not clear that there was a ‘good' way to change ownership of RCP after construction - I will look into this further.

Teuchos::RCP has the ability to allow any client to remove ownership of the RCP to delete the shared managed object by calling the member function release():

https://trilinos.org/docs/dev/packages/teuchos/doc/html/classTeuchos_1_1RCP.html#a281058a75338ef844e6db99353bd6bfd

This was added 15 years ago or so to address some really messed up applications that had really ugly memory management. This allows you to change your mind about how the object gets deleted after the first RCP object is created (you can't do that with std::shared_ptr). But you pay the price of the storage of an extra bool in the RCPNode object to allow that.

Also you indicated the data would be allocated in main, then passed to the singleton. In case 3 for example, I i was initially thinking the singleton would handle the allocation. There may be subtle reasons you said it this way which I am missing - or perhaps if the main idea is to make sure the data is preserved when one ptr gives up ownership, it may not matter for this one.

It does not matter how the initial allocation is done. We just want to test the race condition to set ptr=null in each thread and make sure that the object gets deleted correctly, even in debug-mode with tracing.

Create singleton to create new shared RCP object with extra data to delete a dependent object that gets called in destructor then have reference count go to zero in each of the threads.

I think the prior unit tests as examples and any insight you have on #3 will give me a fairly clear path on this one.

Similar to case-3 and case-4, except let an "extra object" do the deletion. Support for "extra data" was added about 14 years ago in order to address some very ugly use cases with badly structured C++ code that existed at the time.

Create a unit test that shows usage of a dangling WEAK RCP in one thread when the underlying object is deleted in other thread with a STRONG RCP object.

I tried having the main thread create an RCP, send a weak ptr of that RCP to a second thread, and then the first strong ptr goes out of scope. Now the second thread can be upset as the ptr will become invalid at some unexpected time. Does this sound like the right direction for this test? Same as #3, just starting to get a feel for what the weak ptr can do (or should never do) when the strong count is 0.

Because of the uncertainty about the behavior of weak RCPs (as described above), I think we can hold off on this unit test for threaded code.

Apr 28 '16 14:04 bartlettroscoe

One thing I am struggling with is to determine exactly the scope (what things we need to solve and what things we don't).

Since we are conducting this work in an agile way, the scope does not need to be fixed. We will just do the highest priority work and then see if there is time and funding to do more.

Apr 28 '16 14:04 bartlettroscoe

Thanks - I am going to be working on this now and will respond today once I can try your ideas. It sounds like the 2nd test case for node tracing is on track and I think these notes can guide me well.

You asked about the node->delete_obj() and which test cases it fails for. To clarify, my interpretation was that this was a multithreading issue only - so initially I only looked at that new test I made which allocates and deletes nodes constantly in multithreading/debug mode. I assume the other tests I will be adding would also trigger this issue in debug mode. I think the fix resolves it but have not extensively tested yet and will keep you updated. Please let me know if that does not address your question.

Apr 28 '16 14:04 MicheldeMessieres

hi - To answer your question more explicitly - All tests seemed fine in debug with or without the new changes, except this one new test I created. The new test can have random memory problems so only seems ok with the new fixes (mutex and moving delete_obj) into the mutex lock. So I am concluding this particular fix is only relevant to Debug + Multithreading situations - and I think that makes sense for this situation.

Apr 28 '16 15:04 MicheldeMessieres

So I am concluding this particular fix is only relevant to Debug + Multithreading situations - and I think that makes sense for this situation.

Yes, of course. Thanks for pointing that out.

Apr 28 '16 15:04 bartlettroscoe

hi - here's an update.

Working on issue #3. I set up as you suggested - so a single RCP is passed to several threads, then sets to null in the main thread (while others are still running). When the threads stop they can cause a race condition and skip deleting the final object. It's rare and in my current setup I often loop the entire test thousands of times before it hits - so can try to optimize it. But I suspect we may have several rare race conditions to deal with in the delete path.

The most common one I hit and am investigating first is probably mostly happening in the unbind() function. Two threads race so initially strong count is 2, they both decrement together so count is 0, they both trigger the unbindOne(), then both increment (count is 2), then they enter unbindOne and skip delete which looks for count 1. Seems we can also have a double delete in some cases but still figuring it out.

I think deincr_count() can race internally and maybe it would be better not to have this return anything so we always check the count as a separate inline. But I think generally there is a deeper problem here which will require some control scheme to make sure the delete process is controlled by a single owner. Still working to figure out the issues and will try to make at least one solution, if not the best, to convince myself I've found all the problems.

Apr 28 '16 23:04 MicheldeMessieres

Also a philosophical question: For a system like this would it ever be desirable to ignore a rare crash or conflict issue in favor of efficiency? I can imagine that for normal use multithreaded cross over would occur in a way to make these errors rare and perhaps exceptionally rare.

Apr 28 '16 23:04 MicheldeMessieres

Exceptional rare is not a very nice argument if you want to run for a month on a million cores. Even if it would only fail every 100 million thread hours, that means you get this thing crashing every 4 days. So I believe this has to be absolute thread safe, no short cuts allowed. We also want it efficient in release builds.

Apr 29 '16 00:04 crtrott

I agree with @crtrott, failure of our software in field reflects exceptionally badly on the lab and the project, worse, because it cannot always be detected it creates fire drill situations which pull in many people to hunt for the potential bugs. In my view everything we do should be prioritized to provide robust software before performance. Being able to depend on a piece of software is frequently the most important thing users tell us they look for when deciding what can/cannot be used with their code.

Apr 29 '16 02:04 nmhamster

I agree with @crtrott, failure of our software in field reflects exceptionally badly on the lab and the project, worse, because it cannot always be detected it creates fire drill situations which pull in many people to hunt for the potential bugs. In my view everything we do should be prioritized to provide robust software before performance. Being able to depend on a piece of software is frequently the most important thing users tell us they look for when deciding what can/cannot be used with their code.

Yes, the thread-safe reference counting has to be 100% correct and robust and can never fail. If that requires some refactoring or even breaking some backward compatibility, then that is what needs to happen. For example, as mentioned above, that may mean that we can't allow direct object access through a WEAK RCP object. However, I can think of an idiom that for usage of a possible WEAK RCP that will be fairly efficient and still allow for transparency for if an RCP is WEAK or STRONG (as apposed to std::shared_ptr and std::weak_ptr which are more rigid). (Basically, you could create an ActiveRCP stack-based class that would create a WEAK RCP when the stored RCP was WEAK but would directly use the stored RCP when it was STRONG.)

But if we break backward compatibility, then we can't make this the default until the Trilinos 13.0 release. There are a variety of ways to do that (ifdef at configure time, or template argument to RCP objects, etc.). But on this branch we just need to make the Teuchos MM classes 100% thread safe then we will refactor this to address these other concerns. Once we have passing strong tests, such refactoring should be pretty straightforward.

Apr 29 '16 13:04 bartlettroscoe

I'm completely on board and wanting to get a broader perspective how the addition of multithreaded safety changes the design of these projects. I expect its desirable to keep direct sharing of memory as high level as possible and thinking from a user perspective. I'm curious whether turning off a thread safety option to trade performance for risk would ever come up in any scenario (perhaps in a testing scenario for a research project). But it's just philosophical and from a practical point of view, I consider my task to find perfect logical solutions to these problems. Ross, I'll need to reflect on your statement about weak/strong some more but all these points are very useful. Thanks.

Apr 29 '16 13:04 MicheldeMessieres

Also a philosophical question: For a system like this would it ever be desirable to ignore a rare crash or conflict issue in favor of efficiency? I can imagine that for normal use multithreaded cross over would occur in a way to make these errors rare and perhaps exceptionally rare.

The strategy would be to make it correct first (without using a mutex everywhere), and then refactor to make it faster once we have all the strong unit tests in place. I think it would be okay initially to put in a mutex when the RCP decided it wanted to delete the shared object. That mutex would be amortized against the dynamic memory management and most programs should not be constantly creating and deleting little objects so that overhead might be acceptable (but timing will tell). What we can't afford is a mutex for every reference count update (which was the initial proposal).

Apr 29 '16 13:04 bartlettroscoe

I'm curious whether turning off a thread safety option to trade performance for risk would ever come up in any scenario

I think there might be cases. To support that, we might have an optional template argument that determines if the MM objects are thread-safe or not. The default (at Trilinos 13.0) should be thread-safe, but we could allow some client code to create non-threadsafe MM objects if they know their usage is not going to be involved in multi-threading. That will complicate the design, testing, and maintenance of the Teuchos MM classes, but it might be worth it. We will have to see.

But it's just philosophical and from a practical point of view, I consider my task to find perfect logical solutions to these problems.

That is a good goal but often it takes many iterations to get there. The first iteration should be just to make the reference counting 100% correct and robust with very strong unit testing. Then in later iterations we can refactor to make faster and clean up the code. With the strong unit tests for thread-safety in place, these refactorings will be very safe to perform.

Apr 29 '16 13:04 bartlettroscoe

Why can't we just base something off the reference counting scheme of the new view implementation (which doesn't use a singleton scheme with centralized reference count objects)? That scheme is fairly fast and well tested.

Apr 29 '16 14:04 crtrott

Why can't we just base something off the reference counting scheme of the new view implementation (which doesn't use a singleton scheme with centralized reference count objects)? That scheme is fairly fast and well tested.

Does that scheme have node tracing implemented with debug-mode checking for dangling references? That is a big part of helping developers and users debug problems with their usage of the MM classes. In an optimized build, there is no node tracing (and therefore no checking for invalid usage). That is the main value of Teuchos RCP over std::shared_ptr. For example, there are no tools in std::shared_ptr to help you track down circular references between objects.

Apr 29 '16 14:04 bartlettroscoe

Why can't we just base something off the reference counting scheme of the new view implementation (which doesn't use a singleton scheme with centralized reference count objects)?

We could do that, as long as we don't actually make Teuchos have a required dependency on Kokkos. Other projects that do not currently require C++11 depend on Teuchos and use Teuchos::RCP etc.

Apr 29 '16 18:04 mhoemmen

Also a philosophical question: For a system like this would it ever be desirable to ignore a rare crash or conflict issue in favor of efficiency? I can imagine that for normal use multithreaded cross over would occur in a way to make these errors rare and perhaps exceptionally rare.

Can you compute a probability, and the uncertainty in the probability? Remember that this is a function of

the hardware memory model (x86, POWER, ...)
an operating system version
current load on the machine
the number concurrently executing threads
the workload of those threads

If you can't, then don't say "normal use" or "rare."

Apr 29 '16 18:04 mhoemmen

Making progress and resolved test 3. Now that I understand the flow better this can be resolved (for strong count only or weak count only) with some trivial restructuring. Still have an issue with mixed weak/strong counts which I hope to be working on soon.

I have a more general question about unit tests and acceptable timing. I've been trying to optimize these tests so they can trigger the race conditions more efficiently. Spin locking all the threads and releasing them simultaneously helps a lot but I'm still looking at several seconds to reliably trigger this (on my mac). I was wondering if spin locking all the test threads should be definitely avoided for any reason?

As has been pointed out, this will be architecture dependent as well. It may be difficult to make the test work in a suitable time frame. For now I am not going to worry about this too much and progress with getting things in place, but I am interested in how we may prefer to handle tests if they require longer run times to trigger a fail event.

Thanks.

May 02 '16 19:05 MicheldeMessieres

Making progress and resolved test 3. Now that I understand the flow better this can be resolved (for strong count only or weak count only) with some trivial restructuring.

Excellent!

I have a more general question about unit tests and acceptable timing. I've been trying to optimize these tests so they can trigger the race conditions more efficiently. Spin locking all the threads and releasing them simultaneously helps a lot but I'm still looking at several seconds to reliably trigger this (on my mac). I was wondering if spin locking all the test threads should be definitely avoided for any reason?

Don't worry about how long they take right now, as long as it is not excessive (i.e. a minute or more). We have different test categories that we can put on tests so we may only run the more expensive tests in NIGHTLY testing (see the CATEGORIES NIGHTLY).

As has been pointed out, this will be architecture dependent as well. It may be difficult to make the test work in a suitable time frame. For now I am not going to worry about this too much and progress with getting things in place, but I am interested in how we may prefer to handle tests if they require longer run times to trigger a fail event.

If we need to, we can tweak the parameters for the most detailed testing based on the architecture (CMake can tell use something about this) or we can just run the most detailed testing on a few known nightly testing machines to avoid guessing.

May 02 '16 22:05 bartlettroscoe