Thread parking for task_queue
I've had a go at implementing the thread parking idea. I've moved most of the data members (other than the actual `queue<function<void()>>`) out of the `task_queue` class and into a `task_queue_thread` class. The destructor of the `task_queue` class now "parks" the `task_queue_thread` instance that it owned in a static pool, and the `task_queue` constructor takes a `task_queue_thread` out of that pool in preference to constructing a new one. The parking mechanism operates in parallel to the existing stopping mechanism.
~~However it appears that there's a deadlock of some sort occurring, as the unit tests time out after showing little to no CPU usage. I'll investigate more tomorrow.~~
NB. In common with #432 this PR has fork_island disabled.
This now gives a 15-20% speedup for my workload.
@ow97 sorry for taking longer than anticipated.
I think I understand the idea, and the general approach is sound, but I am a bit lost in the implementation details. Unfortunately this is a part of the code which is really critical to the well-functioning of pagmo, and it's even riskier than usual to modify because of the multi threading aspect. I would be more comfortable if we managed to simplify the implementation while proceeding with more gradual steps.
In this spirit, I have opened #435 with some (long overdue) simplifications for the task_queue class. These are just simplifications (and some C++17-isms), there's no new functionality in there. Let me know if you have questions about it.
For implementing the thread parking functionality, I would propose the following (based on the ideas in your PR):
- modify the island class to store unique ptrs to a `task_queue` (instead of `task_queue` objects directly). This would be in preparation for the parking functionality: instead of having a separate `task_queue_thread` class, let's just directly deal with pointers to a `task_queue`, so that we can store existing queues in the cache and re-use them as needed (via a factory function, as done in your PR);
- the `m_stop` member of the queue can become an enum `m_status` (in the same fashion as done in your PR) with a bunch of different possible queue states;
- try to implement the parking functionality without adding new condition variables, just re-using the existing one in `task_queue`.

We can discuss this in more detail on gitter if you like.
As an additional comment, instead of using the TBB concurrent queue, I'd rather have a first implementation with just a std::vector of unique ptrs to task_queue paired to a mutex, for implementing the cache. If we see this becomes a bottleneck, we can always switch to a better performing solution.
Let me know what you think about this plan, and thanks again for taking the time to look into this.
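To make the proposed shape concrete, here is a minimal sketch of the cache-plus-factory idea. All names (`get_queue`, `park_queue`, `queue_cache`) are illustrative, not pagmo's actual code, and the `task_queue` here is a stand-in for the real class:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for pagmo's task_queue; the real class owns a worker thread.
struct task_queue {
    bool parked = false;
};

namespace {
// The cache: a plain mutex-protected vector of parked queues.
std::mutex cache_mutex;
std::vector<std::unique_ptr<task_queue>> queue_cache;
} // namespace

// Factory: re-use a parked queue if available, otherwise construct a new one.
std::unique_ptr<task_queue> get_queue()
{
    std::lock_guard<std::mutex> lock(cache_mutex);
    if (!queue_cache.empty()) {
        auto q = std::move(queue_cache.back());
        queue_cache.pop_back();
        q->parked = false;
        return q;
    }
    return std::make_unique<task_queue>();
}

// Called on island destruction: park the queue instead of destroying it.
void park_queue(std::unique_ptr<task_queue> q)
{
    q->parked = true;
    std::lock_guard<std::mutex> lock(cache_mutex);
    queue_cache.push_back(std::move(q));
}
```

The point of the `unique_ptr` indirection is that a queue (and its thread) can move between an island and the cache without ever being copied or re-created.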
@ow97 sorry for taking longer than anticipated.
No worries. I myself have a busy couple of weeks coming up, so I'll have to come back to this in a little while.
I think I understand the idea, and the general approach is sound, but I am a bit lost in the implementation details. Unfortunately this is a part of the code which is really critical to the well-functioning of pagmo, and it's even riskier than usual to modify because of the multi threading aspect. I would be more comfortable if we managed to simplify the implementation while proceeding with more gradual steps.
I completely understand your reservations.
In this spirit, I have opened #435 with some (long overdue) simplifications for the `task_queue` class. These are just simplifications (and some C++17-isms), there's no new functionality in there. Let me know if you have questions about it.
Looks good to me :+1:
For implementing the thread parking functionality, I would propose the following (based on the ideas in your PR):
* modify the island class to store unique ptrs to a `task_queue` (instead of `task_queue` objects directly). This would be in preparation for the parking functionality: instead of having a separate `task_queue_thread` class, let's just directly deal with pointers to a `task_queue`, so that we can store existing queues in the cache and re-use them as needed (via a factory function, as done in your PR);
* the `m_stop` member of the queue can become an enum `m_status` (in the same fashion as done in your PR) with a bunch of different possible queue states;
* try to implement the parking functionality without adding new condition variables, just re-using the existing one in `task_queue`. We can discuss in more detail about this on gitter if you like.
I think I prefer this architecture to the one I ended up implementing, especially given this avoids the need for both a `task_queue` and a `task_queue_thread` class.
As an additional comment, instead of using the TBB concurrent queue, I'd rather have a first implementation with just a `std::vector` of unique ptrs to `task_queue` paired to a mutex, for implementing the cache. If we see this becomes a bottleneck, we can always switch to a better performing solution.
I'd be interested to hear more about your thinking on this. For my workload the number of islands being created meant that significant time was spent waiting for the mutex on the container; the lock free TBB queue improved performance by 2-5%.
Let me know what you think about this plan, and thanks again for taking the time to look into this.
Sounds like a good plan to me. I'll get an updated PR sorted, hopefully in a couple of weeks' time. I've no objections to #435 being merged in the meantime.
I'd be interested to hear more about your thinking on this. For my workload the number of islands being created meant that significant time was spent waiting for the mutex on the container; the lock free TBB queue improved performance by 2-5%.
Ok that's a good datapoint to know.
My reservation came from reading the docs of the TBB `concurrent_queue` class and not understanding exactly what kind of semantics it provides for the `try_pop()` function. The docs here
https://software.intel.com/en-us/node/506200
say
If value is available, pops it from the queue, assigns it to destination, and destroys the original value. Otherwise does nothing.
and I could not really understand what was going on with that assignment. A copy assignment could not possibly work for `std::unique_ptr`, so somehow a move must be happening instead, but it's not clear from the docs. I think this is related to the build failure we are seeing on CircleCI with Clang 6, where it complains about a deleted assignment operator.
In any case I think we can get to the bottom of this and use a lock-free data structure (either TBB or perhaps even Boost's, if TBB proves problematic).
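For what it's worth, the move semantics in question can be illustrated without TBB at all. Below is a minimal mutex-based queue (purely illustrative, not the TBB implementation) whose `try_pop()` moves into the destination, which is what makes it usable with `std::unique_ptr`; a copy-assigning pop would fail to compile for a move-only element type:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Minimal stand-in mirroring the try_pop() contract described in the TBB
// docs: if a value is available, transfer it into `destination` and return
// true; otherwise do nothing and return false.
template <typename T>
class locked_queue {
    std::queue<T> m_queue;
    std::mutex m_mutex;

public:
    void push(T value)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queue.push(std::move(value));
    }
    bool try_pop(T &destination)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_queue.empty()) {
            return false;
        }
        // Must be a move: `destination = m_queue.front();` would require a
        // copy assignment, which std::unique_ptr deletes.
        destination = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }
};
```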
Sounds like a good plan to me. I'll get an updated PR sorted, hopefully in a couple of weeks' time. I've no objections to #435 being merged in the meantime.
:+1:
@bluescarni I've implemented your suggestions, and everything seems to be working apart from `fork_island`. Any ideas on what needs doing to get that working?
So, as you'll see, I've tried out the following thread park data structures:

- `tbb::concurrent_queue` - causes the annoying build failure under Clang 6
- `boost::lockfree::queue` - requires a trivial destructor, so we can't use `unique_ptr`, and my manual deleter implementation was insufficient to avoid SEGFAULTs
- `std::queue` protected by a `std::mutex` - works, but with a performance hit of 2-5% compared to `tbb::concurrent_queue` for my workload
I've also implemented a fix for `fork_island`, though if you have suggestions for a cleaner fix I'd be happy to take them on board.
@ow97 I think this is going in the right direction.
I have two main comments:
- if I understand correctly, the problem with `fork_island` is the destruction of the static queue of task queues, which creates a deadlock. Could you try, in `fork_island.cpp`, to exit the remote process not with `std::exit()` (as it is done now) but with `std::_Exit()`? This should ensure that no destructors of static objects are called:
https://en.cppreference.com/w/cpp/utility/program/_Exit
I would prefer to adopt this solution (if it works) in conjunction with a global queue object to be declared directly in `task_queue.cpp` in an anonymous namespace, rather than the current approach of using the `get_park_q()` helper.
- I am a bit concerned about the fact that the implementation of the destructor of `island_data` has changed not to do any waiting, if I am reading the code correctly. I.e., whereas previously destroying an island forced a `join()` on the island thread, this is not the case any more. This is a rather big behavioural change and I am unsure about its implications. I would prefer if the implementation waited to consume all the queued tasks of an island before the destructor returns.
Regarding the performance penalty of using `std::mutex`, perhaps you could try to see if replacing it with a spinlock from TBB improves the runtime behaviour? Just an idea...
Thanks as ever @bluescarni
if I understand correctly, the problem with `fork_island` is the destruction of the static queue of task queues which creates a deadlock.
I'll give that a try. I'd also be keen to try just using raw pointers instead of `unique_ptr`s, meaning that destruction of the static `std::queue` doesn't call the destructors of the `task_queue`s. This approach would avoid any future issues with other static resources not being destructed on exit of the child process. What do you think?
I would prefer to adopt this solution (if it works) in conjunction with a global queue object to be declared directly in `task_queue.cpp` in an anonymous namespace, rather than the current approach of using the `get_park_q()` helper.
I'm happy to go with either solution. I'm in the habit of using the COFU (construct on first use) idiom for situations like this to avoid possible (in my experience often subtle) initialisation order problems, even if it is less visually appealing.
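For reference, the construct-on-first-use pattern under discussion looks roughly like this (the helper name follows the PR's `get_park_q()`, but the body is a hypothetical sketch, using raw pointers as floated above):

```cpp
#include <cassert>
#include <queue>

// Forward declaration; the real class lives in task_queue.cpp.
struct task_queue;

// Construct-on-first-use: the function-local static is initialised on the
// first call (thread-safely since C++11), which sidesteps the static
// initialisation order problems that a namespace-scope global can suffer
// when it is touched from another translation unit's static initialisers.
std::queue<task_queue *> &get_park_q()
{
    static std::queue<task_queue *> park_q;
    return park_q;
}
```

The trade-off named in the thread: a plain global in an anonymous namespace is more direct and readable, while the function-local static buys deterministic initialisation on first use.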
I am a bit concerned about the fact that the implementation of the destructor of `island_data` has changed not to do any waiting.
Yes, I agree this is a change in behaviour. I'm happy to ensure that we maintain the old behaviour. I hadn't picked up on this as none of the unit tests had flagged it up; my mistake.
I'll give that a try. I'd also be keen to try just using raw pointers instead of `unique_ptr`s, meaning that destruction of the static `std::queue` doesn't call the destructors of the `task_queue`s. This approach would avoid any future issues with other static resources not being destructed on exit of the child process. What do you think?
I think that would work (modulo memory "leaks" on shutdown that may be annoying to ignore in the memory-checking CI builds), but I also think that we should not bend over backwards for `fork_island`. This island exists only as a hackish workaround for problems/algorithms which are thread unsafe. Because of its limitations, and because it is not portable, perhaps the best solution (eventually) would be to just replace it with something like an `mpi_island` instead.
I'm happy to go with either solution. I'm in the habit of using the COFU idiom for situations like this to avoid possible (in my experience often subtle) initialisation order problems, even if it is less visually appealing.
While I agree in the general case, my impression in this specific case is that we don't have dependencies between static variables to worry about. But if you prefer returning a function-local static, that works for me too.
Yes, I agree this is a change in behaviour. I'm happy to ensure that we maintain the old behaviour. I hadn't picked up on this as none of the unit tests had flagged it up; my mistake.
No mistake, we don't really check explicitly for this.
I think that one issue that could occur with the new behaviour is that the task queue ends up running evolutions after an island is destroyed. If that happens, then the tasks will try to copy the evolved populations back to the island, which will result in a crash because the island does not exist any more.
memory "leaks" on shutdown that may be annoying to ignore in the memory-checking CI builds
@bluescarni are you happy that this doesn't introduce too much noise based on the last CI runs?
I think that one issue that could occur with the new behaviour is that the task queue ends up running evolutions after an island is destroyed
This should now be resolved.
@ow97 I have a couple of comments:
- I am a bit surprised that the builds with the address sanitizer are not reporting leaks. Now I am unsure whether we are checking for leaks at all or only for invalid memory accesses. In any case, it seems to me like a possible strategy for avoiding leaking the naked pointers in the global queue would be to wrap the queue in a holder struct and then define the struct destructor to clean up the pointers. E.g., something like:
```cpp
struct queue_holder {
    boost::lockfree::queue<task_queue *> m_queue;
    ~queue_holder()
    {
        // Do the cleanup here, i.e., delete any still-parked queues.
        task_queue *tq;
        while (m_queue.pop(tq)) {
            delete tq;
        }
    }
};
```
Because the global queue is an object with static storage duration, its destructor will be called after the destruction of all the islands, and thus it should be enough to just delete the task queue objects.
In order to accommodate `fork_island`, there should probably be a helper function that allows you to clear the `m_queue` member, so that we avoid the cleanup in the remote process.
- Do we need to have two different condition variables? Cannot we use the same condition variable for signalling stopping and parking?
- Is there a specific reason to use `notify_all()` rather than `notify_one()`?
In any case, it seems to me like a possible strategy for avoiding leaking the naked pointers in the global queue would be to wrap the queue in a holder struct and then define the struct destructor to clean up the pointers.
:+1:
Do we need to have two different condition variables? Cannot we use the same condition variable for signalling stopping and parking?
We're communicating about two different events in two different directions, so by my understanding we can either use one condition variable and use `notify_all()` everywhere (and accept some extra spurious wakeups), or we could use two condition variables (one for each type of event) and use `notify_one()`, safe in the knowledge that there should only be a single waiting thread. Either way, I agree the inconsistency needed clearing up.
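A minimal illustration of the single-condition-variable option (hypothetical names, not pagmo code): two predicates signalled in opposite directions share one condition variable, so every notification is a `notify_all()`, since a `notify_one()` could wake the wrong waiter:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Two events, two directions, one condition variable.
struct shared_signal {
    std::mutex m;
    std::condition_variable cv;
    bool stop = false; // signalled main -> worker
    bool done = false; // signalled worker -> main

    void worker()
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return stop; }); // wait for the stop signal
        done = true;
        // notify_all(): with a shared CV we cannot know which waiter
        // notify_one() would pick, so we wake everyone and rely on the
        // wait predicates to filter out spurious wake-ups.
        cv.notify_all();
    }

    void stop_and_wait()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        cv.notify_all();
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done; });
    }
};
```

With two dedicated condition variables, each `wait` has exactly one possible notifier, which is what makes `notify_one()` safe in the alternative design.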
@ow97 Sorry for the delay...
Here are my comments:
- it seems like the `m_park` member is not initialised by the `task_queue` constructor?
- in `unpark_or_construct()` the `tq->m_park = false;` write operation is not protected by any mutex. It seems to me that there is no danger of concurrent reads/writes by other threads, but for consistency it would be better to have all operations on a queue protected by its mutex.
- I don't think it's necessary to use `boost::function` in the destructor of the lockfree queue wrapper; I think it's enough to pass in a lambda?
I think we are getting there. I still feel like it would be better to have a single enum encapsulating the current state of the queue, rather than the two `m_stop` and `m_park` members. I am unsure if/how that would complicate things, though.
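A sketch of what the single-enum version might look like (the state and function names here are invented for illustration, not taken from the PR):

```cpp
#include <cassert>

// One value describing the queue's lifecycle, replacing the independent
// m_stop and m_park booleans (which allow nonsensical combinations such as
// stopped-and-parked at the same time).
enum class queue_status {
    running,  // worker thread consuming tasks
    parked,   // thread idle in the global cache, awaiting re-use
    stopping  // thread asked to terminate (queue destruction)
};

// The worker loop can then wait on a single predicate over one state
// variable, instead of checking two flags.
inline bool should_wake(queue_status s, bool tasks_pending)
{
    return tasks_pending || s != queue_status::running;
}
```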
@ow97 Sorry for the delay...
No worries!
it seems like the `m_park` member is not initialised by the `task_queue` constructor?
Agreed. Resolved.
in `unpark_or_construct()` the `tq->m_park = false;` write operation is not protected by any mutex. It seems to me that there is no danger of concurrent reads/writes by other threads, but for consistency it would be better to have all operations on a queue protected by its mutex.
Agreed about the safety. Agreed about consistency. Changed.
I don't think it's necessary to use `boost::function` in the destructor of the lockfree queue wrapper; I think it's enough to pass in a lambda?
You're quite correct. Changed.
I think we are getting there.
You're a very patient person! I'm glad to hear you can see light at the end of the tunnel :) Thanks for your continued help.
I still feel like it would be better to have a single enum encapsulating the current state of the queue, rather than the two `m_stop` and `m_park` members. I am unsure if/how that would complicate things, though.
See https://github.com/esa/pagmo2/pull/433/commits/701a84230f6dc4f0dcb9c5d9cfbaf4b71e7681fd for my take on this suggestion.
We got there in the end. I shouldn't be so eager to commit late at night.
@bluescarni ready for review.
@bluescarni how close do you think we are to something worth merging?
Hi @bluescarni, I've been developing against a build of this branch since July with no issues. What are your feelings about merging this?
@ow97 sorry that this is taking so long, we are wrapping up work on another project that took most of our coding time in the last few months.
As soon as we restart feature development for pagmo, I'll make sure that we land this.
@ow97 sorry for taking so long getting back to this...
I have an alternative implementation of this feature up at #512, it re-uses many of the ideas here but it also differs in two main aspects:
- there is no additional condition variable and no additional queue statuses. Rather, there is a new `wait_all()` method that blocks until all tasks in a queue have been consumed. After this method is invoked from the `island_data` destructor, the queue is pushed into the global cache;
- there is a new mechanism to automatically clean up the global queue cache in the child of a forked process (rather than the ad-hoc cleanup action performed within `fork_island` for this purpose). This means that (in theory) all uses of `fork()` (e.g., from Python when using the multiprocessing module) should be safe wrt the global queue cache (I still haven't tested this in practice though).
I would be glad if you could take a look at/test the new PR, in particular with respect to the usage of a single condition variable. I do believe that the queue thread and an external thread can never be waiting on the same condition variable at the same time, and thus the usage of `notify_one()` should be ok, but I would love to hear your opinion on the matter.
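A rough sketch of how a `wait_all()` built on the queue's single condition variable might look (illustrative names only, not the actual #512 code; the real worker loop is a dedicated thread, elided here):

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

class task_queue_sketch {
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<std::function<void()>> m_tasks;

public:
    void push(std::function<void()> f)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.push(std::move(f));
        }
        m_cond.notify_one();
    }
    // Pop and run a single task, if any (the worker thread's loop body).
    bool run_one()
    {
        std::function<void()> f;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_tasks.empty()) {
                return false;
            }
            f = std::move(m_tasks.front());
            m_tasks.pop();
        }
        f();
        m_cond.notify_one(); // may wake a wait_all() waiter
        return true;
    }
    // Block until every queued task has been consumed; invoked from the
    // island_data destructor before the queue goes into the global cache.
    void wait_all()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this]() { return m_tasks.empty(); });
    }
};
```

The single-CV claim in the message above corresponds to the fact that the worker only ever waits for new tasks while `wait_all()` only ever waits for an empty queue, and those two waits cannot both be blocked simultaneously.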