How dispatch_barrier_async works
…and why I think it cannot be reproduced with a (simple) subclass of NSOperationQueue.
Luckily we can see the source code of libdispatch at http://opensource.apple.com/source/libdispatch/libdispatch-187.9/ and see how Apple implements barriers in libdispatch.
If we go to src/queue.c and look for the functions dispatch_barrier_async and dispatch_barrier_async_f, we can see that each block submitted to a queue is stored in a private dispatch_continuation_t type along with two flags. The interesting one is obviously DISPATCH_OBJ_BARRIER_BIT. That flag is used in several other functions, like the obvious internal dispatch_barrier_sync helpers, and in another place that interests us.
That place is the function _dispatch_queue_drain, which seems to be the function that selects the next dispatch_continuation_t structure to be processed in a queue. In that function, if the continuation has the BARRIER_BIT set and the queue is running (meaning that some block is being executed), the function does some cleanup and exits without executing that block. I cannot find exactly where, but this function is presumably called again later, when the block currently being executed finishes, a bit like a spin lock. If nothing is running at that moment, the continuation is popped from the list, the work count is incremented, and the barrier block is executed (inline, in the thread that is running drain). Meanwhile, normal blocks fall through to _dispatch_continuation_redirect, which finds an alternative “queue” in which to enqueue the block (I’m not sure, but I think everything ends up in the global queues). (One clever detail: for serial queues, where dq_width == 1, the blocks are also executed inline in the thread that is running drain.)
The interesting thing is how the barrier’s exclusivity is obtained: to execute the barrier, the queue must not be running any blocks, and since the barrier runs inline in the draining code, no other block can run on that queue in the meantime (no other thread can start processing the queue, because it cannot atomically flip dq_running from 0 to 1 while the barrier is executing).
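The drain behaviour described above can be condensed into a toy model (plain Python standing in for the C internals; `ToyQueue`, `drain`, and every name below are illustrative, not libdispatch’s real types — `running` plays the role of dq_running, and “start” plays the role of _dispatch_continuation_redirect):

```python
from collections import deque

class ToyQueue:
    """Toy sketch of _dispatch_queue_drain's barrier handling.

    Not real libdispatch: 'running' stands in for dq_running, and
    starting an item stands in for _dispatch_continuation_redirect.
    """
    def __init__(self, width=4):
        self.width = width     # dq_width: max concurrent blocks
        self.items = deque()   # pending continuations: (name, barrier?)
        self.running = 0       # dq_running: blocks in flight
        self.log = []

    def async_(self, name, barrier=False):
        self.items.append((name, barrier))

    def finish(self, name):
        # a previously started block finished on some worker thread
        self.running -= 1
        self.log.append(("done", name))

    def drain(self):
        while self.items:
            name, barrier = self.items[0]
            if barrier:
                if self.running > 0:
                    # barrier must wait: clean up and bail out; drain
                    # is re-entered when the running blocks finish
                    return
                # queue is idle: run the barrier inline in this thread
                self.items.popleft()
                self.log.append(("barrier", name))
            else:
                if self.running >= self.width:
                    return
                self.items.popleft()
                self.running += 1
                self.log.append(("start", name))

# Usage: A starts, the barrier B holds everything (including C) back
# until A finishes, then B runs inline and C is released.
q = ToyQueue(width=2)
q.async_("A"); q.async_("B", barrier=True); q.async_("C")
q.drain()        # A starts; B sees running > 0 and drain bails out
q.finish("A")
q.drain()        # queue idle: B runs inline, then C starts
```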
Intermission
How does your code work? Dependencies. When a barrier is added to the queue, every operation that is not yet running (even though the comment says “currently executing operation”) is added as a dependency of the barrier. This makes the barrier run after all the other operations. At the same time, any operation added to the queue afterwards takes every barrier operation already in the queue as a dependency.
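That dependency scheme can be sketched in a few lines (hypothetical Python names, not the real JSBarrierOperationQueue API; the “currently running” distinction is simplified away — here every not-yet-finished operation counts):

```python
class Op:
    """Minimal stand-in for an NSOperation with dependencies."""
    def __init__(self, name, barrier=False):
        self.name = name
        self.barrier = barrier
        self.deps = set()
        self.done = False

class DepQueue:
    """Toy model of the dependency-based barrier scheme."""
    def __init__(self):
        self.ops = []

    def add(self, op):
        if op.barrier:
            # the barrier depends on everything already enqueued
            op.deps.update(o for o in self.ops if not o.done)
        else:
            # a normal op depends on every earlier barrier
            op.deps.update(o for o in self.ops if o.barrier and not o.done)
        self.ops.append(op)

    def runnable(self):
        # ops whose dependencies have all finished
        return [o for o in self.ops
                if not o.done and all(d.done for d in o.deps)]
```

With operations A, B, a barrier X, and then C: only A and B are runnable at first; X becomes runnable once A and B finish; C only once X finishes.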
Could this work? Probably yes; at least I cannot find an obvious error, even though you lose the ability to use priorities (at least nothing can jump over a barrier set before it), and operation ready states can make the barrier wait longer than necessary. Let’s say those points are not “important” for this exercise, but the second one is kind of important for my actual problem (the operations need credentials, and the “barrier” is the authentication step that obtains the credentials).
My solution follows a similar pattern, but without using dependencies (I normally don’t like to use a feature that the user of my class might also use, without knowing it is being used internally for something else). The “barrier” suspends the queue, cancels all operations, takes the internal state of the operations and stores it, and then, in the completion block of the “barrier”, re-enqueues all the pending operations. If an operation is enqueued in the meantime, we know it cannot be ready, since the “barrier” is the one that will set it as ready. But I don’t like this implementation one bit.
Now, after seeing the code of libdispatch, an idea came to my head: the trick is to have two queues; let’s call them drainer and invoker. The drainer will be the public interface, while the invoker will be the hidden queue. When you invoke addBarrierOperation: or addOperation: on the drainer, an internal operation is created instead, which references the user operation. These internal operations must “copy” as much as possible of the state of the referenced operations: priority, readiness (KVObserving the readiness of the referenced operation). If I’m not mistaken, dependencies should be handled by the readiness observing (as long as the implementer hasn’t mis-overridden the isReady value).

The drainer will have a maxConcurrentOperationCount of 1 (a serial queue), obtaining the desired concurrency through the invoker queue (the public interface should proxy maxConcurrentOperationCount to the invoker, while keeping the drainer at 1). When one of these non-barrier internal operations executes on the drainer, it simply enqueues the referenced operation on the invoker. If the operation on the drainer is a barrier, we invoke the invoker’s waitUntilAllOperationsAreFinished from the drainer thread, and then execute the referenced barrier operation on the invoker using addOperations:@[userOperation] waitUntilFinished:YES (that way it runs on the same threads/queue as the others, in case that matters). When the barrier finishes, the rest of the operations can keep passing (briefly) through the drainer and running on the invoker.
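As a rough sanity check of the drainer/invoker idea, here is a sketch using Python executors in place of the two NSOperationQueues (all names are made up; priorities and readiness observing are omitted; a one-worker executor stands in for the serial drainer):

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

class BarrierQueue:
    """Toy sketch of the drainer/invoker design: the serial 'drainer'
    forwards work to the concurrent 'invoker'; a barrier first waits
    for everything in flight, then runs alone."""
    def __init__(self, max_concurrent=4):
        self.drainer = ThreadPoolExecutor(max_workers=1)
        self.invoker = ThreadPoolExecutor(max_workers=max_concurrent)
        self.pending = []            # futures of in-flight work
        self.lock = threading.Lock()

    def add(self, fn):
        def proxy():
            # stands in for the non-barrier internal operation:
            # just hand the real work to the invoker
            f = self.invoker.submit(fn)
            with self.lock:
                self.pending.append(f)
        self.drainer.submit(proxy)

    def add_barrier(self, fn):
        def proxy():
            # stands in for waitUntilAllOperationsAreFinished ...
            with self.lock:
                outstanding, self.pending = self.pending, []
            wait(outstanding)
            # ... then addOperations:waitUntilFinished:YES
            self.invoker.submit(fn).result()
        self.drainer.submit(proxy)

    def join(self):
        self.drainer.shutdown(wait=True)
        self.invoker.shutdown(wait=True)
```

Because the barrier proxy blocks the serial drainer until the barrier itself finishes, anything added before the barrier completes first, and anything added after it cannot start earlier, which is exactly the ordering the design calls for.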
I don’t know, it might work (even if it is a lot of hassle for such a little thing). The only thing I’m not happy about in this solution is that it breaks the +currentQueue method, but maybe swapping the roles of drainer and invoker could fix that without breaking anything else. Right now I cannot implement the proof of concept, but maybe I can find some time before the next revision, or some free weekend.
Thanks for the inspiration to look this further.
Hey Dani! Sorry that it took me so long to reply!
First of all I appreciate it so much that you took the time to write such a complete issue. When I wrote JSBarrierOperationQueue, pretty much like when I write any open-source code, I wasn't expecting to make somebody's life immensely better, but if just by reading my code someone gets an idea (or feels like digging into the problem a little bit more, like in your case), and thanks to that they learn something: I'm happy.
Also, I had no idea that libdispatch was open source! So thanks so much for pointing that out.
Could this work? Probably yes, at least I cannot find an obvious error...
I actually implemented this class doing TDD, because first I wanted to think what the use cases for it would be, then I wrote the tests, and then implemented the class to make the tests pass. If you look at the test cases, they're really simple, but I think that's all the class has to do, and they work... So to me that was good enough :)
and operation ready states can make the barrier wait longer than necessary..
I don't understand this one. Using dispatch_barrier it could also happen that the currently running blocks make the barrier block take very long to start executing, but that's not something the dispatch_barrier contract promises to prevent... But maybe I didn't understand you properly.
I normally don’t like to use a feature the user of my class might use without knowing that is being used internally for something else...
Totally agree with you here; in fact, that's in my opinion the weakest point of my implementation. But I think that's just a matter of cleanliness, and not a real issue in practice.
The solution that you suggest is at least interesting to try. My first mental approach to this problem was something similar, but simpler: basically having a separate NSMutableArray of operations, and avoiding passing them to the actual queue until they really should be scheduled. A bit hacky, since you would have to override operations and operationCount to lie. So I thought this solution, even if not ideal, was pretty clean.
Could this work? Probably yes, at least I cannot find an obvious error...
I actually implemented this class doing TDD, because first I wanted to think what the use cases for it would be, then I wrote the tests, and then implemented the class to make the tests pass. If you look at the test cases, they're really simple, but I think that's all the class has to do, and they work... So to me that was good enough :)
Not to be a party pooper, but since we are trying asynchronous code here, and your tests look really solid, you cannot be sure that the implementation is problem-free (specially if we start thinking that multithreaded environments are the target environments for the code). Anyway, I think the code will work (the tests prove that), at least a good long time before a strange concurrency/contingency problem arises… and then fun, fun, fun :P
and operation ready states can make the barrier wait longer than necessary..
I don't understand this one. Using dispatch_barrier it could also happen that the currently running blocks make the barrier block take very long to start executing, but that's not something the dispatch_barrier contract promises to prevent... But maybe I didn't understand you properly.
Ok, here we have a basic difference between dispatch queues and operation queues. The former are guaranteed to always run the blocks you send to them, and they will start them in the exact same sequence as you sent them. Operation queues are much more advanced: since you can define dependencies, priorities and ready states, the order of the operations is not guaranteed (it is not usually a FIFO, only under some circumstances).
What I mean by the ready states is the following: imagine a queue with concurrency 2, and we enqueue the operations A, B and C. All three operations have a ready method that checks an external resource (let’s imagine it is a boolean). This boolean starts as YES, so operations A and B start running in the queue. In the middle of its life, operation A detects an error, sets the external resource to NO and enqueues the barrier operation. The barrier operation, by your design, will have C as a dependency, and C cannot start because the external resource says so. The problem is that this barrier operation is the one that resets the state of the external resource and sets the flag back to YES. I think that in this case you will have a classic deadlock between C and the barrier: the barrier is waiting for C to finish, and C is waiting for the barrier to reset the flag.
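The circular wait can be made concrete with a tiny scheduler model (illustrative Python, not NSOperationQueue; A and B are already finished and are omitted):

```python
class Op:
    """Minimal stand-in for an NSOperation: a ready predicate,
    dependencies, and a body to run."""
    def __init__(self, name, body=lambda: None, ready=lambda: True, deps=()):
        self.name, self.body, self.ready = name, body, ready
        self.deps = list(deps)
        self.done = False

def run_until_stuck(ops):
    """Greedy scheduler: keep running any op that is ready and whose
    dependencies have finished; stop when no op can make progress.
    Returns the names of the ops left unfinished."""
    progressed = True
    while progressed:
        progressed = False
        for op in ops:
            if (not op.done and op.ready()
                    and all(d.done for d in op.deps)):
                op.body()
                op.done = True
                progressed = True
    return [op.name for op in ops if not op.done]

# The scenario above: A has already flipped the shared flag to NO and
# enqueued the barrier. The barrier depends on C, while C's readiness
# depends on the flag that only the barrier's body would reset.
flag = {"ready": False}
c = Op("C", ready=lambda: flag["ready"])
barrier = Op("barrier", body=lambda: flag.update(ready=True), deps=[c])
stuck = run_until_stuck([c, barrier])   # neither can run: deadlock
```

Without the dependency edge from the barrier to C, the barrier runs first, resets the flag, and C follows; with the edge, the scheduler stalls with both unfinished.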
The solution that you suggest is at least interesting to try. My first mental approach to this problem was something similar, but simpler: basically having a separate NSMutableArray of operations, and avoiding passing them to the actual queue until they really should be scheduled. A bit hacky, since you would have to override operations and operationCount to lie. So I thought this solution, even if not ideal, was pretty clean.
If you synchronize the access to this array, the idea should be similar to the queue, yes. I chose the queue because it gives me all the concurrency guarantees I needed, instead of me having to remember to synchronize the code (or write myself an atomic linked list, which I don’t want to do either).
Not to be a party pooper, but since we are trying asynchronous code here, and your tests look really solid, you cannot be sure that the implementation is problem-free (specially if we start thinking that multithreaded environments are the target environments for the code). Anyway, I think the code will work (the tests prove that), at least a good long time before a strange concurrency/contingency problem arises… and then fun, fun, fun :P
So you mean that I'm doing things correctly in my tests, but if you do something wrong, it could work wrong? Sounds like it wouldn't be JSBarrierOperationQueue :P
What I mean by the ready states is the following: imagine a queue with concurrency 2, and we enqueue the operations A, B and C. All three operations have a ready method that checks an external resource (let’s imagine it is a boolean). This boolean starts as YES, so operations A and B start running in the queue. In the middle of its life, operation A detects an error, sets the external resource to NO and enqueues the barrier operation. The barrier operation, by your design, will have C as a dependency, and C cannot start because the external resource says so. The problem is that this barrier operation is the one that resets the state of the external resource and sets the flag back to YES. I think that in this case you will have a classic deadlock between C and the barrier: the barrier is waiting for C to finish, and C is waiting for the barrier to reset the flag.
I think this simply means that you should be aware of what happens when you use JSBarrierOperationQueue, to understand what could cause a deadlock. Fair enough.
Not to be a party pooper, but since we are trying asynchronous code here, and your tests look really solid, you cannot be sure that the implementation is problem-free (specially if we start thinking that multithreaded environments are the target environments for the code). Anyway, I think the code will work (the tests prove that), at least a good long time before a strange concurrency/contingency problem arises… and then fun, fun, fun :P
So you mean that I'm doing things correctly in my tests, but if you do something wrong, it could work wrong? Sounds like it wouldn't be JSBarrierOperationQueue :P
Sorry, that paragraph was messed up in a rewrite. What I wanted to say was “we are trying asynchronous code here, and even if your tests look really solid, you cannot be sure that the implementation is problem-free”. This is inherent to concurrent programming; there is nothing wrong with it. It’s like the old Knuth quote, but in reverse: “Beware of bugs in the above code; I have only proved it correct, not tried it.” Yours is tested (beautifully, I must add), but you haven’t proved it right.
What I mean by the ready states is the following: imagine a queue with concurrency 2, and we enqueue the operations A, B and C. All three operations have a ready method that checks an external resource (let’s imagine it is a boolean). This boolean starts as YES, so operations A and B start running in the queue. In the middle of its life, operation A detects an error, sets the external resource to NO and enqueues the barrier operation. The barrier operation, by your design, will have C as a dependency, and C cannot start because the external resource says so. The problem is that this barrier operation is the one that resets the state of the external resource and sets the flag back to YES. I think that in this case you will have a classic deadlock between C and the barrier: the barrier is waiting for C to finish, and C is waiting for the barrier to reset the flag.
I think this simply means that you should be aware of what happens when you use JSBarrierOperationQueue, to understand what could cause a deadlock. Fair enough.
Yes, of course… but nobody reads documentation until it’s too late :D
As I have just said, when I have some time, I will try to re-implement it using the ideas from libdispatch, re-apply your tests, and see what I can come up with.
Yours is tested (beautifully, I must add)
Thanks :D
but you haven’t proved it right.
And you're right, there could be edge cases. The TDD philosophy would prohibit you from writing a new implementation until you can come up with a test that fails under my implementation, though :D I know this is asynchronous code, and tests can't always ensure something is working well all the time. But I think you could make it fail by trying to do something not so standard :)
Yes, of course… but nobody reads documentation until it’s too late :D
Yeah :P Maybe I should add a paragraph on the README warning about the things that could go wrong.
As I have just said, when I have some time, I will try to re-implement it using the ideas from libdispatch, re-apply your tests, and see what I can come up with.
I'd love to see that :)