Fix race conditions with commitsInProgress counter
Hi @belugabehr . Can you provide some context? Perhaps a description of the nature of the bug being fixed? This will help for historical reference, as well as provide context to aid reviewers. Thanks.
@ctubbsii Yes. Tablet appears to be a 'hot' lock. I am actually looking at that class but it's a bit of a beast. Anyway we can remove these external Tablet locks will make it easier later.
This class is a good candidate to improve:
- I believe the lock on
Tabletis superfluous - Update to use Java Concurrency classes (as an example for the rest of the code base)
- I do not trust this code because
commitsInProgressis accessed by different threads but is notvolatile. In particular, it may be the case thatwaitForCommitsToFinishloops forever because the thread has cached the value ofcommitsInProgress decrementCommitsInProgressandincrementCommitsInProgressare not synchronized so incrementing and decrementingcommitsInProgressseems suspect
I say that the lock on Tablet is superfluous here because the Tablet is being notifyAll() when commitsInProgress is zero. The thing is that the only code that cares about this notifyAll() is in waitForCommitsToFinish. The commitsInProgress value is private to this class and only accessed in the local method waitForCommitsToFinish. So, all the threads locking on Tablet will be awoken here, but none of them have any way to respond to this event, except for waitForCommitsToFinish, so why bother waking up every thread? Just wake up the threads waiting on commitsInProgress to be zero.
The code is very complicated for Tablet, but the conceptual model is very simple. There is a single lock that protects most (there are some exceptions) of a tablets critical in memory data. A single lock is very easy to reason about. Multiple locks are hard to reason about. Adding multiple locks piecemeal w/o looking at the bigger picture is not a good idea IMO, I think a complete redesign for Tablet concurrency would be needed.
This code currently expects the Tablet lock to be held when the method is called. Currently commit sessions are rotated on minor compaction. The single tablet lock makes this pointer manipulation and access to an individual commit session consistent. Without the single lock one would need to carefully ensure this consistency is maintained. If that consistency is not maintained it would lead to many bad things.
Looking into this a bit I think a nice solution for the current code would be to add the following to methods that expect the Tablet lock to be held. Adding these checks nicely documents the expectation and causes the code to fail at runtime if the expectation is not met.
Preconditions.checkState(Thread.holdsLock(tablet);
For this particular code it would be :
Preconditions.checkState(Thread.holdsLock(committer));
Update to use Java Concurrency classes (as an example for the rest of the code base)
I think this could be a nice solution long term solution. Create a Java lock object in tablet and pass it around, making things very explicit in the code, making the code easier to understand. Java's reentrant lock has the method isHeldByCurrentThread() that could be used for precondition checks. Still having a single lock makes reasoning about concurrency correctness easier.
@keith-turner Thanks for the input.
So, the downside with your proposed solution is that the locking becomes very granular. That is to say, if a method needs to be called with the lock in hand, then any initialization of the method happens withing the context of the lock. If the lock is internal to each class, locks can be acquired as-needed.
Also, typically, I use isHeldByCurrentThread() in private scoped methods to ensure the lock is not lost with a future change.
You can check out my work on the matter here:
https://github.com/belugabehr/accumulo/tree/TabletLock
So, the downside with your proposed solution is that the locking becomes very granular. That is to say, if a method needs to be called with the lock in hand, then any initialization of the method happens withing the context of the lock. If the lock is internal to each class, locks can be acquired as needed.
Another possible path would be to move all of tablets in memory state (list of files, active scans, list of bulk imported files, list of write ahead logs, compaction state, etc) into a single class with methods for mutating the state. All concurrency control over this state would be in this single class and hopefully this makes it clear when multiple variables need to be mutated atomically. Having the data and concurrency protections for the data all in one place would make it much easier to reason about. The class would contain no code for acting on the data (like the code for actually scanning a tablet would not be in this class).
For this particular PR it's still not clear to me why CommitSession.java needs to lock (block) the entire tablet to keep this internal count.
For this particular PR it's still not clear to me why CommitSession.java needs to lock (block) the entire tablet to keep this internal count.
Its not that the tablet is locked just to mutate this counter. The tablet is locked because it is reading/writing multiple variables that need to be considered together atomically (for example it needs to get a pointer to the correct commit session and ensure the tablet state is not closed). There are two places in the code where the lock is acquired and the count is mutated along with reading/writing other tablet state. Below is link to those two places.
https://github.com/apache/accumulo/blob/40d954f22b22ff437b938c6bf3607298b30e1088/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java#L1179
https://github.com/apache/accumulo/blob/40d954f22b22ff437b938c6bf3607298b30e1088/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java#L1301
Even if this code has its own lock, it would still be mutated within the tablet lock.