
Perform lease recovery for wasb filesystem

Open billierinaldi opened this issue 6 years ago • 8 comments

This patch passes a basic test of writing 1 k/v, killing accumulo, and restarting accumulo. I am not sure if it is correctly handling the case where acquireLease throws an exception.

billierinaldi avatar Dec 04 '19 17:12 billierinaldi

One thing I'm not 100% sure about is the relationship of semantics from AzureBlobStore and HDFS. They both have these things we call "leases", but are their semantics the same?

I can see that for HBase, lease recovery is only done for directories set up for "atomic rename" (aka configured with Page Blobs) https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/AzureNativeFileSystemStore.java#L445-L459. However, this seems to be just an implementation detail for HBase, not a sign that lease recovery requires Page Blobs. I base this on https://docs.microsoft.com/en-us/java/api/com.microsoft.azure.storage.blob._cloud_blob.acquirelease?view=azure-java-legacy not saying anything about page/block blobs.

From the presentation I put together the other week, I don't recall Accumulo having any WAL renames that would need to be atomic (in contrast to HBase which moves the WAL once a RS starts working on it).

The above aside: we do want to make sure that fencing the WALs still works for Accumulo, to prevent zombie'd Tservers from causing a ruckus. One thing I haven't been able to figure out is if the following scenario will work as we want it to:

  • Tserver is in half-dead state, not talking to Master, but is able to keep a lease with ABFS (renewed every 40s by default)
  • Master whacks that ZNode for the tserver to try to kill it (normally, a max of 60s, but maybe longer until we notice if things are really messed up), and starts reassigning things
  • Tserver doesn't yet observe ZK change, and is still renewing the lease with ABFS
  • A new TServer calls acquireLease() on this WAL which is still being renewed by the half-dead Tserver.

I can't figure out from docs/code what the expected outcome of this action is. Does it work like HDFS works (the TServer making the acquireLease() call "overriding" the old lease that the zombie tserver is holding)? Maybe you can find someone on the storage team at Azure to ask about the semantics of https://docs.microsoft.com/en-us/java/api/com.microsoft.azure.storage.blob._cloud_blob.acquirelease?view=azure-java-legacy. If not, maybe you can just write a test which simulates the above happening (making sure that an old client that once held the lease can no longer append to a file after another client has called acquireLease).
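To make the property such a test would check concrete, here is a minimal in-memory sketch. `FencedLogSketch` is a hypothetical toy, not an Azure or Hadoop API, and it simply assumes the HDFS-like "override" semantics that the question above is asking about; a real test would replace it with calls against the actual filesystem.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Toy log with assumed HDFS-style fencing: a new lease invalidates the old one. */
class FencedLogSketch {
  private String currentLeaseId; // null until someone acquires a lease
  private final List<String> entries = new ArrayList<>();

  /** Assumed "override" semantics: acquiring always replaces the existing lease. */
  synchronized String acquireLease() {
    currentLeaseId = UUID.randomUUID().toString();
    return currentLeaseId;
  }

  /** An append succeeds only for the holder of the current lease. */
  synchronized void append(String leaseId, String entry) throws IOException {
    if (!leaseId.equals(currentLeaseId)) {
      throw new IOException("lease " + leaseId + " is no longer valid");
    }
    entries.add(entry);
  }

  synchronized int size() {
    return entries.size();
  }

  public static void main(String[] args) throws IOException {
    FencedLogSketch wal = new FencedLogSketch();
    String zombieLease = wal.acquireLease();   // half-dead tserver holds the lease
    wal.append(zombieLease, "mutation-1");
    String recoveryLease = wal.acquireLease(); // new tserver starts recovery
    try {
      wal.append(zombieLease, "mutation-2");   // must fail if fencing works
      System.out.println("NOT fenced: zombie write succeeded");
    } catch (IOException e) {
      System.out.println("fenced: zombie write rejected");
    }
    wal.append(recoveryLease, "recovery-marker");
  }
}
```

If the real store does not reject the zombie's second append, the fencing assumption does not hold for that filesystem.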

Assuming the semantics for acquireLease in ABFS are the same as recoverLease in HDFS, I think your change is fine. I'm lamenting the "ugliness" of the if/elseif/else conditional block in LogCloser, but it's not the end of the world.

joshelser avatar Dec 04 '19 19:12 joshelser

This change may require the hadoop-azure jar on the runtime classpath even when not using Azure. The code ns instanceof NativeAzureFileSystem in HadoopLogCloser may cause that class to be searched for on the classpath.

Maybe the following will avoid it.

  • Mark dep optional in pom
  • Create a new class called AzureLogCloser
  • Document how user can configure Accumulo to use the log closer.

With this approach the class AzureLogCloser will never be loaded unless configured.
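A rough sketch of why this avoids the classpath problem, assuming the closer is instantiated by name from configuration. `WalCloser`, `DefaultWalCloser`, and `WalCloserLoader` are hypothetical, simplified stand-ins and not Accumulo's actual interfaces; the point is only that a class named in configuration is resolved when `Class.forName` runs, so an Azure-specific closer (and its hadoop-azure imports) is never touched unless someone configures it.

```java
import java.io.IOException;

/** Hypothetical, simplified stand-in for a log closer contract. */
interface WalCloser {
  /** Assumed convention for this sketch: returns 0 when the log is closed. */
  long close(String walPath) throws IOException;
}

/** Default closer that has no reference to any Azure class. */
class DefaultWalCloser implements WalCloser {
  @Override
  public long close(String walPath) {
    return 0L;
  }
}

class WalCloserLoader {
  /** Loads the closer named in configuration; other closers are never resolved. */
  static WalCloser load(String className) throws ReflectiveOperationException {
    return Class.forName(className)
        .asSubclass(WalCloser.class)
        .getDeclaredConstructor()
        .newInstance();
  }

  public static void main(String[] args) throws Exception {
    // In a real deployment the name would come from a configuration property.
    WalCloser closer = load(DefaultWalCloser.class.getName());
    System.out.println(closer.getClass().getSimpleName() + " returned " + closer.close("wal-1"));
  }
}
```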

keith-turner avatar Dec 04 '19 20:12 keith-turner

Thanks for the reviews, everyone. I'll make these changes and find out if anything else needs to be done for lease recovery in the wasb case.

billierinaldi avatar Dec 04 '19 23:12 billierinaldi

I did some additional testing and found that I could acquire a lease on a walog manually while the walog was active with the tserver. When the tserver tried to write to the walog while I had the lease, the write failed and the tserver started a new walog. But if I refrained from writing while I held the lease, the tserver was able to write to the walog I had leased after the lease expired.

This seems to mean the tserver is not acquiring a lease when writing to a walog (which might be good, because the wasb FS implementation does not provide a way to break an existing lease). I am wondering if we need to hold the lease longer after it is obtained in the LogCloser, to prevent a half-dead tserver from continuing to write to the walog, or if there would be a better way to handle this situation.
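The behavior I observed can be captured in a small toy model. This is only a sketch of what the manual test showed, not the Azure API: `ExpiringLeaseSketch` and its methods are hypothetical, time is passed in explicitly, and the 60-second duration is an arbitrary example value. It assumes a writer that holds no lease is blocked only while someone else's unexpired lease exists.

```java
import java.io.IOException;

/** Toy model: a time-limited lease blocks unleased writers only until it expires. */
class ExpiringLeaseSketch {
  private String leaseId;      // null when the blob has never been leased
  private long leaseExpiresAt; // logical clock, in milliseconds
  private int entries;

  private boolean leased(long now) {
    return leaseId != null && now < leaseExpiresAt;
  }

  /** Acquire for a fixed duration; assumed to fail while another lease is live. */
  synchronized void acquire(String id, long now, long durationMs) throws IOException {
    if (leased(now)) {
      throw new IOException("lease already held");
    }
    leaseId = id;
    leaseExpiresAt = now + durationMs;
  }

  /** A writer with no lease of its own, like the tserver in the manual test. */
  synchronized void appendWithoutLease(long now) throws IOException {
    if (leased(now)) {
      throw new IOException("blob is leased by " + leaseId);
    }
    entries++;
  }

  synchronized int size() {
    return entries;
  }

  public static void main(String[] args) throws IOException {
    ExpiringLeaseSketch wal = new ExpiringLeaseSketch();
    wal.acquire("log-closer", 0L, 60_000L);
    try {
      wal.appendWithoutLease(1_000L); // rejected: lease is still live
    } catch (IOException e) {
      System.out.println("write rejected while leased");
    }
    wal.appendWithoutLease(60_000L);  // accepted: lease has expired
    System.out.println("entries after expiry: " + wal.size());
  }
}
```

In this model a half-dead tserver that happens not to write during the lease window is never fenced, which is why holding or renewing the lease for longer comes up as an option.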

billierinaldi avatar Dec 05 '19 15:12 billierinaldi

One possible course of action is that when a tserver gets an error writing to a walog, it does a strong check (not using ZooCache) against ZooKeeper to ensure it still holds its lock. If we could detect lease loss on the tserver, the check could be done then.
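Roughly, the idea could look like the sketch below. All names here are hypothetical: `WalWriteGuard` is not existing Accumulo code, `lockHeldUncached` stands in for a direct ZooKeeper read that bypasses ZooCache, and `halt` stands in for whatever stops the tserver.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

/** Sketch: on a WAL write error, verify the ZooKeeper lock before carrying on. */
class WalWriteGuard {
  /** Hypothetical minimal writer abstraction. */
  interface WalWriter {
    void write(String entry) throws IOException;
  }

  private final BooleanSupplier lockHeldUncached; // assumed: direct ZK read, no cache
  private final Runnable halt;                    // assumed: stops the tserver

  WalWriteGuard(BooleanSupplier lockHeldUncached, Runnable halt) {
    this.lockHeldUncached = lockHeldUncached;
    this.halt = halt;
  }

  void write(WalWriter writer, String entry) throws IOException {
    try {
      writer.write(entry);
    } catch (IOException e) {
      // A failed write may mean another process took the lease for recovery.
      if (!lockHeldUncached.getAsBoolean()) {
        halt.run(); // lock is gone: this server is a zombie and must stop
      }
      throw e; // lock still held: the caller may roll over to a new walog
    }
  }
}
```

The strong check only runs on the error path, so the cost of an uncached ZooKeeper read is not paid on every write.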

keith-turner avatar Dec 05 '19 16:12 keith-turner

I will likely abandon this particular change until one of the file system drivers supports lease operations. However, I am still thinking about how we could simulate a zombie tserver situation to validate that a filesystem (and our use of that filesystem) is behaving as we expect upon lease recovery. I could implement an AzureLogCloser that does nothing when close is called, and this would probably not cause issues for recovery on a system with tservers that die cleanly. Does anyone have ideas for things we could do to test out tserver failure scenarios?

billierinaldi avatar Jan 09 '20 17:01 billierinaldi

> Does anyone have ideas for things we could do to test out tserver failure scenarios?

For testing, it would be nice to be able to make a tserver not halt when it loses its lock in ZK. The master would think it's dead, but it would keep running. I can think of three ways to accomplish this.

First, we could just make it configurable but this feels wrong.

Second, we could write some code in accumulo-testing that uses Byte Buddy to create a Java agent which mutates the behavior of the tablet server code. I did something like this for nanoop. Maybe there is an easier way to mutate the behavior of compiled Java code. This javaagent could be built in accumulo-testing and configured in accumulo-env.sh for a test. This option is very complicated, but it allows testing of released binaries without having to rebuild.

Third, we could maintain a patch somewhere that changes the behavior of Accumulo on lock loss. For a test, you would build Accumulo with this patch. I think this option is the most straightforward. For testing released versions, we would need to get the source for that release and rebuild.

keith-turner avatar Jan 09 '20 18:01 keith-turner

Thanks for the good suggestions, @keith-turner. I have successfully written some code that uses bytebuddy to prevent the tserver from halting when I manually delete its ZooKeeper lock. Now I am working on designing a test that will highlight the lease recovery behavior.

billierinaldi avatar Jan 24 '20 15:01 billierinaldi