Tserver gets stuck if context classloader fails on first try

Open ivakegg opened this issue 1 year ago • 12 comments

accumulo 2.1.x:

If a context cannot be loaded when creating the compaction dispatcher (e.g. vfs class loader failed to retrieve jars from hdfs), then it will never be reattempted and you continuously get the message at CompactableImpl:1487 (Failed to dispatch compaction...). The only way to recover is to bounce the tserver. Potentially changing the context on the table may relieve the issue but I have not tested that.

ivakegg avatar Jan 24 '24 16:01 ivakegg

I marked this as a blocker for now. Depending on the release schedule for 2.1.3 we may want to push this.

EdColeman avatar Jan 24 '24 20:01 EdColeman

@ivakegg - are you setting Property.TABLE_COMPACTION_DISPATCHER, or using the default value?

dlmarion avatar Jan 26 '24 12:01 dlmarion

Looking at the code, it appears that TableConfiguration.createCompactionDispatcher will return null when the class fails to load, and so null is stored in the Deriver. When CompactableImpl.getConfiguredService is called, it returns the default service. There are several places in CompactableImpl where nothing is done if the configured service that is returned from getConfiguredService does not match the expected value. I'm assuming that you are running into this.

Looking at the javadoc for creating a new Deriver, it says that it is sensitive to configuration changes. I would try changing the context name to a new valid context and see if that resolves the issue.
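To make the failure mode described above concrete, here is a minimal sketch of a derived value that is recomputed only when a configuration update count changes. `CachedDerived` and `StuckDispatcherDemo` are hypothetical names for illustration and are not Accumulo's actual Deriver code; the point is only that a null produced by a failed load is cached like any other value until the configuration changes.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical stand-in for a configuration-derived cache; not Accumulo's Deriver. */
class CachedDerived<T> {
  private final Supplier<Long> updateCount; // assumed to increase on every config change
  private final Supplier<T> factory;        // may return null when loading fails
  private long seenCount = -1;
  private T cached;

  CachedDerived(Supplier<Long> updateCount, Supplier<T> factory) {
    this.updateCount = updateCount;
    this.factory = factory;
  }

  synchronized T derive() {
    long current = updateCount.get();
    if (current != seenCount) {
      cached = factory.get(); // a null result is stored just like a real value
      seenCount = current;
    }
    return cached;
  }
}

class StuckDispatcherDemo {
  public static void main(String[] args) {
    AtomicLong configVersion = new AtomicLong(0);
    int[] attempts = {0};
    // Simulated factory: the first load fails (null), later loads succeed.
    CachedDerived<String> dispatcher = new CachedDerived<>(configVersion::get,
        () -> attempts[0]++ == 0 ? null : "dispatcher");

    System.out.println(dispatcher.derive()); // null: the first load failed
    System.out.println(dispatcher.derive()); // null: no retry without a config change
    configVersion.incrementAndGet();         // simulate a property update
    System.out.println(dispatcher.derive()); // dispatcher
  }
}
```

In this sketch, only a bump of the update count triggers another load attempt, which is why a configuration change is a plausible way to get unstuck.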

dlmarion avatar Jan 26 '24 13:01 dlmarion

The problem is that the class context is valid, but because of a thundering herd scenario, some of the tservers fail to load the context initially. Subsequently, those tservers get stuck. Simply changing the context to another one will result in the same scenario. I need something here that retries loading the class via the context classloader and subsequently the compaction dispatcher.

ivakegg avatar Jan 26 '24 13:01 ivakegg

I do not believe we are setting the dispatcher property. If we are, then we are setting it to the default value.

ivakegg avatar Jan 26 '24 13:01 ivakegg

If I set the property via the shell or via the property editor to the same value, does that trigger the Deriver to reload?

ivakegg avatar Jan 26 '24 14:01 ivakegg

> If I set the property via the shell or via the property editor to the same value, does that trigger the Deriver to reload?

That I am not sure of. @keith-turner or @EdColeman might be able to answer that. IIRC, the Deriver functionality is new in 2.1 with the rewrite of the configuration implementation.

> The problem is that the class context is valid, but because of a thundering herd scenario

I assume that you have set the replication on the HDFS classpath artifacts sufficiently high. Have you tried other mechanisms that VFS supports (e.g. http) for distributing the jars?

> I need something here that retries loading the class via the context classloader and subsequently the compaction dispatcher.

I think the VFS classloader should retry. Presumably the VFS classloader is getting an exception from HDFS, so it should retry in that case. If we were to do something in Accumulo, then we would have to do it in a lot of places. Additionally, the contract for a ClassLoader is that you call it once and you either get a Class or an Exception.
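Wherever the retry ends up living, the idea can be sketched as a small wrapper around the load call. This is a hedged illustration only: `Retry.withBackoff`, its parameters, and the demo class are hypothetical and are not part of Accumulo or commons-vfs. The random jitter is an assumption added here to spread out retries from many servers at once.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; names and defaults are illustrative only. */
class Retry {
  /** Calls the action up to maxAttempts times, doubling the delay after each failure. */
  static <T> T withBackoff(Callable<T> action, int maxAttempts, long initialDelayMs)
      throws Exception {
    if (maxAttempts < 1 || initialDelayMs < 0) {
      throw new IllegalArgumentException("maxAttempts must be >= 1 and delay >= 0");
    }
    long delay = initialDelayMs;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          // Jitter keeps many servers from retrying at the same instant.
          Thread.sleep(delay + ThreadLocalRandom.current().nextLong(delay + 1));
          delay *= 2;
        }
      }
    }
    throw last; // every attempt failed, so surface the most recent exception
  }
}

class RetryDemo {
  public static void main(String[] args) throws Exception {
    ClassLoader loader = RetryDemo.class.getClassLoader();
    // A JDK class stands in for the dispatcher class that a context classloader would load.
    Class<?> clazz = Retry.withBackoff(() -> loader.loadClass("java.util.ArrayList"), 3, 100);
    System.out.println(clazz.getName());
  }
}
```

The caller still sees the usual contract: either a Class comes back or an exception is thrown, just after a bounded number of attempts.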

dlmarion avatar Jan 26 '24 15:01 dlmarion

Any change in ZooKeeper, whether made through the shell or the prop editor, will trigger a property reload. It is independent of the method used to change the values. There is also a background thread that sweeps through and will force an update if a watch notification was missed. The sweep is printed in the logs, in case you think watches are being missed.

EdColeman avatar Jan 26 '24 15:01 EdColeman

I just confirmed that in a running instance. This might provide the workaround I am looking for. Next time I have a chance, I will reproduce the situation and then try this. I would still like an attempt to recreate the dispatcher if it is null, but a known workaround is always good.
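As a rough sketch of what "recreate the dispatcher if it is null" could look like, the holder below caches only successful results, so a failed creation is attempted again on the next call. `RetryingLazy` is a hypothetical class for illustration, not existing Accumulo code; a real change would probably also need to limit how often the retry happens and still react to configuration changes.

```java
import java.util.function.Supplier;

/** Hypothetical holder that never caches a failed (null) result. */
class RetryingLazy<T> {
  private final Supplier<T> factory;
  private T cached;

  RetryingLazy(Supplier<T> factory) {
    this.factory = factory;
  }

  synchronized T get() {
    if (cached == null) {
      cached = factory.get(); // stays null on failure, so the next call tries again
    }
    return cached;
  }
}

class RetryingLazyDemo {
  public static void main(String[] args) {
    int[] attempts = {0};
    // Simulated factory: the first creation fails (null), the second succeeds.
    RetryingLazy<String> dispatcher =
        new RetryingLazy<>(() -> attempts[0]++ == 0 ? null : "dispatcher");

    System.out.println(dispatcher.get()); // null: the first creation failed
    System.out.println(dispatcher.get()); // dispatcher: recreated on the next call
  }
}
```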

ivakegg avatar Jan 26 '24 15:01 ivakegg

@ivakegg are you okay if this is removed as a blocker for 2.1.3 then? We can keep the issue, but depending on the release schedule and any changes, the changes may not make 2.1.3. Are you okay with that?

EdColeman avatar Jan 26 '24 15:01 EdColeman

yes I am ok with that.

ivakegg avatar Jan 26 '24 15:01 ivakegg

@ivakegg Is this still an issue, or is that workaround fine?

ctubbsii avatar Jul 26 '24 16:07 ctubbsii