WebDAV door talking to offline RemoteTransferManager
Hi dCache.org,
on August 2nd we updated all our dCache SEs to 7.2.20, so we can monitor the active HTTP TPCs - 👍 for that, btw! That version also enabled redundancy for the RemoteTransferManagers, according to the release notes*. So we set up two new domains, each with a RemoteTransferManager cell. Accidentally, though, the (now legacy, single) RemoteTransferManager we had before the update was also running for a couple of minutes (thus a total of three distinct RTM cells). I shut down that third RTM as soon as I noticed the mishap. The WebDAV doors, however, were still trying to talk to it days after that:
05 Aug 2022 08:55:13 (WebDAV-f01-210-127) [door:WebDAV-f01-210-127@webdav-f01-210-127Domain:AAXlQtAU7aA] Failed to fetch information for progress marker: Request to [>RemoteTransferManager@remoteTransferManagerDomain] timed out.
05 Aug 2022 08:55:13 (WebDAV-f01-210-127) [door:WebDAV-f01-210-127@webdav-f01-210-127Domain:AAXlQtAU7aA] Failed to cancel transfer id=1659450005198000: CacheException(rc=10006;msg=Request to [>RemoteTransferManager@remoteTransferManagerDomain] timed
Only a restart of the WebDAV doors fixed this (and immediately so). Maybe this is explained by the previous guarantee that there is never more than one RemoteTransferManager, so the WebDAV door never forgets about it? The client (ATLAS in this case) had not complained about any errors. I only noticed the problem when our log files for the WebDAV doors grew larger than 2 GB.
Maybe this isn't such a big issue after all. The error reads like the door failed to query the RTM. But what if the door just did a broadcast to all RTMs and this one simply did not reply? The transfermanagers properties file does document the transfermanagers.cell.consume property, which is set to transfermanagers.cell.name. We still have two RTMs whose cell names are unaltered, which explains why things still work. It does not explain the error in the WebDAV log file, though.
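For illustration, here is a minimal sketch of how such a redundant RTM setup could look in a layout file; the domain and cell names are placeholders (not our actual ones) and it assumes the standard transfermanagers service:

# Sketch only: two domains, each hosting its own transfermanagers service.
# Domain and cell names are invented for illustration.
[rtm-aDomain]
[rtm-aDomain/transfermanagers]
transfermanagers.cell.name = RemoteTransferManager-a

[rtm-bDomain]
[rtm-bDomain/transfermanagers]
transfermanagers.cell.name = RemoteTransferManager-b

# transfermanagers.cell.consume defaults to ${transfermanagers.cell.name},
# so each cell consumes the messages addressed to its own cell name.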
Maybe you want to take a look into this?
Best regards, Xavier.
PS: I doubt this is related, but the WebDAV door also occasionally logged this NPE while we had three RTMs online:
04 Aug 2022 11:52:17 (WebDAV-f01-210-127) [httpdTransferObserver] Command failed due to a bug, please contact [email protected].
dmg.util.CommandPanicException: (1) Command failed: java.lang.NullPointerException
    at dmg.util.command.AcCommandExecutor.execute(AcCommandExecutor.java:161)
    at dmg.cells.nucleus.CellAdapter.executeCommand(CellAdapter.java:209)
    at org.dcache.cells.UniversalSpringCell.executeCommand(UniversalSpringCell.java:189)
    at dmg.cells.nucleus.CellAdapter$1.doExecute(CellAdapter.java:90)
    at org.dcache.util.cli.CommandInterpreter.command(CommandInterpreter.java:117)
    at dmg.cells.nucleus.CellAdapter$1.command(CellAdapter.java:106)
    at dmg.cells.nucleus.CellAdapter.command(CellAdapter.java:195)
    at dmg.cells.nucleus.CellAdapter.executeLocalCommand(CellAdapter.java:882)
    at dmg.cells.nucleus.CellAdapter.messageArrived(CellAdapter.java:812)
    at dmg.cells.nucleus.CellNucleus$DeliverMessageTask.run(CellNucleus.java:1274)
    at org.dcache.util.BoundedExecutor$Worker.run(BoundedExecutor.java:247)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$2(CellNucleus.java:727)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException: null
*) Note that transfermanagers.cell.replicable is still set to false in the default transfermanagers properties.
Thanks for reporting the issue.
As a side note: from the next bugfix release versions onward, transfermanagers.cell.replicable is set to true in the default transfermanagers properties.
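For reference, a sketch of the relevant defaults in transfermanagers.properties as discussed in this thread; only the properties mentioned here are shown, and the exact file content may differ between releases:

# transfermanagers.properties (illustrative excerpt)
transfermanagers.cell.name = RemoteTransferManager
transfermanagers.cell.consume = ${transfermanagers.cell.name}
# still false in the release discussed above; true from the next bugfix releases onward
transfermanagers.cell.replicable = true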
Hi @XMol,
Thanks for reporting this.
The log entries mentioned RemoteTransferManager@remoteTransferManagerDomain, so it looks like the WebDAV door is trying to contact a specific RTM cell, rather than the generic topic. This only happens if the WebDAV door contacts an updated RTM. So, could RemoteTransferManager@remoteTransferManagerDomain be the name of one of the (updated) RTM cells?
It's also surprising that this retrying continued for days. The periodic retrying should continue while there is a client listening for progress reports. The client can cancel a transfer by disconnecting, at which point the WebDAV door should stop requesting progress reports. This could indicate a problem; however, I think we would need to collect more information about what exactly happened before being able to fix it.
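For context, the "progress reports" mentioned above are the performance markers that the WebDAV door periodically writes into the HTTP-TPC COPY response while a transfer is active; a simplified, illustrative marker (values invented, exact formatting may differ) looks like this:

Perf Marker
    Timestamp: 1659687313
    Stripe Index: 0
    Stripe Bytes Transferred: 49397760
    Total Stripe Count: 1
End

The byte count behind each marker is what the door requests from the RemoteTransferManager, which matches the "Failed to fetch information for progress marker" entries in your log.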
Finally, the NPE stack-trace indicates a problem with an admin command. Unfortunately, the useful information (describing where the bug is) is missing. Normally, this would appear as an additional stack-trace after the line Caused by: java.lang.NullPointerException: null.
Logback will suppress stack-traces if they happen too often. In general, this is a "good thing", but it can make fixing bugs a pain. You may find that an earlier log entry (perhaps just after the WebDAV door was last started) contains the full stack-trace.
Hi @paulmillar,
thank you for coming back to this.
You may find that an earlier log entry (perhaps just after the WebDAV door was last started) contains the full stack-trace.
Here is where I shut down the WebDAV door and the first messages after it was started again:
02 Aug 2022 13:00:04 (System) [] Unexpected exception during shutdown.
java.util.NoSuchElementException: No such cell: c-core-f01-210-115-eDomain-AAXh9TPMO4g
    at dmg.cells.nucleus.CellGlue.kill(CellGlue.java:249)
    at dmg.cells.nucleus.CellNucleus.kill(CellNucleus.java:619)
    at java.base/java.util.stream.Collectors.lambda$uniqKeysMapAccumulator$1(Collectors.java:178)
    at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at dmg.cells.nucleus.SystemCell.shutdownCells(SystemCell.java:192)
    at dmg.cells.nucleus.SystemCell.shutdownSystem(SystemCell.java:164)
    at dmg.cells.nucleus.SystemCell.stopped(SystemCell.java:103)
    at dmg.cells.nucleus.CellAdapter.postRemoval(CellAdapter.java:722)
    at dmg.cells.nucleus.CellNucleus.shutdown(CellNucleus.java:1062)
    at dmg.cells.nucleus.CellGlue.lambda$doKill$3(CellGlue.java:423)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
02 Aug 2022 13:23:44 (System) [] Invalid config event received: {server.2=zk-server-3:2888:3888:participant, server.1=zk-server-2:2888:3888:participant, server.0=zk-server-1:2888:3888:participant, version=0}
02 Aug 2022 13:23:44 (System) [] Invalid config event received: {server.2=zk-server-3:2888:3888:participant, server.1=zk-server-2:2888:3888:participant, server.0=zk-server-1:2888:3888:participant, version=0}
02 Aug 2022 13:23:48 (WebDAV-f01-210-127) [] Weak cipher suite TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA enabled for @5f62b611[provider=null,keyStore=null,trustStore=null]
02 Aug 2022 13:23:48 (WebDAV-f01-210-127) [] Weak cipher suite TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA enabled for @5f62b611[provider=null,keyStore=null,trustStore=null]
02 Aug 2022 13:23:48 (WebDAV-f01-210-127) [] Weak cipher suite TLS_DHE_RSA_WITH_AES_256_CBC_SHA enabled for @5f62b611[provider=null,keyStore=null,trustStore=null]
02 Aug 2022 13:23:48 (WebDAV-f01-210-127) [] Weak cipher suite TLS_DHE_RSA_WITH_AES_128_CBC_SHA enabled for @5f62b611[provider=null,keyStore=null,trustStore=null]
02 Aug 2022 13:23:48 (WebDAV-f01-210-127) [NIC auto-discovery] Adding [fe80::250:56ff:feb0:f62a, fe80::250:56ff:feb0:796a, f01-210-127.gridka.de/10.65.210.127]
02 Aug 2022 13:23:52 (WebDAV-f01-210-127) [NIC auto-discovery] Adding atlassrm-kit.gridka.de/2a00:139c:5:1d9:0:41:d2:90
02 Aug 2022 13:23:53 (WebDAV-f01-210-127) [NIC auto-discovery] Adding atlassrm-kit.gridka.de/2a00:139c:5:1d9:0:41:d2:7f
02 Aug 2022 13:23:58 (WebDAV-f01-210-127) [NIC auto-discovery] Adding [atlassrm-kit.gridka.de/192.108.47.144, atlassrm-kit.gridka.de/192.108.47.156]
02 Aug 2022 13:24:03 (WebDAV-f01-210-127) [NIC auto-discovery] Removing [atlassrm-kit.gridka.de/[2a00:139c:5:1d9:0:41:d2:90], atlassrm-kit.gridka.de/192.108.47.156]
02 Aug 2022 13:24:30 (System) [] Client session timed out, have not heard from server in 40013ms for sessionid 0x1000000e4aa37bd
02 Aug 2022 13:24:30 (System) [] Invalid config event received: {server.2=zk-server-3:2888:3888:participant, server.1=zk-server-2:2888:3888:participant, server.0=zk-server-1:2888:3888:participant, version=0}
02 Aug 2022 13:24:30 (System) [] Can't find our node. Resetting. Index: -1
Then much later this stack trace is logged:
02 Aug 2022 16:51:04 (WebDAV-f01-210-127) [Frontend-atlasdcacheweb-kit] Command failed due to a bug, please contact [email protected].
dmg.util.CommandPanicException: (1) Command failed: java.lang.NullPointerException
    at dmg.util.command.AcCommandExecutor.execute(AcCommandExecutor.java:161)
    at dmg.cells.nucleus.CellAdapter.executeCommand(CellAdapter.java:209)
    at org.dcache.cells.UniversalSpringCell.executeCommand(UniversalSpringCell.java:189)
    at dmg.cells.nucleus.CellAdapter$1.doExecute(CellAdapter.java:90)
    at org.dcache.util.cli.CommandInterpreter.command(CommandInterpreter.java:117)
    at dmg.cells.nucleus.CellAdapter$1.command(CellAdapter.java:106)
    at dmg.cells.nucleus.CellAdapter.command(CellAdapter.java:195)
    at dmg.cells.nucleus.CellAdapter.executeLocalCommand(CellAdapter.java:882)
    at dmg.cells.nucleus.CellAdapter.messageArrived(CellAdapter.java:812)
    at dmg.cells.nucleus.CellNucleus$DeliverMessageTask.run(CellNucleus.java:1274)
    at org.dcache.util.BoundedExecutor$Worker.run(BoundedExecutor.java:247)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$2(CellNucleus.java:727)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException: null
    at org.dcache.webdav.transfer.RemoteTransferHandler$RemoteTransfer.describe(RemoteTransferHandler.java:815)
    at org.dcache.webdav.transfer.RemoteTransferHandler.lambda$addTransfers$4(RemoteTransferHandler.java:643)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3605)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:502)
    at org.dcache.webdav.transfer.RemoteTransferHandler.addTransfers(RemoteTransferHandler.java:644)
    at org.dcache.webdav.DcacheResourceFactory.ac_get_door_info(DcacheResourceFactory.java:1358)
    at jdk.internal.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at dmg.util.command.AcCommandExecutor.execute(AcCommandExecutor.java:138)
    ... 14 common frames omitted
Again a CommandPanicException plus NPE, but with fewer omitted frames. Is that what you were hoping for?
So, could RemoteTransferManager@remoteTransferManagerDomain be the name of one of the (updated) RTM cells?
Yes, that was the name of the RTM we had before. It, too, was updated to the same dCache version, but should not have been started. Together with the update, two new domains were created to house two separate, new RTM cells, which are named differently for clarity.
Ciao,
Xavier.
Here is where I shut down the WebDAV door and the first messages after it was started again:
Thanks. The first stack-trace (Unexpected exception during shutdown) is unexpected. It looks like there's a race condition when shutting down: whether the tunnels (used for cell communication) shut down before the cell is explicitly killed. If that happens at "just the right time", then this stack-trace is printed.
Does this happen very often?
Again a CommandPanicException plus NPE, but with fewer omitted frames. Is that what you were hoping for?
Yes, indeed. I think I understand what is going wrong.
Patch: https://rb.dcache.org/r/13645/
Yes, that was the name of the RTM we had before.
OK, that part then makes sense.
I think I have an idea what went wrong and how to fix it.
Patch: https://rb.dcache.org/r/13646/
Hi Paul.
Does this happen very often?
I cannot tell. Such exceptions didn't stand out to me as exceptional, but rather seemed an inevitable by-product of shutting down a cell. Also, we don't shut down domains often at all (just for downtimes or to reload configuration), so there are very few data points relative to the time that has passed in between.
I think I have an idea what went wrong and how to fix it.
Your explanations on the patches are very sound, to me at least. 🙂
You may close this issue already, if you like. I doubt we will be able to update to whatever release will include these patches and then demonstrate under controlled conditions that the behavior has changed.
Thanks for the support,
Xavier.
We have another observation with two RemoteTransferManagers enabled on one instance at BNL. We observed some 5-10% failures in transfers. In the billing logs, the errors are related to post-transfer validation, like "Post-processing failed: No such file or directory: ...". On FTS, these errors appear as "Transfer failed: failure: RemoteTransferManager restarted".
Let us know if this comment is relevant here or it should go elsewhere.
@vingar,
Thanks for letting us know.
At first blush, this sounds like an unrelated problem.
Could you open a fresh issue for this?
Cheers, Paul.
@XMol,
The shutdown shouldn't generate any stack-traces (stack-traces == bugs), but I would probably need some more information to figure out what is going wrong.
I'd say we leave this issue focused on the two WebDAV-specific problems. If you see a stack-trace on shutdown, please feel free to open a fresh ticket.
Don't worry, this ticket should be closed automatically once the patches are committed :-)