dcache icon indicating copy to clipboard operation
dcache copied to clipboard

Uncaught exception in GFTP

Open DmitryLitvintsev opened this issue 3 years ago • 5 comments

After upgrade to 7.2 the following exception is seen. I do not believe I have seen it before.

01 Nov 2021 14:37:05 (GFTP0-fndca4b-AAXPv01F4lA) [door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain] Uncaught exception in thread GFTP0-fndca4b-io-51
io.netty.handler.codec.TooLongFrameException: frame length (84173) exceeds the allowed maximum (65536)
        at io.netty.handler.codec.LineBasedFrameDecoder.fail(LineBasedFrameDecoder.java:158)
        at io.netty.handler.codec.LineBasedFrameDecoder.fail(LineBasedFrameDecoder.java:154)
        at io.netty.handler.codec.LineBasedFrameDecoder.decode(LineBasedFrameDecoder.java:141)
        at io.netty.handler.codec.LineBasedFrameDecoder.decode(LineBasedFrameDecoder.java:84)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at org.dcache.util.CDCThreadFactory.lambda$newThread$0(CDCThreadFactory.java:50)
        at java.base/java.lang.Thread.run(Thread.java:829)

It is always the same frame length : 84173 And it looks to be ~ 200 / day frequency.

DmitryLitvintsev avatar Nov 01 '21 19:11 DmitryLitvintsev

pipeline.addBefore(self, "frameDecoder", new LineBasedFrameDecoder(KiB.toBytes(64)));

in the NettyLineBasedDoor. Where does this max size come from? The question of course is why are we getting the oversized frame. That is, what would not be observing the 64K limit?

alrossi avatar Nov 01 '21 21:11 alrossi

access file for this interaction:

level=INFO ts=2021-11-01T14:37:05.685-0500 event=org.dcache.ftp.hello session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain socket.remote=131.225.67.13:43624 socket.local=131.225.69.121:2811 reply="220 GSI FTP door ready"
level=INFO ts=2021-11-01T14:37:05.687-0500 event=org.dcache.ftp.response session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain command="AUTH GSSAPI" reply="334 ADAT must follow"
level=INFO ts=2021-11-01T14:37:05.702-0500 event=org.dcache.ftp.response session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain command="ADAT ..." reply="335 ADAT=..."
level=INFO ts=2021-11-01T14:37:05.709-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:43624 session=door:GFTP0-fndca4b-AAXPv01F4lA@gridftp0-fndca4bDomain

DmitryLitvintsev avatar Nov 01 '21 22:11 DmitryLitvintsev

Some ideas/questions:

  1. Are these problems coming from that same remote host (or a limited set of remote hosts) ?
  2. Assuming yes, are all transfers from that host (or "those hosts") always failing?
  3. Can we capture the network traffic for one of these failures, to see what the client is sending?
  4. Can we find the user responsible for these failures and ask them what they're doing?

paulmillar avatar Nov 03 '21 12:11 paulmillar

Excellent suggestions, Paul, as always. The Issue has been created as a place holder to go back to if team has no immediate ideas what that might be. It is bloodbath here at Fermilab since upgrade. Once I have a little more time to dedicate to this issue all these avenues will be explored.

DmitryLitvintsev avatar Nov 03 '21 14:11 DmitryLitvintsev

 doors=`zgrep -B1 'TooLongFrameException' gridftp0-fndca4bDomain.log-20211102.gz  | grep GFTP  | awk '{ print $5}' | sed -e 's/[\(\)]//g'`

[root@fndca4b dcache-log]# for d in $doors ; do  zgrep "$d" gridftp0-fndca4bDomain.access.2021-11-01.gz | grep "host"; done
level=INFO ts=2021-11-01T21:00:05.423-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47088 session=door:GFTP0-fndca4b-AAXPxKb4KDA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:23.555-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47090 session=door:GFTP0-fndca4b-AAXPxKgNGqA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:24.592-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47092 session=door:GFTP0-fndca4b-AAXPxKgc5Zg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:24.625-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47094 session=door:GFTP0-fndca4b-AAXPxKgdamg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:00:37.765-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47100 session=door:GFTP0-fndca4b-AAXPxKjl4rg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:02:25.089-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47114 session=door:GFTP0-fndca4b-AAXPxK9LgTA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:03:56.306-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47116 session=door:GFTP0-fndca4b-AAXPxLS7UeA@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:03:57.339-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47118 session=door:GFTP0-fndca4b-AAXPxLTLNEg@gridftp0-fndca4bDomain
level=INFO ts=2021-11-01T21:03:57.375-0500 event=org.dcache.ftp.disconnect host.remote=131.225.67.13:47120 session=door:GFTP0-fndca4b-AAXPxLTLsUg@gridftp0-fndca4bDomain

same host.

There are successful transfers :

billing=# select sum(case when errorcode <> 0 then 1 else 0 end) as errors, sum(case when errorcode <> 0 then 0 else 1 end) as success, protocol from billinginfo where client='131.225.67.13' and datestamp > current_date - interval '2 days'  group by protocol;
 errors | success |  protocol  
--------+---------+------------
      0 |     537 | NFS4-4.1
      0 |     991 | Xrootd-5.0
(2 rows)

This seems like interactive node. I will ask user what are they doing.

DmitryLitvintsev avatar Nov 03 '21 14:11 DmitryLitvintsev