spring-cloud-gateway icon indicating copy to clipboard operation
spring-cloud-gateway copied to clipboard

Some requests are failing when using HEAD requests

Open vtkhir opened this issue 4 years ago • 9 comments

Describe the bug Some requests are failing randomly when HEAD requests and chunked responses are used. Noticed that Netty sometimes tries to read a last "zero" chunk as a start of HTTP response. Netty tries to determine the HTTP protocol version from last "zero" chunk.

2021-11-08 16:59:26.943 DEBUG 24184 --- [ctor-http-nio-8] r.n.channel.ChannelOperationsHandler     : [4a6c80b6, L:/10.1.222.229:64941 - R:destination-service-balancer-1447849684.us-east-1.elb.amazonaws.com/54.211.139.210:8080] Decoding failed: DefaultFullHttpResponse(decodeResult: failure(java.lang.IllegalArgumentException: invalid version format: 0), version: HTTP/1.0, content: UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf(ridx: 0, widx: 0, cap: 0))
HTTP/1.0 999 Unknown : 

java.lang.IllegalArgumentException: invalid version format: 0
	at io.netty.handler.codec.http.HttpVersion.<init>(HttpVersion.java:116) ~[netty-codec-http-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.codec.http.HttpVersion.valueOf(HttpVersion.java:78) ~[netty-codec-http-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.codec.http.HttpResponseDecoder.createMessage(HttpResponseDecoder.java:157) ~[netty-codec-http-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.codec.http.HttpObjectDecoder.decode(HttpObjectDecoder.java:273) ~[netty-codec-http-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.codec.http.HttpClientCodec$Decoder.decode(HttpClientCodec.java:238) ~[netty-codec-http-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[netty-codec-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[netty-codec-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[netty-codec-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[netty-handler-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[netty-transport-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[netty-common-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.69.Final.jar:4.1.69.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.69.Final.jar:4.1.69.Final]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]

Attached log file. Failed request is GET /api/timeout-scg/test?id=99b6f3f0-abd6-4af8-8145-1af9a1ce0cd7. This unstable behavior blocks our migration from Zuul-based implementation to SCG.

Sample I created a destination service for demo purposes and deployed it to EC2. This service is accessible via AWS load balancer. To reproduce the issue:

JDK vesion: OpenJdk64-11.0.10 OS: MacOS 11 & Ubuntu 18.04

vtkhir avatar Nov 08 '21 16:11 vtkhir

We identified that the issue is reproducible only with classic AWS load balancers. No issue when an application load balancer is used

vtkhir avatar Nov 09 '21 12:11 vtkhir

Given the stack trace, this is likely a netty bug

spencergibb avatar Mar 02 '22 18:03 spencergibb

The community support is very disappointing. 3 months of silence and closing the issue with the keyword "likely"...

vtkhir avatar Mar 02 '22 20:03 vtkhir

I am not a netty expert. I can ask @violetagg if she can take a look

spencergibb avatar Mar 02 '22 20:03 spencergibb

I am not a netty expert. I can ask @violetagg if she can take a look

I'll take a look at the provided details

violetagg avatar Mar 02 '22 21:03 violetagg

it still happens with boot 2.6.3

spencergibb avatar Mar 02 '22 21:03 spencergibb

Noticed that Netty sometimes tries to read a last "zero" chunk as a start of HTTP response. Netty tries to determine the HTTP protocol version from last "zero" chunk.

This is not a zero chunk as there is no Transfer-encoding. This is an unexpected content sent after the response.

The response from the destination service for some HEAD requests is strange:

This is from the Reactor Netty wiretap - client port 52200 (see at the end of the response)

2022-03-08 19:52:26.151 DEBUG 2745 --- [or-http-epoll-3] reactor.netty.http.client.HttpClient     : [e1a87c22-2, L:/IP:52200 - R:IP:8080] READ: 115B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 48 54 54 50 2f 31 2e 31 20 32 30 30 20 0d 0a 43 |HTTP/1.1 200 ..C|
|00000010| 6f 6e 74 65 6e 74 2d 54 79 70 65 3a 20 61 70 70 |ontent-Type: app|
|00000020| 6c 69 63 61 74 69 6f 6e 2f 6a 73 6f 6e 0d 0a 44 |lication/json..D|
|00000030| 61 74 65 3a 20 54 75 65 2c 20 30 38 20 4d 61 72 |ate: Tue, 08 Mar|
|00000040| 20 32 30 32 32 20 31 39 3a 35 32 3a 32 35 20 47 | 2022 19:52:25 G|
|00000050| 4d 54 0d 0a 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 |MT..Connection: |
|00000060| 6b 65 65 70 2d 61 6c 69 76 65 0d 0a 0d 0a 30 0d |keep-alive....0.|
|00000070| 0a 0d 0a                                        |...             |
+--------+-------------------------------------------------+----------------+

And this is the tcp dump

Screenshot 2022-03-08 at 22 25 01

In comparison this is a GOOD HEAD response

wiretap - client port 52202:

2022-03-08 19:52:25.789 DEBUG 2745 --- [or-http-epoll-2] reactor.netty.http.client.HttpClient     : [1a081fb1-1, L:/IP:52202 - R:IP:8080] READ: 110B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 48 54 54 50 2f 31 2e 31 20 32 30 30 20 0d 0a 43 |HTTP/1.1 200 ..C|
|00000010| 6f 6e 74 65 6e 74 2d 54 79 70 65 3a 20 61 70 70 |ontent-Type: app|
|00000020| 6c 69 63 61 74 69 6f 6e 2f 6a 73 6f 6e 0d 0a 44 |lication/json..D|
|00000030| 61 74 65 3a 20 54 75 65 2c 20 30 38 20 4d 61 72 |ate: Tue, 08 Mar|
|00000040| 20 32 30 32 32 20 31 39 3a 35 32 3a 32 35 20 47 | 2022 19:52:25 G|
|00000050| 4d 54 0d 0a 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 |MT..Connection: |
|00000060| 6b 65 65 70 2d 61 6c 69 76 65 0d 0a 0d 0a       |keep-alive....  |
+--------+-------------------------------------------------+----------------+

tcpdump:

Screenshot 2022-03-08 at 22 21 15

@vtkhir Can you take a tcp dump from the destination service so that we can see what exactly was sent by Tomcat?

violetagg avatar Mar 08 '22 21:03 violetagg

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

spring-cloud-issues avatar Mar 15 '22 22:03 spring-cloud-issues

Hey @violetagg @spencergibb appreciate you're looking into this issue, we're following it!

Seems we found what exactly causes incompatibility with SCG - the issue is reproducible only with AWS Classic Load Balancer. So we decided to try the Application AWS Load Balancer, and considered no unexpected content sent after the response. That approached us to change production AWS setup and successfully finalise migration to latest SCG

oserohin avatar Mar 16 '22 07:03 oserohin

@vtkhir We have the same problem. Have you solved it

johnson622 avatar Oct 04 '22 10:10 johnson622

Hello @johnson622 Yeap, just moved ahead with migration all outdated AWS CLBs to ALBs as described in previous comment

oserohin avatar Oct 04 '22 11:10 oserohin

Hello @johnson622 Yeap, just moved ahead with migration all outdated AWS CLBs to ALBs as described in previous comment

@oserohin Thank you very much. What can we do now? Our production environment has been abnormal, do i need to wait for your repair?When can you repair it?

johnson622 avatar Oct 06 '22 08:10 johnson622

Closing as it is an issue on AWS

spencergibb avatar Jan 09 '23 23:01 spencergibb