Over-eager HTTP/2 stream error rate threshold
While testing HTTP/2 to origin with another ATS box acting as the origin, the setting proxy.config.http2.stream_error_rate_threshold is causing problems. The test scenario is below. I am running the HTTP/2-to-origin code on ATS1. ATS2 is running 9.0 with proxy.config.http2.stream_error_rate_threshold set to the default of 0.1:
UA -> ATS1 -> ATS2 -> O
In my tests, I see that one of the request cases directed to ATS2 has a large increase in ERR_CONN_FAIL once we turn on HTTP/2 to origin. Looking at diags.log on ATS2, I do not see any HTTP/2 error messages for the client IP corresponding to ATS1, which means we must be hitting the failure case in rcv_data_frame in Http2ConnectionState.cc shown below. The msg argument is nullptr, so no message will show up in diags.log.
return Http2Error(Http2ErrorClass::HTTP2_ERROR_CLASS_STREAM, Http2ErrorCode::HTTP2_ERROR_STREAM_CLOSED, nullptr);
In this case, ATS2 has received a DATA frame on a stream that has already been deleted, analogous to sending a TCP reset when receiving a packet for a socket that has been closed. This seems to happen fairly frequently in the HTTP/2 inbound case, hence the removal of the message.
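To make the threshold mechanics concrete, here is a minimal sketch of how a per-session stream error rate check like proxy.config.http2.stream_error_rate_threshold can work: each stream error bumps a counter, and once errors/streams exceeds the threshold the session is torn down with a GOAWAY. The struct and method names here are hypothetical, not the actual ATS implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-session tracker; illustrates the error-rate idea only.
struct SessionErrorTracker {
  uint32_t total_streams = 0; // streams opened on this session
  uint32_t stream_errors = 0; // stream-level errors (e.g. RST_STREAM sent)

  void on_stream_created() { ++total_streams; }
  void on_stream_error() { ++stream_errors; }

  // True when the session should send GOAWAY and stop accepting new streams.
  bool over_threshold(double threshold) const {
    if (total_streams == 0) {
      return false;
    }
    return static_cast<double>(stream_errors) / total_streams > threshold;
  }
};
```

With the default threshold of 0.1, it only takes a little more than one "crossed on the wire" DATA frame per ten streams to kill the whole session, which is why benign resets inflate ERR_CONN_FAIL so quickly.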
In my test case, all the communication to the problem domain is POST traffic, and the final origin (O) is known to have resource problems, so sometimes the connection to O will fail due to a timeout. In this scenario, ATS2 will return a 504 response header immediately and not read the POST body.
With the early-header handling fix (PR #7976), ATS1 will process the response header and give up on sending the POST body. However, some POST body frames will already have been sent to ATS2 before the 504 response header is returned. It is likely that a DATA frame and the response header will cross on the wire, resulting in a DATA frame arriving at ATS2 after the stream has been closed.
Once enough errors happen, a GOAWAY frame will be sent with no error code set. Either we process the GOAWAY frame by immediately closing all active streams on the session (which in my scenarios was 5-20 streams or more), or we stop adding new streams to the session but leave the existing streams to finish. Closing the active streams greatly increased the number of ERR_CONN_FAILs. Leaving the streams greatly increased the number of ABORTS and TIMEOUTS.
I plan on putting up two PRs. One increases the default for proxy.config.http2.stream_error_rate_threshold from 0.1 to 0.5 (which is what we are running on the ATS1 box; a different group manages the ATS2 box). The other excludes the STREAM_CLOSED reset from rcv_data_frame from the error count used to calculate the error rate.
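For reference, raising the threshold is a one-line change in records.config (0.5 is the value proposed above, not a shipped default):

```
CONFIG proxy.config.http2.stream_error_rate_threshold FLOAT 0.5
```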
I think this is due to the lack of a half-closed (local) state, and also #5675.
I agree on excluding common error cases from the error rate calculation. However, #8166 looks fragile since it relies on the existence of an error message, and the error without a message in rcv_data_frame (i.e. receiving DATA on a non-open stream) may not always be safe to exclude, because the function can be called in any stream state. If we receive a DATA frame before HEADERS, that is obviously an error which we should not exclude.
I think we can exclude the error only if the stream is in the half-closed (local) state. DATA frames may already have been on the wire when ATS sent the 504 response, and a RST_STREAM telling the client not to send any more data on that stream is innocuous.
After sending a RST_STREAM in the half-closed (local) state, the stream transitions to the closed state. Then ATS should ignore frames for a while, as described in #5675. If a client keeps sending frames to streams in the closed state, that should be counted as an error.
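The state-based exclusion rule above could be sketched like this. The enum and function names are hypothetical (not the actual ATS code), and the real fix would also need the grace period from #5675 before counting frames on closed streams:

```cpp
#include <cassert>

// Simplified HTTP/2 stream states relevant to the discussion.
enum class StreamState { IDLE, OPEN, HALF_CLOSED_LOCAL, CLOSED };

// Should a rejected DATA frame on a stream in this state count toward
// proxy.config.http2.stream_error_rate_threshold?
bool data_frame_error_counts_toward_rate(StreamState state) {
  switch (state) {
  case StreamState::HALF_CLOSED_LOCAL:
    // The client's DATA likely crossed our early response / RST_STREAM
    // on the wire; resetting it is innocuous, so do not count it.
    return false;
  case StreamState::IDLE:   // DATA before HEADERS: genuine protocol error.
  case StreamState::OPEN:   // DATA rejected for some other reason.
  case StreamState::CLOSED: // Client kept sending after RST_STREAM
                            // (past any #5675-style grace period).
  default:
    return true;
  }
}
```

This keeps the protection intact for misbehaving clients while not penalizing the benign crossed-on-the-wire case.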
This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.