Fix issue where origins could be unintentionally marked as down
Problem
https://github.com/apache/trafficserver/pull/9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.
This issue occurs under the following conditions:
proxy.config.http.server_session_sharing.matchis set to a value other thannone(i.e., server session reuse is enabled).- A server session is reused when connecting to the origin.
- The connection is closed after sending a request to the origin.
- Condition 3 occurs repeatedly until it reaches the threshold defined by
proxy.config.http.connect_attempts_rr_retries.
The issue has been confirmed in the following branches/versions (other versions not tested):
- master (90dbc21)
- 10.1.0
- 9.2.11
Cause
When ATS begins processing an origin connection, it executes t_state.set_connect_fail(EIO) to tentatively set connect_result to EIO:
https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L8054
https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/include/proxy/http/HttpTransact.h#L932
If server session reuse is not possible, connect_result is cleared once the connection is established:
https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L1860
However, when a server session is reused, connect_result is not cleared and remains set to EIO.
This regression was triggered by the change introduced in https://github.com/apache/trafficserver/pull/9181 .
Before the PR was merged, t_state.set_connect_fail(EIO) was not executed when a server session was reused.
After the PR, it is executed regardless of whether a server session is reused or not.
With connect_result incorrectly left as EIO, if the connection is closed after sending a request to the origin, the following call chain leads to execution of HttpSM::mark_host_failure, causing the fail_count to be incremented:
- https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpTransact.cc#L3466
- https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpTransact.cc#L3786
- https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpTransact.cc#L3884
- https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L4630
- https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L5876
If this happens repeatedly and reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down:
https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L5876-L5885
Since the connection to the origin is actually successful, marking it as down is incorrect.
Fix
Update the logic so that t_state.set_connect_fail(EIO) is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure that connect_result is cleared once the connection succeeds.
Additionally, when multiplexed_origin is true, connect_result was also not being cleared after a successful connection.
In this case, although t_state.set_connect_fail(EIO) is executed (see below), the lack of a corresponding clear operation results in connect_result remaining EIO:
https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L5706-L5723
This patch ensures that connect_result is cleared whenever the connection succeeds, regardless of whether multiplexed_origin is enabled.