trafficserver icon indicating copy to clipboard operation
trafficserver copied to clipboard

Fix issue where origins could be unintentionally marked as down

Open yknoya opened this issue 1 month ago • 0 comments

Problem

https://github.com/apache/trafficserver/pull/9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.

This issue occurs under the following conditions:

  1. proxy.config.http.server_session_sharing.match is set to a value other than none (i.e., server session reuse is enabled).
  2. A server session is reused when connecting to the origin.
  3. The connection is closed after sending a request to the origin.
  4. Condition 3 occurs repeatedly until it reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries.

The issue has been confirmed in the following branches/versions (other versions not tested):

  • master (90dbc21)
  • 10.1.0
  • 9.2.11

Cause

When ATS begins processing an origin connection, it executes t_state.set_connect_fail(EIO) to tentatively set connect_result to EIO: https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L8054 https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/include/proxy/http/HttpTransact.h#L932

If server session reuse is not possible, connect_result is cleared once the connection is established: https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L1860

However, when a server session is reused, connect_result is not cleared and remains set to EIO. This regression was triggered by the change introduced in https://github.com/apache/trafficserver/pull/9181 .

Before the PR was merged, t_state.set_connect_fail(EIO) was not executed when a server session was reused. After the PR, it is executed regardless of whether a server session is reused or not.

With connect_result incorrectly left as EIO, if the connection is closed after sending a request to the origin, the following call chain leads to execution of HttpSM::mark_host_failure, causing the fail_count to be incremented:

  1. https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpTransact.cc#L3466
  2. https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpTransact.cc#L3786
  3. https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpTransact.cc#L3884
  4. https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L4630
  5. https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L5876

If this happens repeatedly and reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down: https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L5876-L5885

Since the connection to the origin is actually successful, marking it as down is incorrect.

Fix

Update the logic so that t_state.set_connect_fail(EIO) is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure that connect_result is cleared once the connection succeeds.

Additionally, when multiplexed_origin is true, connect_result was also not being cleared after a successful connection. In this case, although t_state.set_connect_fail(EIO) is executed (see below), the lack of a corresponding clear operation results in connect_result remaining EIO: https://github.com/apache/trafficserver/blob/90dbc21a541986db6b223649cb4f798e190a550f/src/proxy/http/HttpSM.cc#L5706-L5723

This patch ensures that connect_result is cleared whenever the connection succeeds, regardless of whether multiplexed_origin is enabled.

yknoya avatar Dec 03 '25 23:12 yknoya