trafficserver Crash HttpSM::setup_cache_write_transfer while handling redirect

We're seeing a crash related to an interaction between cache and handling redirects:

#0  0x0000000000587c7d in HttpSM::setup_cache_write_transfer (name=0x8e1ab5 "cache write", skip_bytes=1395, store_info=0x7fa31d298180, source_vc=0x7fadc0aedc90, c_sm=0x7fa31d29a430, this=0x7fa31d298000)
    at /sd/workspace/src/git.ouryahoo.com/Edge/build/_scm/trafficserver10.0/src/proxy/http/HttpSM.cc:6770
#1  HttpSM::perform_cache_write_action (this=this@entry=0x7fa31d298000) at /sd/workspace/src/git.ouryahoo.com/Edge/build/_scm/trafficserver10.0/src/proxy/http/HttpSM.cc:6425
#2  0x000000000059d25e in HttpSM::handle_api_return (this=0x7fa31d298000) at /sd/workspace/src/git.ouryahoo.com/Edge/build/_scm/trafficserver10.0/src/proxy/http/HttpSM.cc:1630
#3  0x000000000059731e in HttpSM::state_api_callout (this=0x7fa31d298000, event=<optimized out>) at /sd/workspace/src/git.ouryahoo.com/Edge/build/_scm/trafficserver10.0/src/proxy/http/HttpSM.cc:1491
#4  0x000000000059e4a6 in HttpSM::state_api_callback (this=this@entry=0x7fa31d298000, event=event@entry=60000, data=data@entry=0x0)
    at /sd/workspace/src/git.ouryahoo.com/Edge/build/_scm/trafficserver10.0/src/proxy/http/HttpSM.cc:1295
#5  0x00007fadce570813 in TSHttpTxnReenable (txnp=0x7fa31d298000, event=TS_EVENT_HTTP_CONTINUE) at /sd/workspace/src/git.ouryahoo.com/Edge/build/_scm/trafficserver10.0/src/api/InkAPI.cc:4935
#6  0x00007fac5029cc76 in ?? ()
#7  0x00007fa31d298000 in ?? ()
#8  0x00007fa81fcac1c0 in ?? ()
#9  0x0000000000000000 in ?? ()

Some interesting variables @JosiahWI asked me to look at:

(gdb) p t_state.redirect_info.redirect_in_process
$1 = true

(gdb) f 1
(gdb) p t_state.cache_info.action
$2 = HttpTransact::CACHE_DO_WRITE

(gdb) p t_state.cache_info.write_lock_state
$1 = HttpTransact::CACHE_WL_SUCCESS

This is likely related to: https://github.com/apache/trafficserver/pull/11542

Aug 16 '24 00:08 bneradt

cache_sm->cache_write_vc == nullptr

Aug 16 '24 00:08 bryancall

I've studied the code and written an AuTest to attempt a reproduction (not successful yet).

Investigation

We do, at some point prior to deciding to execute the write, open a cache write connection. This is because it's not possible to have a CACHE_WL_SUCCESS state unless that happens, and that is the precondition for deciding to do a write instead of open a new connection when a redirected request gets a response.
As @bryancall discovered, the connection we opened has vanished when we attempt the write. There are two places where we set the cache_write_vc to nullptr because we moved ownership to another object, and one where we initialize it and it could become nullptr if that's what it was initialized with.
The reason we didn't see this crash before is that we never got this far in the logic. #9275 would cause us to try to take out the write lock we already held and hang until we timed out and decided to skip the cache write altogether. Now that we get far enough to attempt to do the write, we're reaching states that were impossible to reach previously.

Reproduction Ideas

It's not particularly straightforward to reproduce given what we know so far. My first attempt was to force a redirection using the escalate plugin. This mirrors how we think the crash we observed starts. The escalate plugin overrides all the redirection configuration, and I wanted to make sure the timing of the state change on the plugin's hook wasn't the cause of the issue. There were no issues with this setup and it works exactly as expected.

My conclusion from all this is that something special needs to happen during the redirection, interfering with it and getting rid of the cache write VC while the redirection request or response is being processed. In the crash we're seeing this is very likely initiated by another plugin on one of the request hooks. I suspect this may also be timing dependent.

The only two places in the state machine where the write connection is directly set to nullptr are:

void
HttpSM::setup_cache_write_transfer(HttpCacheSM *c_sm, VConnection *source_vc, HTTPInfo *store_info, int64_t skip_bytes,
                                   const char *name)
{
  ink_assert(c_sm->cache_write_vc != nullptr);
  ink_assert(t_state.request_sent_time > 0);
  ink_assert(t_state.response_received_time > 0);

  store_info->request_sent_time_set(t_state.request_sent_time);
  store_info->response_received_time_set(t_state.response_received_time);

  c_sm->cache_write_vc->set_http_info(store_info);
  store_info->clear();

  tunnel.add_consumer(c_sm->cache_write_vc, source_vc, &HttpSM::tunnel_handler_cache_write, HT_CACHE_WRITE, name, skip_bytes);

  c_sm->cache_write_vc = nullptr;
}

  // in HttpSM::perform_cache_write_action
  case HttpTransact::CACHE_DO_WRITE:
  case HttpTransact::CACHE_DO_REPLACE:
    // Fix need to set up delete for after cache write has
    //   completed
    if (transform_info.entry == nullptr || t_state.api_info.cache_untransformed == true) {
      cache_sm.close_read();
      t_state.cache_info.write_status = HttpTransact::CACHE_WRITE_IN_PROGRESS;
      setup_cache_write_transfer(&cache_sm, server_entry->vc, &t_state.cache_info.object_store, client_response_hdr_bytes,
                                 "cache write");
    } else {
      // We are not caching the untransformed.  We might want to
      //  use the cache writevc to cache the transformed copy
      ink_assert(transform_cache_sm.cache_write_vc == nullptr);
      transform_cache_sm.cache_write_vc = cache_sm.cache_write_vc;
      cache_sm.cache_write_vc           = nullptr;
    }
    break;

Aug 16 '24 12:08 JosiahWI

Attaching the gdb output of the HttpSM history variable:

history.txt

Aug 16 '24 16:08 bneradt

The following patch provided by @JosiahWI makes things stable:


diff --git a/src/proxy/http/HttpSM.cc b/src/proxy/http/HttpSM.cc
index a54cc7ccc..46c50b20d 100644
--- a/src/proxy/http/HttpSM.cc
+++ b/src/proxy/http/HttpSM.cc
@@ -6460,6 +6460,7 @@ HttpSM::perform_cache_write_action()
     // Write close deletes the old alternate
     cache_sm.close_write();
     cache_sm.close_read();
+    t_state.cache_info.write_lock_state = HttpTransact::CACHE_WL_INIT;
     break;
   }
 
@@ -6518,6 +6519,7 @@ HttpSM::issue_cache_update()
   }
   // Now close the write which commits the update
   cache_sm.close_write();
+  t_state.cache_info.write_lock_state = HttpTransact::CACHE_WL_INIT;
 }
 
 int

Apr 28 '25 15:04 bneradt