cache write lock not released when following redirect and no parent
Configuration
- ATS 8.1.x, single-tier CDN (no parent caches, just edge caches and origins)
- proxy.config.http.parent_proxy_routing_enable set to 0 or proxy.config.http.parent_proxy_routing_enable set to 1 with empty parent.config
- origin is a redirecting origin that always responds with a 307 to the "actual" origin
- hosting.config uses volume 2 for host of original request, but has no entry for IP address of redirected host
The basic gist of what is happening is:
- the first request for an asset on a DS comes in and is remapped to origin, which is mapped to volume 2 (ramdisk) by hosting.config
- the object is not found and results in the expected cache miss
- ATS opens a write lock on the ramdisk for the asset and issues a request to the origin to fetch it
- the origin responds with a 307
- since ATS is configured to follow redirections, a new read request is issued with the new URL
- the host of the new URL is not mapped to ramdisk by hosting.config, so ATS looks for the object on the "generic" volume 1
- the object is also not found on volume 1, so ATS takes a write lock on that volume and issues a request to the "real origin" to fetch
- object is fetched successfully, written to volume 1, and streamed to client
- volume 1 cache write lock is released
- volume 2 (ramdisk) cache write lock is never released
- When subsequent requests for the same asset are received (read_while_writer enabled):
-
- ATS looks for the object in ramdisk, since that is were the remapped origin is supposed to store content
-
- Because of the write lock, read waits for any content to appear in the ramdisk cache, which will never happen
-
- read retries the configured number of times, then fails and issues its own write request to origin
-
- ATS fails the configured number of times to open the write lock (because the first request still has it), then just goes straight to origin
-
- origin returns a 307 response to the same location as before and ATS follows it
-
- ATS looks in volume 1 based on the host of the "real origin", finds the content in cache and returns a HIT
-
- but by this time the client has already given up and aborted, depending on how many retries were configured and how long each waited
We tried setting proxy.config.http.redirect_use_orig_cache_key to 1, but it appears to make no difference.
We saw this when a customer upgraded from a system that had previously been using ATC 3.x which never generated empty parent.config files. If we restore the following default line to parent.config which actually has no parents in it, it appears to solve the issue:
dest_domain=. parent="" round_robin=consistent_hash go_direct=false qstring=ignore
The difference appears to be that in the case of empty parent.config or proxy routing enable set to 0, the 307 is treated via HandleCacheMiss and a new ISSUE_WRITE is performed which opens a new CacheVC on the default volume. When the default line is in parent.config, this code is executed instead:
else if (s->dns_info.lookup_name[0] <= '9' && s->dns_info.lookup_name[0] >= '0' && s->parent_params->parent_table->hostMatch &&
!s->http_config_param->no_dns_forward_to_parent) {
// note, broken logic: ACC fudges the OR stmt to always be true,
// 'AuthHttpAdapter' should do the rev-dns if needed, not here .
TRANSACT_RETURN(SM_ACTION_DNS_REVERSE_LOOKUP, HttpTransact::StartAccessControl);
}
In this case there is a message from decideCacheLookup "will NOT do lookup" and ATS just reuses the CacheVC that was opened for the original lookup and actually writes to the ramdisk so that future requests for the object get the immediate cache hit in the ramdisk as expected. But this looks like it only worked because the redirect was to an IP address and not to another fqdn.
It seems either something isn't quite right with the redirect following or something that is supposed to free the first CacheVC got missed, leaving the write lock in place.
@bdgranger does the same problem exist in 9.1?
@bdgranger does the same problem exist in 9.1?
@ywkaras we will have to test this. Will get back to you. Looks like the OSDNSLookup() method has substantial changes since 8.1.x branch
@bdgranger any updates?
@ywkaras
Sorry it took me so long to notice this question. As of now, I have been tied up in other issues and have not been able to test with 9.x. We have worked around the issue for now by making sure that parent.config always has a default "dest_domain=. ..." line in it and there is then no problem.
This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import functools
from typing import Any, Callable, Dict, Optional
from ports import get_port
Test.Summary = 'Tests SNI port-based routing'
TestParams = Dict[str, Any]
class TestRamdiskRedirect:
"""Configure a test for reproducing #9275."""
replay_filepath_one: str = "ramdisk-redirect.replay.yaml"
replay_filepath_two: str = "ramdisk-redirect-ok.replay.yaml"
client_counter: int = 0
server_counter: int = 0
ts_counter: int = 0
def __init__(self, name: str, /, autorun) -> None:
"""Initialize the test.
:param name: The name of the test.
"""
self.name = name
self.autorun = autorun
def _init_run(self) -> "TestRun":
"""Initialize processes for the test run."""
tr = Test.AddTestRun(self.name)
server_one = TestRamdiskRedirect.configure_server(tr, "oof.com")
self._configure_traffic_server(tr, server_one)
dns = TestRamdiskRedirect.configure_dns(tr, server_one, self._dns_port)
tr.Processes.Default.StartBefore(server_one)
tr.Processes.Default.StartBefore(dns)
tr.Processes.Default.StartBefore(self._ts)
return {
"tr": tr,
"ts": self._ts,
"server_one": server_one,
"port_one": self._port_one,
}
@classmethod
def runner(cls, name: str, autorun: bool = True) -> Optional[Callable]:
"""Create a runner for a test case.
:param autorun: Run the test case once it's set up. Default is True.
"""
test = cls(name, autorun=autorun)._prepare_test_case
return test
def _prepare_test_case(self, func: Callable) -> Callable:
"""Set up a test case and possibly run it.
:param func: The test case to set up.
"""
functools.wraps(func)
test_params = self._init_run()
def wrapper(*args, **kwargs) -> Any:
return func(test_params, *args, **kwargs)
if self.autorun:
wrapper()
return wrapper
@staticmethod
def configure_server(tr: "TestRun", domain: str):
server = tr.AddVerifierServerProcess(f"server{TestRamdiskRedirect.server_counter}.{domain}",
TestRamdiskRedirect.replay_filepath_one, other_args="--format \"{url}\"")
TestRamdiskRedirect.server_counter += 1
return server
@staticmethod
def configure_dns(tr: "TestRun", server_one: "Process", dns_port: int):
dns = tr.MakeDNServer("dns", port=dns_port, default="127.0.0.1")
return dns
def _configure_traffic_server(self, tr: "TestRun", server_one: "Process"):
"""Configure Traffic Server.
:param tr: The TestRun object to associate the ts process with.
"""
ts = tr.MakeATSProcess(f"ts-{TestRamdiskRedirect.ts_counter}", select_ports=False, enable_tls=True, enable_cache=True)
TestRamdiskRedirect.ts_counter += 1
self._port_one = get_port(ts, "PortOne")
self._dns_port = get_port(ts, "DNSPort")
ts.Disk.records_config.update(
{
'proxy.config.http.server_ports': f"{self._port_one}",
'proxy.config.diags.debug.enabled': 1,
'proxy.config.diags.debug.tags': "cache|dns|http|redirect|remap",
'proxy.config.dns.nameservers': f"127.0.0.1:{self._dns_port}",
'proxy.config.dns.resolv_conf': 'NULL',
'proxy.config.http.redirect.actions': "self:follow",
'proxy.config.http.number_of_redirections': 1,
})
ts.Disk.remap_config.AddLine(f"map oof.com http://oof.backend.com:{server_one.Variables.http_port}")
ts.Disk.storage_config.AddLine("storage 128M")
ts.Disk.hosting_config.AddLine("hostname=* volume=1")
ts.Disk.volume_config.AddLine("volume=1 scheme=http size=100% ramcache=false")
self._ts = ts
# Tests start.
@TestRamdiskRedirect.runner("Redirect to same origin with cache - boom!")
def test1(params: TestParams) -> None:
# This sends a request to origin, which returns 307 temporary redirect with
# a different path on the same origin. ATS is configured to follow the
# redirect, so it prepares to hit the same origin with the new path in the
# `location` header of the 307 response. Before issuing the request to
# origin, we just need to check the cache. The lock on the cache hasn't
# been released after the 307 response, so ATS hangs, and the connection
# dies after a timeout.
client = params["tr"].AddVerifierClientProcess(
f"client0", TestRamdiskRedirect.replay_filepath_one, http_ports=[params["port_one"]], other_args="--format \"{url}\" --keys \"/a/path/resource\"")
params["tr"].Processes.Default.ReturnCode = 0
Reproduced on 10.0.x.
Reproduced on latest ATS (e274dd1e076effc5ea9fc2eb51f549c6f5820448).
This is the last few lines of traffic.out, showing the state transition from the redirect to a cache lookup, followed by a 5 second hang and the connection dying. Turning off the cache (by setting enable_cache=False in the MakeATSProcess call) in the previously posted AuTest will make the redirect succeed.
[Jul 10 18:51:19.379] [ET_NET 1] DIAG: <HttpSM.cc:7782 (call_transact_and_set_next_state)> (http) [0] State Transition: SM_ACTION_REDIRECT_READ -> SM_ACTION_CACHE_LOOKUP
[Jul 10 18:51:19.379] [ET_NET 1] DIAG: <HttpSM.cc:4947 (do_cache_lookup_and_read)> (http_seq) [0] Issuing cache lookup for URL http://oof.backend.com:61000/a/path/new_resource
[Jul 10 18:51:19.379] [ET_NET 1] DIAG: <HttpCacheSM.cc:106 (state_cache_open_read)> (http_cache) [0] [&HttpCacheSM::state_cache_open_read, CACHE_EVENT_OPEN_READ_FAILED/TS_EVENT_CACHE_OPEN_READ_FAILED]
[Jul 10 18:51:24.359] [ET_NET 1] DIAG: <HttpSM.cc:2608 (main_handler)> (http) [0] VC_EVENT_EOS/TS_EVENT_VCONN_EOS, 104
[Jul 10 18:51:24.359] [ET_NET 1] DIAG: <HttpSM.cc:831 (state_watch_for_client_abort)> (http) [0] [&HttpSM::state_watch_for_client_abort, VC_EVENT_EOS/TS_EVENT_VCONN_EOS]
Note that the above AuTest is a simplified version of the detailed steps by @bdgranger. In this setup the origin redirects to itself, and the client's first request fails. The steps described in the original report require one request by the client for ATS to grab the lock, and a second request for ATS to hang, since the redirect hits a new origin that doesn't use the same cache volume.
The issue still persists after adding the following config to the AuTest:
ts.Disk.parent_config.AddLine("dest_domain=. parent="" round_robin=consistent_hash go_direct=false qstring=ignore")
EDIT: Looks like this failed the test because this config wasn't valid in ATS master.
~~In ATS master this fails on the cache read, and doesn't even make it to a cache miss state. With debug enabled, it triggers ink_asserts in the HttpCacheSM. If the explicit cache configs are removed, but the cache is enabled, then it fails taking out the write lock as described in the original report.~~