trafficserver icon indicating copy to clipboard operation
trafficserver copied to clipboard

cache write lock not released when following redirect and no parent

Open bdgranger opened this issue 3 years ago • 5 comments

Configuration

  • ATS 8.1.x, single-tier CDN (no parent caches, just edge caches and origins)
  • proxy.config.http.parent_proxy_routing_enable set to 0 or proxy.config.http.parent_proxy_routing_enable set to 1 with empty parent.config
  • origin is a redirecting origin that always responds with a 307 to the "actual" origin
  • hosting.config uses volume 2 for host of original request, but has no entry for IP address of redirected host

The basic gist of what is happening is:

  • the first request for an asset on a DS comes in and is remapped to origin, which is mapped to volume 2 (ramdisk) by hosting.config
  • the object is not found and results in the expected cache miss
  • ATS opens a write lock on the ramdisk for the asset and issues a request to the origin to fetch it
  • the origin responds with a 307
  • since ATS is configured to follow redirections, a new read request is issued with the new URL
  • the host of the new URL is not mapped to ramdisk by hosting.config, so ATS looks for the object on the "generic" volume 1
  • the object is also not found on volume 1, so ATS takes a write lock on that volume and issues a request to the "real origin" to fetch
  • object is fetched successfully, written to volume 1, and streamed to client
  • volume 1 cache write lock is released
  • volume 2 (ramdisk) cache write lock is never released
  • When subsequent requests for the same asset are received (read_while_writer enabled):
    • ATS looks for the object in ramdisk, since that is were the remapped origin is supposed to store content
    • Because of the write lock, read waits for any content to appear in the ramdisk cache, which will never happen
    • read retries the configured number of times, then fails and issues its own write request to origin
    • ATS fails the configured number of times to open the write lock (because the first request still has it), then just goes straight to origin
    • origin returns a 307 response to the same location as before and ATS follows it
    • ATS looks in volume 1 based on the host of the "real origin", finds the content in cache and returns a HIT
    • but by this time the client has already given up and aborted, depending on how many retries were configured and how long each waited

We tried setting proxy.config.http.redirect_use_orig_cache_key to 1, but it appears to make no difference.

We saw this when a customer upgraded from a system that had previously been using ATC 3.x which never generated empty parent.config files. If we restore the following default line to parent.config which actually has no parents in it, it appears to solve the issue:

dest_domain=. parent="" round_robin=consistent_hash go_direct=false qstring=ignore

The difference appears to be that in the case of empty parent.config or proxy routing enable set to 0, the 307 is treated via HandleCacheMiss and a new ISSUE_WRITE is performed which opens a new CacheVC on the default volume. When the default line is in parent.config, this code is executed instead:

else if (s->dns_info.lookup_name[0] <= '9' && s->dns_info.lookup_name[0] >= '0' && s->parent_params->parent_table->hostMatch &&
             !s->http_config_param->no_dns_forward_to_parent) {
    // note, broken logic: ACC fudges the OR stmt to always be true,
    // 'AuthHttpAdapter' should do the rev-dns if needed, not here .
    TRANSACT_RETURN(SM_ACTION_DNS_REVERSE_LOOKUP, HttpTransact::StartAccessControl);
  }

In this case there is a message from decideCacheLookup "will NOT do lookup" and ATS just reuses the CacheVC that was opened for the original lookup and actually writes to the ramdisk so that future requests for the object get the immediate cache hit in the ramdisk as expected. But this looks like it only worked because the redirect was to an IP address and not to another fqdn.

It seems either something isn't quite right with the redirect following or something that is supposed to free the first CacheVC got missed, leaving the write lock in place.

bdgranger avatar Dec 22 '22 21:12 bdgranger

@bdgranger does the same problem exist in 9.1?

ywkaras avatar Jan 17 '23 00:01 ywkaras

@bdgranger does the same problem exist in 9.1?

@ywkaras we will have to test this. Will get back to you. Looks like the OSDNSLookup() method has substantial changes since 8.1.x branch

bdgranger avatar Jan 17 '23 16:01 bdgranger

@bdgranger any updates?

ywkaras avatar Jun 06 '23 22:06 ywkaras

@ywkaras

Sorry it took me so long to notice this question. As of now, I have been tied up in other issues and have not been able to test with 9.x. We have worked around the issue for now by making sure that parent.config always has a default "dest_domain=. ..." line in it and there is then no problem.

bdgranger avatar Jun 19 '23 16:06 bdgranger

This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.

github-actions[bot] avatar Jun 19 '24 01:06 github-actions[bot]

#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

import functools
from typing import Any, Callable, Dict, Optional

from ports import get_port

Test.Summary = 'Tests SNI port-based routing'

TestParams = Dict[str, Any]


class TestRamdiskRedirect:
    """Configure a test for reproducing #9275."""

    replay_filepath_one: str = "ramdisk-redirect.replay.yaml"
    replay_filepath_two: str = "ramdisk-redirect-ok.replay.yaml"
    client_counter: int = 0
    server_counter: int = 0
    ts_counter: int = 0

    def __init__(self, name: str, /, autorun) -> None:
        """Initialize the test.

        :param name: The name of the test.
        """
        self.name = name
        self.autorun = autorun

    def _init_run(self) -> "TestRun":
        """Initialize processes for the test run."""

        tr = Test.AddTestRun(self.name)
        server_one = TestRamdiskRedirect.configure_server(tr, "oof.com")
        self._configure_traffic_server(tr, server_one)
        dns = TestRamdiskRedirect.configure_dns(tr, server_one, self._dns_port)

        tr.Processes.Default.StartBefore(server_one)
        tr.Processes.Default.StartBefore(dns)
        tr.Processes.Default.StartBefore(self._ts)

        return {
            "tr": tr,
            "ts": self._ts,
            "server_one": server_one,
            "port_one": self._port_one,
        }

    @classmethod
    def runner(cls, name: str, autorun: bool = True) -> Optional[Callable]:
        """Create a runner for a test case.

        :param autorun: Run the test case once it's set up. Default is True.
        """
        test = cls(name, autorun=autorun)._prepare_test_case
        return test

    def _prepare_test_case(self, func: Callable) -> Callable:
        """Set up a test case and possibly run it.

        :param func: The test case to set up.
        """
        functools.wraps(func)
        test_params = self._init_run()

        def wrapper(*args, **kwargs) -> Any:
            return func(test_params, *args, **kwargs)

        if self.autorun:
            wrapper()
        return wrapper

    @staticmethod
    def configure_server(tr: "TestRun", domain: str):
        server = tr.AddVerifierServerProcess(f"server{TestRamdiskRedirect.server_counter}.{domain}",
                                             TestRamdiskRedirect.replay_filepath_one, other_args="--format \"{url}\"")
        TestRamdiskRedirect.server_counter += 1

        return server

    @staticmethod
    def configure_dns(tr: "TestRun", server_one: "Process", dns_port: int):
        dns = tr.MakeDNServer("dns", port=dns_port, default="127.0.0.1")
        return dns


    def _configure_traffic_server(self, tr: "TestRun", server_one: "Process"):
        """Configure Traffic Server.

        :param tr: The TestRun object to associate the ts process with.
        """
        ts = tr.MakeATSProcess(f"ts-{TestRamdiskRedirect.ts_counter}", select_ports=False, enable_tls=True, enable_cache=True)
        TestRamdiskRedirect.ts_counter += 1

        self._port_one = get_port(ts, "PortOne")
        self._dns_port = get_port(ts, "DNSPort")
        ts.Disk.records_config.update(
            {
                'proxy.config.http.server_ports': f"{self._port_one}",
                'proxy.config.diags.debug.enabled': 1,
                'proxy.config.diags.debug.tags': "cache|dns|http|redirect|remap",
                'proxy.config.dns.nameservers': f"127.0.0.1:{self._dns_port}",
                'proxy.config.dns.resolv_conf': 'NULL',
                'proxy.config.http.redirect.actions': "self:follow",
                'proxy.config.http.number_of_redirections': 1,
            })

        ts.Disk.remap_config.AddLine(f"map oof.com http://oof.backend.com:{server_one.Variables.http_port}")

        ts.Disk.storage_config.AddLine("storage 128M")

        ts.Disk.hosting_config.AddLine("hostname=* volume=1")

        ts.Disk.volume_config.AddLine("volume=1 scheme=http size=100% ramcache=false")

        self._ts = ts


# Tests start.

@TestRamdiskRedirect.runner("Redirect to same origin with cache - boom!")
def test1(params: TestParams) -> None:
    # This sends a request to origin, which returns 307 temporary redirect with
    # a different path on the same origin. ATS is configured to follow the
    # redirect, so it prepares to hit the same origin with the new path in the
    # `location` header of the 307 response. Before issuing the request to
    # origin, we just need to check the cache. The lock on the cache hasn't
    # been released after the 307 response, so ATS hangs, and the connection
    # dies after a timeout.
    client = params["tr"].AddVerifierClientProcess(
        f"client0", TestRamdiskRedirect.replay_filepath_one, http_ports=[params["port_one"]], other_args="--format \"{url}\" --keys \"/a/path/resource\"")

    params["tr"].Processes.Default.ReturnCode = 0

JosiahWI avatar Jul 10 '24 18:07 JosiahWI

Reproduced on 10.0.x.

JosiahWI avatar Jul 10 '24 18:07 JosiahWI

Reproduced on latest ATS (e274dd1e076effc5ea9fc2eb51f549c6f5820448).

This is the last few lines of traffic.out, showing the state transition from the redirect to a cache lookup, followed by a 5 second hang and the connection dying. Turning off the cache (by setting enable_cache=False in the MakeATSProcess call) in the previously posted AuTest will make the redirect succeed.

[Jul 10 18:51:19.379] [ET_NET 1] DIAG: <HttpSM.cc:7782 (call_transact_and_set_next_state)> (http) [0] State Transition: SM_ACTION_REDIRECT_READ -> SM_ACTION_CACHE_LOOKUP
[Jul 10 18:51:19.379] [ET_NET 1] DIAG: <HttpSM.cc:4947 (do_cache_lookup_and_read)> (http_seq) [0] Issuing cache lookup for URL http://oof.backend.com:61000/a/path/new_resource
[Jul 10 18:51:19.379] [ET_NET 1] DIAG: <HttpCacheSM.cc:106 (state_cache_open_read)> (http_cache) [0] [&HttpCacheSM::state_cache_open_read, CACHE_EVENT_OPEN_READ_FAILED/TS_EVENT_CACHE_OPEN_READ_FAILED]
[Jul 10 18:51:24.359] [ET_NET 1] DIAG: <HttpSM.cc:2608 (main_handler)> (http) [0] VC_EVENT_EOS/TS_EVENT_VCONN_EOS, 104
[Jul 10 18:51:24.359] [ET_NET 1] DIAG: <HttpSM.cc:831 (state_watch_for_client_abort)> (http) [0] [&HttpSM::state_watch_for_client_abort, VC_EVENT_EOS/TS_EVENT_VCONN_EOS]

JosiahWI avatar Jul 10 '24 18:07 JosiahWI

Note that the above AuTest is a simplified version of the detailed steps by @bdgranger. In this setup the origin redirects to itself, and the client's first request fails. The steps described in the original report require one request by the client for ATS to grab the lock, and a second request for ATS to hang, since the redirect hits a new origin that doesn't use the same cache volume.

JosiahWI avatar Jul 10 '24 19:07 JosiahWI

The issue still persists after adding the following config to the AuTest:

        ts.Disk.parent_config.AddLine("dest_domain=. parent="" round_robin=consistent_hash go_direct=false qstring=ignore")

EDIT: Looks like this failed the test because this config wasn't valid in ATS master.

JosiahWI avatar Jul 10 '24 19:07 JosiahWI

~~In ATS master this fails on the cache read, and doesn't even make it to a cache miss state. With debug enabled, it triggers ink_asserts in the HttpCacheSM. If the explicit cache configs are removed, but the cache is enabled, then it fails taking out the write lock as described in the original report.~~

JosiahWI avatar Jul 11 '24 13:07 JosiahWI