WIP: Fix for flaky agent <-> relay connections

Open markmandel opened this issue 2 years ago • 7 comments

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking

/kind bug

/kind cleanup /kind documentation /kind feature /kind hotfix

What this PR does / Why we need it:

Initial connection to a relay can get hung, this is an attempt at forcing a reconnection attempt on timeout.

Please take a look and let me know what you think of the approach.
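The idea can be sketched roughly as follows. This is a minimal, synchronous illustration with hypothetical names (the real client in src/net/xds/client.rs is async/tokio-based and uses a gRPC stream); the point is just that a hung initial connection is bounded by a timeout and treated as a failed attempt:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the real relay connection attempt;
// here it simulates a handshake that hangs indefinitely.
fn spawn_connect_attempt() -> mpsc::Receiver<Result<(), String>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Simulate a connection that never completes.
        thread::sleep(Duration::from_secs(3600));
        let _ = tx.send(Ok(()));
    });
    rx
}

/// Treat a timeout as a failed attempt and retry, instead of
/// blocking forever on a hung initial connection.
fn connect_with_retry(attempts: u32, per_attempt: Duration) -> Result<(), String> {
    for attempt in 1..=attempts {
        match spawn_connect_attempt().recv_timeout(per_attempt) {
            Ok(result) => return result,
            Err(_) => eprintln!("attempt {attempt} timed out, retrying"),
        }
    }
    Err("all connection attempts timed out".into())
}

fn main() {
    let outcome = connect_with_retry(3, Duration::from_millis(20));
    assert!(outcome.is_err());
    println!("{outcome:?}");
}
```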

Which issue(s) this PR fixes:

Work on #877

Special notes for your reviewer:

Left in my debugging logging, since this is draft. Will clear out when we're happy with the approach.

Running the test relay::tests::agones_token_router no longer fails. I ran it 100 times and it passed every time, whereas previously it would often fail by the 20th iteration.

One weird thing I definitely see with this though: if a retry of the initial connection does happen, I see this a lot in the Agent logs. The integration test still passes, but I fear that's only because I set the --idle-request-interval-secs to 5s.

{
  "insertId": "r1zix3op9jh9f7ur",
  "jsonPayload": {
    "timestamp": "2023-11-29T00:46:13.971649Z",
    "span": {
      "name": "handle_discovery_response"
    },
    "filename": "src/net/xds/client.rs",
    "target": "quilkin::net::xds::client",
    "spans": [
      {
        "name": "handle_discovery_response"
      }
    ],
    "level": "WARN",
    "fields": {
      "message": "lost connection to relay server, retrying"
    }
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "quilkin-mark-dev",
      "location": "us-west1-c",
      "pod_name": "quilkin-agones-agent-75b75d7494-28fjf",
      "container_name": "quilkin",
      "cluster_name": "agones",
      "namespace_name": "1701218738"
    }
  },
  "timestamp": "2023-11-29T00:46:13.971916187Z",
  "severity": "WARNING",
  "labels": {
    "k8s-pod/pod-template-hash": "75b75d7494",
    "compute.googleapis.com/resource_name": "gke-agones-default-ad8cd7e5-nq83",
    "k8s-pod/role": "agent"
  },
  "logName": "projects/quilkin-mark-dev/logs/stdout",
  "receiveTimestamp": "2023-11-29T00:46:15.465040550Z"
}

While this fix does provide greater reliability on the Agent side (so we may want to merge it anyway), I am wondering if it's masking an issue on the relay side?

Would love your thoughts!

markmandel avatar Nov 29 '23 01:11 markmandel

Huh, interesting: I can't get the test cli::tests::relay_routing to fail locally, but it definitely fails here, which makes me think this is all related.

thread 'cli::tests::relay_routing' panicked at src/cli.rs:410:22:
should have received a packet: Elapsed(())
stack backtrace:
   0: rust_begin_unwind
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/result.rs:1652:5
   3: core::result::Result<T,E>::expect
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/result.rs:1034:23
   4: quilkin::cli::tests::relay_routing::{{closure}}
             at ./src/cli.rs:408:17
   5: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/future/future.rs:125:9
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/future/future.rs:125:9
   7: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:665:57
   8: tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
   9: tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
  10: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:665:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:410:19
  12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:664:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:743:68
  14: tokio::runtime::context::scoped::Scoped<T>::set
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/scoped.rs:40:9
  15: tokio::runtime::context::set_scheduler::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context.rs:176:26
  16: std::thread::local::LocalKey<T>::try_with
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/thread/local.rs:270:16
  17: std::thread::local::LocalKey<T>::with
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/thread/local.rs:246:9
  18: tokio::runtime::context::set_scheduler
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context.rs:176:9
  19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:743:27
  20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:652:19
  21: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:175:28
  22: tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  23: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:167:9
  24: tokio::runtime::runtime::Runtime::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:348:47
  25: quilkin::cli::tests::relay_routing
             at ./src/cli.rs:383:9
  26: quilkin::cli::tests::relay_routing::{{closure}}
             at ./src/cli.rs:262:30
  27: core::ops::function::FnOnce::call_once
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/ops/function.rs:250:5
  28: core::ops::function::FnOnce::call_once
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2023-11-29T01:31:10.919013Z DEBUG quilkin::cli::proxy::sessions: Closing upstream socket loop
test cli::tests::relay_routing ... FAILED

I'm going to rerun the CI suite to see if it's a consistent failure or a flake.

markmandel avatar Nov 29 '23 16:11 markmandel

Oh interesting - on telling GitHub to rerun the check, the notifier bot doesn't run. But this time it passes.

🤔 So this doesn't solve the issue, but I would like your thoughts on the general approach (would also love some time to step through the agent + xds client/server code at some point to see if we can see something together).

markmandel avatar Nov 29 '23 17:11 markmandel

(would also love some time to step through the agent + xds client/server code at some point to see if we can see something together.

Sure, I'm a bit limited on time at the moment, but we can work on it at some point.

I have looked through the code, though, and nothing stands out to me; it feels like a lower-level bug or a timing issue, as it's not something I can replicate outside of tests. It might be fixable by just increasing the time between launching one service and then another.
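Rather than a fixed sleep between launching services, the tests could poll for readiness with a deadline. A minimal sketch (the is_ready probe is hypothetical; in the real tests it might hit the relay's health or admin endpoint):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical readiness probe; here it just reports ready after
// 30ms, standing in for a real health-check request.
fn is_ready(started: Instant) -> bool {
    started.elapsed() > Duration::from_millis(30)
}

/// Poll a readiness check until a deadline, rather than sleeping a
/// fixed amount between launching one service and the next.
fn wait_until_ready(deadline: Duration) -> bool {
    let started = Instant::now();
    while started.elapsed() < deadline {
        if is_ready(started) {
            return true;
        }
        thread::sleep(Duration::from_millis(5));
    }
    false
}

fn main() {
    assert!(wait_until_ready(Duration::from_secs(1)));
    println!("relay ready, launching agent");
}
```

This keeps the happy path fast (it returns as soon as the service is up) while still failing deterministically if the service never comes up within the deadline.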

XAMPPRocky avatar Nov 29 '23 18:11 XAMPPRocky

Thanks, I'll keep poking at it. I am strongly considering if it's what is causing #861 - theory being, the gRPC connection drops for some reason, the agent attempts a retry of the connection, but falls into the same deadlock we see in these tests, and therefore communication fails.

I'm tempted to also try pulling in changes from #836 as you said it fixes things -- but looks like you closed that PR and gave up that work. Do you think it's worth attempting to pull parts of that in, to see if it solves problems?

markmandel avatar Nov 29 '23 19:11 markmandel

Well I closed that PR because nearly everything has been cherry-picked into main now or we've implemented better solutions than the hacks I added.

Nothing in that PR really touched the xDS communication, which is what made it so strange that when I had that issue I couldn't replicate it with that PR.

XAMPPRocky avatar Nov 29 '23 19:11 XAMPPRocky

Well I closed that PR because nearly everything has been cherry-picked into main now or we've implemented better solutions than the hacks I added.

Makes sense.

Nothing in that PR really touched the xDS communication, which is what made it so strange that when I had that issue I couldn't replicate it with that PR.

It did touch the CLI code, so I was wondering if there was something better in there 🤔 Okay, if I get really stuck I might do a diff between main and that branch to see if anything stands out.

markmandel avatar Nov 29 '23 19:11 markmandel

Build Failed :sob:

Build Id: c63a6e75-b693-450c-8ba9-6c1b7a34c0de

Status: FAILURE

To get permission to view the Cloud Build view, join the quilkin-discuss Google Group.

Filter with the Git Commit be4bbeb within us-docker.pkg.dev/quilkin/ci//quilkin to see if a development image is available from this build, and can be used for debugging purposes.

Development images are retained for at least 30 days.

quilkin-bot avatar Jan 04 '24 01:01 quilkin-bot

Going to close this since it's out of date.

Spent days on this, still not sure what it is.

markmandel avatar Apr 23 '24 17:04 markmandel