WIP: Fix for flaky agent <-> relay connections
What type of PR is this?
/kind bug
What this PR does / Why we need it:
The initial connection to a relay can hang; this is an attempt at forcing a reconnection on timeout.
Please take a look and let me know what you think of the approach.
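To make the approach concrete, here is a rough sketch of the shape of the change; the names below (`connect_to_relay`, `connect_with_retry`, the placeholder types) are illustrative stand-ins, not the actual client code. Each connection attempt is bounded by a timeout, and on expiry we loop back and try again instead of waiting forever:

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Placeholder types standing in for the real relay stream / connection error.
struct RelayStream;
#[derive(Debug)]
struct ConnectError;

// Hypothetical stand-in for the real xDS/relay stream setup; this is the
// future that has been observed to hang on the first connection.
async fn connect_to_relay(_address: &str) -> Result<RelayStream, ConnectError> {
    Ok(RelayStream)
}

// Keep retrying the initial connection, bailing out of any attempt that hangs.
async fn connect_with_retry(address: &str, attempt_timeout: Duration) -> RelayStream {
    loop {
        match timeout(attempt_timeout, connect_to_relay(address)).await {
            Ok(Ok(stream)) => return stream,
            Ok(Err(error)) => eprintln!("connection attempt failed ({error:?}), retrying"),
            Err(_elapsed) => eprintln!("connection attempt timed out, retrying"),
        }
        // Small back-off so a dead relay isn't hammered in a tight loop.
        sleep(Duration::from_millis(500)).await;
    }
}
```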
Which issue(s) this PR fixes:
Work on #877
Special notes for your reviewer:
I've left my debug logging in, since this is a draft. I will clear it out once we're happy with the approach.
The test relay::tests::agones_token_router no longer fails: I ran it 100 times and it passed every time, whereas before it would often fail on or before the 20th iteration.
One weird thing I definitely see with this, though: if a retry of the initial connection does happen, I see the warning below a lot in the Agent logs. The integration test still passes, but I fear that's only because I set --idle-request-interval-secs to 5s (see the sketch after the log for what I think that interval is doing).
```json
{
  "insertId": "r1zix3op9jh9f7ur",
  "jsonPayload": {
    "timestamp": "2023-11-29T00:46:13.971649Z",
    "span": {
      "name": "handle_discovery_response"
    },
    "filename": "src/net/xds/client.rs",
    "target": "quilkin::net::xds::client",
    "spans": [
      {
        "name": "handle_discovery_response"
      }
    ],
    "level": "WARN",
    "fields": {
      "message": "lost connection to relay server, retrying"
    }
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "quilkin-mark-dev",
      "location": "us-west1-c",
      "pod_name": "quilkin-agones-agent-75b75d7494-28fjf",
      "container_name": "quilkin",
      "cluster_name": "agones",
      "namespace_name": "1701218738"
    }
  },
  "timestamp": "2023-11-29T00:46:13.971916187Z",
  "severity": "WARNING",
  "labels": {
    "k8s-pod/pod-template-hash": "75b75d7494",
    "compute.googleapis.com/resource_name": "gke-agones-default-ad8cd7e5-nq83",
    "k8s-pod/role": "agent"
  },
  "logName": "projects/quilkin-mark-dev/logs/stdout",
  "receiveTimestamp": "2023-11-29T00:46:15.465040550Z"
}
```
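For what it's worth, my rough mental model of that flag (I'm not certain of its exact semantics) is that the idle interval acts as a watchdog: if nothing arrives on the discovery stream within the interval, the client treats the connection as dead and re-establishes it, which would explain why the test still passes despite the repeated warnings. A minimal sketch of that shape, again with illustrative names rather than the actual quilkin client code:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Illustrative stand-ins for the real discovery stream and its messages.
struct DiscoveryResponse;
struct DiscoveryStream;

impl DiscoveryStream {
    // Stand-in for receiving the next response from the relay.
    async fn next(&mut self) -> Option<DiscoveryResponse> {
        None
    }
}

// Stand-in for re-establishing the stream (where the retry-with-timeout above lives).
async fn reconnect() -> DiscoveryStream {
    DiscoveryStream
}

// Watchdog loop: if nothing arrives within `idle_interval`, assume the
// connection is dead and reconnect. With --idle-request-interval-secs set to
// 5s this would kick in quickly, masking a hung stream.
async fn run(mut stream: DiscoveryStream, idle_interval: Duration) {
    loop {
        match timeout(idle_interval, stream.next()).await {
            Ok(Some(_response)) => { /* handle the update */ }
            Ok(None) | Err(_elapsed) => {
                eprintln!("lost connection to relay server, retrying");
                stream = reconnect().await;
            }
        }
    }
}
```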
While this fix does provide greater reliability on the Agent side (so we may want to merge it anyway), I am wondering: is it masking an issue on the relay side?
Would love your thoughts!
Huh, interesting. I can't get the test cli::tests::relay_routing to fail locally, but it definitely fails here, which makes me think this is all related.
```
thread 'cli::tests::relay_routing' panicked at src/cli.rs:410:22:
should have received a packet: Elapsed(())
stack backtrace:
0: rust_begin_unwind
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
1: core::panicking::panic_fmt
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
2: core::result::unwrap_failed
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/result.rs:1652:5
3: core::result::Result<T,E>::expect
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/result.rs:1034:23
4: quilkin::cli::tests::relay_routing::{{closure}}
at ./src/cli.rs:408:17
5: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/future/future.rs:125:9
6: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/future/future.rs:125:9
7: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:665:57
8: tokio::runtime::coop::with_budget
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
9: tokio::runtime::coop::budget
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
10: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:665:25
11: tokio::runtime::scheduler::current_thread::Context::enter
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:410:19
12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:664:36
13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:743:68
14: tokio::runtime::context::scoped::Scoped<T>::set
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/scoped.rs:40:9
15: tokio::runtime::context::set_scheduler::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context.rs:176:26
16: std::thread::local::LocalKey<T>::try_with
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/thread/local.rs:270:16
17: std::thread::local::LocalKey<T>::with
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/thread/local.rs:246:9
18: tokio::runtime::context::set_scheduler
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context.rs:176:9
19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:743:27
20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:652:19
21: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:175:28
22: tokio::runtime::context::runtime::enter_runtime
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
23: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:167:9
24: tokio::runtime::runtime::Runtime::block_on
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:348:47
25: quilkin::cli::tests::relay_routing
at ./src/cli.rs:383:9
26: quilkin::cli::tests::relay_routing::{{closure}}
at ./src/cli.rs:262:30
27: core::ops::function::FnOnce::call_once
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/ops/function.rs:250:5
28: core::ops::function::FnOnce::call_once
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2023-11-29T01:31:10.919013Z DEBUG quilkin::cli::proxy::sessions: Closing upstream socket loop
test cli::tests::relay_routing ... FAILED
```
I'm going to rerun the CI suite to see if it's a consistent failure or a flake.
Oh, interesting: when I tell GitHub to rerun the check, the notifier bot doesn't run, but this time it passes.
🤔 So this doesn't solve the issue, but I would like to hear your response on the general approach (I would also love some time to step through the agent + xds client/server code at some point to see if we can see something together).
> would also love some time to step through the agent + xds client/server code at some point to see if we can see something together
Sure, I'm a bit limited on time at the moment, but we can work on it at some point.
I have looked through the code, though, and nothing stands out to me; it feels like a lower-level bug or a timing issue, as it's not something I can replicate outside of tests. It might be fixable just by increasing the time between launching one service and the next.
Thanks, I'll keep poking at it. I'm strongly wondering whether this is what's causing #861: the theory being that the gRPC connection drops for some reason, the agent attempts to reconnect, but it falls into the same deadlock we see in these tests, and therefore communication fails.
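To make that theory a bit more concrete, this is the failure shape I have in mind (purely illustrative, not the actual agent code): if the reconnect path awaits the new connection with no upper bound, a single hung attempt stalls the whole update loop, which would look exactly like the silent communication failure described in #861.

```rust
// Purely illustrative sketch of the hypothesised failure mode.
struct RelayStream;

// Stand-in for the gRPC stream setup that appears to be able to hang forever.
async fn connect_to_relay() -> RelayStream {
    // If this future never resolves, everything below it is stuck too.
    std::future::pending().await
}

async fn run_agent() {
    loop {
        // No timeout here: a hung connect means no further retries ever run,
        // so the agent silently stops receiving updates from the relay.
        let _stream = connect_to_relay().await;
        // ... consume the stream until it drops, then loop and reconnect ...
    }
}
```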
I'm also tempted to try pulling in changes from #836, since you said it fixed things, but it looks like you closed that PR and gave up on that work. Do you think it's worth attempting to pull parts of it in, to see if it solves these problems?
Well I closed that PR because nearly everything has been cherry-picked into main now or we've implemented better solutions than the hacks I added.
Nothing in that PR really touched the xDS communication, which is what made it so strange that when I had that issue I couldn't replicate it with that PR.
> Well I closed that PR because nearly everything has been cherry-picked into main now or we've implemented better solutions than the hacks I added.
Makes sense.
> Nothing in that PR really touched the xDS communication, which is what made it so strange that when I had that issue I couldn't replicate it with that PR.
It did touch the CLI code, so I was wondering if there was something better in there 🤔. Okay, if I get really stuck I might do a diff between main and that branch to see if anything stands out.
Build Failed :sob:
Build Id: c63a6e75-b693-450c-8ba9-6c1b7a34c0de
Status: FAILURE
To get permission to view the Cloud Build view, join the quilkin-discuss Google Group.
Filter with the Git Commit be4bbeb within us-docker.pkg.dev/quilkin/ci//quilkin to see if a development image is available from this build that can be used for debugging purposes.
Development images are retained for at least 30 days.
Going to close this since it's out of date.
Spent days on this, still not sure what it is.