
Consul service mesh containers failing in Nomad due to token persistence after upgrade

Open · lopcode opened this issue 1 year ago · 1 comment

Overview of the Issue

Hi there,

I just spent a couple of hours debugging an outage on my personal cluster after a cluster upgrade, and the fix was relatively simple, so I wanted to raise an issue in case others are hitting the same problem.

I upgraded my personal cluster from Consul 1.20.0 -> 1.20.2, and Nomad 1.9.4 -> 1.9.5. Part of that process involves rebooting the servers, and it has worked fine for a few years. This time, after doing so, all Consul service mesh / Envoy containers started failing, and I was struggling to figure out why. The logs all looked like this:

[2025-02-01 15:30:17.148][1][info][admin] [source/server/admin/admin.cc:65] admin address: 127.0.0.2:19001
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:168] loading tracing configuration
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:124] loading 0 static secret(s)
[2025-02-01 15:30:17.149][1][info][config] [source/server/configuration_impl.cc:130] loading 1 cluster(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:138] loading 0 listener(s)
[2025-02-01 15:30:17.211][1][info][config] [source/server/configuration_impl.cc:154] loading stats configuration
[2025-02-01 15:30:17.211][1][info][runtime] [source/common/runtime/runtime_impl.cc:625] RTDS has finished initialization
[2025-02-01 15:30:17.211][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:245] cm init: initializing cds
[2025-02-01 15:30:17.212][1][warning][main] [source/server/server.cc:936] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.
[2025-02-01 15:30:17.212][1][info][main] [source/server/server.cc:978] starting main dispatch loop
[2025-02-01 15:30:17.221][1][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 2 cluster(s), remove 0 cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 2 cluster(s), skipped 0 unmodified cluster(s)
[2025-02-01 15:30:17.366][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:223] cm init: initializing secondary clusters
[2025-02-01 15:30:17.368][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:249] cm init: all clusters initialized
[2025-02-01 15:30:17.368][1][info][main] [source/server/server.cc:958] all clusters initialized. initializing init manager
[2025-02-01 15:30:17.375][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'public_listener:0.0.0.0:31944'
[2025-02-01 15:30:17.376][1][info][upstream] [source/common/listener_manager/lds_api.cc:106] lds: add/update listener 'shared-redis:127.0.0.1:6379'
[2025-02-01 15:30:17.376][1][info][config] [source/common/listener_manager/listener_manager_impl.cc:930] all dependencies initialized. starting workers
[2025-02-01 15:31:22.570][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:176] DeltaAggregatedResources gRPC config stream to local_agent closed: 13, 
[2025-02-01 15:31:22.608][1][warning][main] [source/server/server.cc:907] caught ENVOY_SIGTERM
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:1046] shutting down server instance
[2025-02-01 15:31:22.608][1][info][main] [source/server/server.cc:986] main dispatch loop exited
[2025-02-01 15:31:22.613][1][info][main] [source/server/server.cc:1038] exiting

So startup looked OK: "starting workers", then a pause of about a minute (per the timestamps above), and then "gRPC config stream to local_agent closed: 13" - gRPC status 13 is INTERNAL, as opposed to 14, UNAVAILABLE. This persisted across multiple reboots, a rollback, and all attempts on all jobs over the space of an hour or so.

After some searching, I found that some folks had had success with (seemingly) unrelated issues by setting acl.enable_token_persistence to false, and that fixed my issue too: after setting it and restarting the agents, the sidecar workloads immediately started working again.
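For anyone hitting the same thing, here's a minimal sketch of the change as agent configuration in HCL; the acl stanza and the enable_token_persistence option are the real Consul ones, everything else is a placeholder:

# Consul agent configuration (HCL) - only the acl stanza is load-bearing here.
acl {
  enabled = true

  # Stop the agent from persisting API-set tokens to disk across restarts.
  enable_token_persistence = false
}

I then restarted the Consul agents for the change to take effect.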

My guess is that somehow a token got broken/corrupted/wiped out after a server reboot (which is odd in itself). I'm quite certain that changing this flag is what fixed the problem (but there's always a small chance that the Nth restart fixed the underlying problem).
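For context, and to be clear this is my assumption about the mechanism rather than something I've verified: with enable_token_persistence = true, the agent persists tokens set via the API to a JSON file in its data directory and reloads them on restart, so a stale entry there would survive reboots. Something like the following, where the data_dir value is hypothetical:

# Hypothetical path, assuming data_dir = "/opt/consul"
/opt/consul/acl-tokens.json

If that file exists on an affected client, inspecting or removing it (with the agent stopped) might confirm whether a stale persisted token is the culprit.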

My suggestion is that, if you think this might be an issue worth documenting, it could be added to https://support.hashicorp.com/hc/en-us/articles/5295078989075-Resolving-Common-Errors-in-Envoy-Proxy-A-Troubleshooting-Guide - there's already an example for "error 14", but that's slightly different to my problem.

I have no idea how I would reproduce this, or what information might be helpful, so let me know if I can provide anything else. Some basic info:

  • Consul versions: 1.20.0 -> 1.20.2
  • Nomad versions: 1.9.4 -> 1.9.5
  • Using TLS (except with gRPC verify_incoming set to false - see the sketch below)
  • Using workload identity / ACLs
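Here's a sketch of the relevant part of my TLS setup, assuming the Consul 1.12+ tls stanza; the certificate paths are placeholders and the only detail that matters is verify_incoming being disabled for gRPC:

# Consul agent TLS configuration (HCL); file paths are placeholders.
tls {
  defaults {
    verify_incoming = true
    verify_outgoing = true
    ca_file         = "/etc/consul.d/certs/ca.pem"
    cert_file       = "/etc/consul.d/certs/agent.pem"
    key_file        = "/etc/consul.d/certs/agent-key.pem"
  }

  grpc {
    # Envoy sidecars talk to the agent's local xDS port without client certs.
    verify_incoming = false
  }
}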

lopcode commented on Feb 1, 2025

This is the issue that helped me - interestingly, it includes the addition of a preflight check that I think was supposed to fix (or at least detect) something related: https://github.com/hashicorp/nomad/issues/20516

I'm also not sure whether this issue should live in the Nomad repo or this one; since I changed a Consul config option to fix things, I put it here.

lopcode commented on Feb 1, 2025