
nomad does not register HTTP tag for server in Consul

BrianHicks opened this issue on Jun 19, 2024 · 3 comments

Nomad version

Nomad v1.8.0

Operating system and Environment details

NixOS 24.05 running on Hetzner cloud VMs.

Issue

When advertise.http is set, Nomad does not register an http tag with Consul, although the rpc and serf tags are registered.

(This is blocking me from scraping job metrics with Prometheus.)
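For context on why the tag matters: a Prometheus Consul service-discovery config typically filters on the http tag to find the agent's metrics endpoint. A minimal sketch of such a scrape config (the /v1/metrics path and format=prometheus parameter are Nomad's documented metrics endpoint; the tag filter and TLS details here are illustrative assumptions):

scrape_configs:
  - job_name: nomad
    scheme: https
    metrics_path: /v1/metrics
    params:
      format: [prometheus]
    consul_sd_configs:
      - server: 127.0.0.1:8501
        scheme: https
        services: [nomad]
        # Only targets carrying the "http" tag are discovered;
        # this is the tag that goes missing per this report.
        tags: [http]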

Reproduction steps

Run Nomad using this config:

{
  "acl": {
    "enabled": true
  },
  "advertise": {
    "http": "{{ GetInterfaceIP \"enp7s0\" }}",
    "rpc": "{{ GetInterfaceIP \"enp7s0\" }}",
    "serf": "{{ GetInterfaceIP \"enp7s0\" }}"
  },
  "consul": {
    "address": "127.0.0.1:8501",
    "ssl": true
  },
  "data_dir": "/var/lib/nomad",
  "datacenter": "us-east",
  "log_level": "TRACE",
  "ports": {
    "http": 4646,
    "rpc": 4647,
    "serf": 4648
  },
  "server": {
    "bootstrap_expect": 1,
    "enabled": true
  },
  "telemetry": {
    "collection_interval": "1s",
    "disable_hostname": true,
    "prometheus_metrics": true,
    "publish_allocation_metrics": true,
    "publish_node_metrics": true
  },
  "tls": {
    "ca_file": "[SNIP]",
    "cert_file": "[SNIP]",
    "http": true,
    "key_file": "[SNIP]",
    "rpc": true,
    "verify_https_client": false,
    "verify_server_hostname": true
  },
  "ui": {
    "enabled": true
  }
}

(Plus a separate config file, not shared here, that sets consul.token.)
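For reference, an equivalent manual invocation would be roughly the following; the config paths are taken from the startup logs below, and the exact flags are an assumption since the agent is actually managed by systemd on NixOS:

nomad agent -config=/etc/nomad.json -config=/etc/nomad.d/consul-token.json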

Expected Result

Nomad registers a nomad service with http, rpc, and serf tags.

Actual Result

Nomad only registers rpc and serf tags.
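The missing tag can be confirmed directly against the Consul catalog; for example (this is the standard Consul HTTP API; the CA file path is a placeholder):

curl -s --cacert ca.pem https://127.0.0.1:8501/v1/catalog/service/nomad | jq '.[].ServiceTags'
# per this report: ["rpc","serf"] instead of the expected ["http","rpc","serf"]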

Nomad Server logs (if appropriate)

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.json, /etc/nomad.d/consul-token.json
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.1.0:4646; RPC: 10.0.1.0:4647; Serf: 10.0.1.0:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: false
             Log Level: INFO
               Node Id: 60100119-2101-5fe3-1fc7-887d6a5dab36
                Region: global (DC: us-east)
                Server: true
               Version: 1.8.0
==> Nomad agent started! Log data will stream in below:
    2024-06-19T05:42:49.734Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-06-19T05:42:49.736Z [INFO]  nomad.raft: starting restore from snapshot: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159
    2024-06-19T05:42:49.760Z [INFO]  nomad.raft: snapshot restore progress: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159 read-bytes=298159 percent-complete="100.00%"
    2024-06-19T05:42:49.760Z [INFO]  nomad.raft: restored from snapshot: id=15-23927-1718755211021 last-index=23927 last-term=15 size-in-bytes=298159
    2024-06-19T05:42:49.770Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:35e115c2-34da-f3ba-8579-e8e122ba3dfd Address:10.0.1.0:4647}]"
    2024-06-19T05:42:49.771Z [INFO]  nomad: serf: EventMemberJoin: leader-red.global 10.0.1.0
    2024-06-19T05:42:49.771Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-19T05:42:49.771Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-19T05:42:49.773Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.1.0:4647 [Follower]" leader-address= leader-id=
    2024-06-19T05:42:49.774Z [WARN]  nomad: serf: Failed to re-join any previously known node
    2024-06-19T05:42:49.774Z [INFO]  nomad: adding server: server="leader-red.global (Addr: 10.0.1.0:4647) (DC: us-east)"
    2024-06-19T05:42:51.014Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-06-19T05:42:51.014Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.1.0:4647 [Candidate]" term=32
    2024-06-19T05:42:51.016Z [INFO]  nomad.raft: election won: term=32 tally=1
    2024-06-19T05:42:51.016Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.1.0:4647 [Leader]"
    2024-06-19T05:42:51.016Z [INFO]  nomad: cluster leadership acquired
    2024-06-19T05:42:51.055Z [INFO]  nomad: eval broker status modified: paused=false
    2024-06-19T05:42:51.055Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-06-19T05:42:51.055Z [INFO]  nomad: revoking consul accessors after becoming leader: accessors=14

BrianHicks · Jun 19, 2024

Hi @BrianHicks! I wasn't able to reproduce what you're seeing on either 1.8.0 or the current tip of main. I also played around with HCL vs JSON configuration and wasn't able to see a difference there either. The weird thing about this is that we create and register those services all at the same time: agent.go#L961-L1009

If you run the server with log_level = "debug", you should see a message during startup about syncing to Consul, like the one below. What does that look like on your end?

2024-06-21T15:47:51.388-0400 [DEBUG] consul.sync: sync complete: registered_services=3 deregistered_services=0 registered_checks=3 deregistered_checks=0
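If restarting the server is inconvenient, nomad monitor can also stream logs from a running agent at a chosen verbosity:

nomad monitor -log-level=DEBUG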

Also, if you run the following command against one of the servers, what does the response body look like?

nomad operator api '/v1/agent/self' | jq '.config.Consuls'

tgross · Jun 21, 2024

How interesting! I don't see any such message when running in debug; here's the output:

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /etc/nomad.json, /run/agenix/nomad-consul-token.json
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: 10.0.1.0:4646; RPC: 10.0.1.0:4647; Serf: 10.0.1.0:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: false
             Log Level: DEBUG
               Node Id: 2f248988-a9b9-265f-6f60-ff48eeb337d7
                Region: global (DC: us-east)
                Server: true
               Version: 1.8.0
==> Nomad agent started! Log data will stream in below:
    2024-06-22T00:25:24.879Z [DEBUG] nomad: issuer not set; OIDC Discovery endpoint for workload identities disabled
    2024-06-22T00:25:24.884Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-06-22T00:25:24.886Z [INFO]  nomad.raft: starting restore from snapshot: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574
    2024-06-22T00:25:24.910Z [INFO]  nomad.raft: snapshot restore progress: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574 read-bytes=276574 percent-complete="100.00%"
    2024-06-22T00:25:24.911Z [INFO]  nomad.raft: restored from snapshot: id=35-28895-1719014431549 last-index=28895 last-term=35 size-in-bytes=276574
    2024-06-22T00:25:24.911Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:35e115c2-34da-f3ba-8579-e8e122ba3dfd Address:10.0.1.0:4647}]"
    2024-06-22T00:25:24.912Z [INFO]  nomad: serf: EventMemberJoin: leader-red.global 10.0.1.0
    2024-06-22T00:25:24.912Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-22T00:25:24.912Z [DEBUG] nomad: started scheduling worker: id=4b6681a1-493b-d115-a9df-076f99145c65 index=1 of=2
    2024-06-22T00:25:24.912Z [DEBUG] nomad: started scheduling worker: id=22ce07f8-09a0-8b34-913b-af8585da9491 index=2 of=2
    2024-06-22T00:25:24.912Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["service", "batch", "system", "sysbatch", "_core"]
    2024-06-22T00:25:24.912Z [DEBUG] http: UI is enabled
    2024-06-22T00:25:24.913Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.1.0:4647 [Follower]" leader-address= leader-id=
    2024-06-22T00:25:24.914Z [WARN]  nomad: serf: Failed to re-join any previously known node
    2024-06-22T00:25:24.914Z [DEBUG] worker: running: worker_id=4b6681a1-493b-d115-a9df-076f99145c65
    2024-06-22T00:25:24.914Z [DEBUG] worker: running: worker_id=22ce07f8-09a0-8b34-913b-af8585da9491
    2024-06-22T00:25:24.914Z [INFO]  nomad: adding server: server="leader-red.global (Addr: 10.0.1.0:4647) (DC: us-east)"
    2024-06-22T00:25:24.914Z [DEBUG] nomad.keyring.replicator: starting encryption key replication
    2024-06-22T00:25:26.507Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
    2024-06-22T00:25:26.507Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.1.0:4647 [Candidate]" term=36
    2024-06-22T00:25:26.508Z [DEBUG] nomad.raft: voting for self: term=36 id=35e115c2-34da-f3ba-8579-e8e122ba3dfd
    2024-06-22T00:25:26.510Z [DEBUG] nomad.raft: calculated votes needed: needed=1 term=36
    2024-06-22T00:25:26.510Z [DEBUG] nomad.raft: vote granted: from=35e115c2-34da-f3ba-8579-e8e122ba3dfd term=36 tally=1
    2024-06-22T00:25:26.510Z [INFO]  nomad.raft: election won: term=36 tally=1
    2024-06-22T00:25:26.510Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.1.0:4647 [Leader]"
    2024-06-22T00:25:26.510Z [INFO]  nomad: cluster leadership acquired
    2024-06-22T00:25:26.518Z [INFO]  nomad: eval broker status modified: paused=false
    2024-06-22T00:25:26.518Z [INFO]  nomad: blocked evals status modified: paused=false
    2024-06-22T00:25:26.518Z [DEBUG] nomad.autopilot: autopilot is now running
    2024-06-22T00:25:26.518Z [DEBUG] nomad.autopilot: state update routine is now running
    2024-06-22T00:25:26.518Z [INFO]  nomad: revoking consul accessors after becoming leader: accessors=14

And here's the output of the command:

[
  {
    "Addr": "127.0.0.1:8501",
    "AllowUnauthenticated": true,
    "Auth": "",
    "AutoAdvertise": true,
    "CAFile": "",
    "CertFile": "",
    "ChecksUseAdvertise": false,
    "ClientAutoJoin": true,
    "ClientFailuresBeforeCritical": 0,
    "ClientFailuresBeforeWarning": 0,
    "ClientHTTPCheckName": "Nomad Client HTTP Check",
    "ClientServiceName": "nomad-client",
    "EnableSSL": true,
    "GRPCAddr": "",
    "GRPCCAFile": "",
    "KeyFile": "",
    "Name": "default",
    "Namespace": "",
    "ServerAutoJoin": true,
    "ServerFailuresBeforeCritical": 0,
    "ServerFailuresBeforeWarning": 0,
    "ServerHTTPCheckName": "Nomad Server HTTP Check",
    "ServerRPCCheckName": "Nomad Server RPC Check",
    "ServerSerfCheckName": "Nomad Server Serf Check",
    "ServerServiceName": "nomad",
    "ServiceIdentity": null,
    "ServiceIdentityAuthMethod": "nomad-workloads",
    "ShareSSL": null,
    "Tags": null,
    "TaskIdentity": null,
    "TaskIdentityAuthMethod": "nomad-workloads",
    "Timeout": 5000000000,
    "Token": "<redacted>",
    "VerifySSL": true
  }
]

BrianHicks · Jun 22, 2024

Thanks @BrianHicks. I see that your Consul configuration doesn't have a CAFile or CertFile, but you're connecting to Consul on port 8501, which is Consul's default HTTPS port. Is there any chance it's just the wrong port, and Nomad can't reach Consul at all?

I wouldn't expect to see any tags in that case, of course, but maybe the local agent has a cached version floating around from an earlier config?
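A quick way to test that hypothesis is Consul's status endpoint on both ports (the CA file path is a placeholder):

curl -s --cacert ca.pem https://127.0.0.1:8501/v1/status/leader   # HTTPS port from the config above
curl -s http://127.0.0.1:8500/v1/status/leader                    # Consul's default plain-HTTP port, for comparison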

tgross · Jun 27, 2024

We didn't hear back on this, so I'm going to close it out for now. If you have more information, we'll be happy to reopen.

tgross · Oct 23, 2024

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] · Feb 21, 2025