Logs through Nomad UI show certificate issues (log-viewing itself works)

Open dmclf opened this issue 1 year ago • 2 comments

Nomad version

Nomad v1.9.0
BuildDate 2024-10-10T07:13:43Z
Revision 7ad36851ec02f875e0814775ecf1df0229f0a615

and
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e5cbaca312cf6ea63e040f445f05f00478

(but may not be limited to these)

Operating system and Environment details

Ubuntu 22.04.5 LTS

Nomad with

  • ACL enabled
  • TLS enabled on http + rpc
  • vault enabled
  • dev is single region
  • staging and production are federated multi-region.
  • VIP providing a single entry point to each environment's Nomad UI (and to the deployed Nomad jobs/services)
    • Consul is used to generate config and Traefik then routes accordingly; together with Let's Encrypt this gives a smooth TLS experience in which developers, with some CI/CD templating, can easily deploy applications independently.
    • i.e., with a DNS wildcard *.nomad-development.company.com you can easily deploy https://application.nomad-development.company.com (not fit for high traffic/volume, but for low-traffic apps this works fine)

Issue

when checking the Nomad UI to look at container logs, errors are reported due to certificate issues.

  • my UI sits behind a Traefik load balancer with SSL and Let's Encrypt certs at, for example, https://nomad-development.company.com
  • however, when requesting logs, the requests for some reason go to: https://10.xx.yy.zz:4646/v1/client/fs/logs/8fb55989-139c-5256-f812-d79353993c6c?follow=true&offset=50000&origin=end&task=athena-cleaner&type=stdout
  • the Nomad servers use a private CA, as per Nomad's recommendation: "This should be a private CA and not a public one like Let's Encrypt as any certificate signed by this CA will be allowed to communicate with the cluster"

as a result, these certificate errors show up on end-user devices.

note: log viewing still works, as apparently that call then falls back to the SSL Traefik endpoint, for example, https://nomad-development.company.com/v1/client/fs/logs/c519c888-6c46-6d8e-2f0c-f5a17be8afc7?follow=true&offset=50000&origin=end&task=google-cadvisor&type=stderr
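
The observed behaviour (a failed direct request followed by a working request through Traefik) looks like a "try direct, then fall back" pattern. The sketch below only illustrates that pattern and is not taken from the Nomad UI source; `first_reachable`, `fake_fetch`, and the `10.0.0.5` address are hypothetical stand-ins (the real client address is redacted above).

```python
import ssl
from typing import Callable, Iterable, Optional


def first_reachable(urls: Iterable[str], fetch: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL for which fetch() succeeds, trying them in order."""
    for url in urls:
        try:
            if fetch(url):
                return url
        except OSError:
            # ssl.SSLError (including certificate verification failures)
            # is a subclass of OSError, so a bad certificate lands here.
            continue
    return None


# Hypothetical stand-in for the private client address from the issue.
DIRECT = "https://10.0.0.5:4646/v1/client/fs/logs/example"
PROXIED = "https://nomad-development.company.com/v1/client/fs/logs/example"


def fake_fetch(url: str) -> bool:
    """Simulate a browser that does not trust the private CA."""
    if url.startswith("https://10."):
        raise ssl.SSLCertVerificationError("certificate signed by unknown authority")
    return True


chosen = first_reachable([DIRECT, PROXIED], fake_fetch)
print(chosen)
```

In this model the logs still load through the proxied URL, but the failed first attempt is what surfaces as a certificate error in the browser console.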

this flow works

  graph TD;
      A[Browser]-- OK -->B[https:nomad-development.company.com = VIP+Traefik+LetsEncrypt];
      B-- OK -->E[Nomad-Server-1:4646 -- Private CA];
      B-- OK -->F[Nomad-Server-2:4646 -- Private CA];
      B-- OK -->G[Nomad-Server-3:4646 -- Private CA];

but the UI seems to make this 'direct connection' for the erroneous calls, and that fails.

  graph TD;
      A[Browser];
      A-- FAIL -->E[https.Nomad-Server-1:4646 -- Private CA];
      A-- FAIL -->F[https.Nomad-Server-2:4646 -- Private CA];
      A-- FAIL -->G[https.Nomad-Server-3:4646 -- Private CA];
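
To tell the two flows apart when reading the browser's network tab, a request URL can be compared against the UI origin. This is a minimal sketch, assuming the UI origin from this issue; `classify_request` and the `10.0.0.5` address are hypothetical and not part of Nomad.

```python
import ipaddress
from urllib.parse import urlsplit

UI_ORIGIN = "https://nomad-development.company.com"  # UI entry point from the issue
DEFAULT_PORTS = {"http": 80, "https": 443}


def _origin(url: str):
    """Return (scheme, host, port) with the scheme's default port filled in."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port


def classify_request(url: str, ui_origin: str = UI_ORIGIN) -> str:
    """Label a request as going through the proxy or directly to an IP."""
    if _origin(url) == _origin(ui_origin):
        return "proxied"
    host = urlsplit(url).hostname or ""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return "other-host"  # a hostname that is not the UI origin
    return "direct-private" if addr.is_private else "direct-public"


# 10.0.0.5 is a hypothetical stand-in for the redacted client address.
print(classify_request("https://10.0.0.5:4646/v1/client/fs/logs/example?follow=true"))
print(classify_request("https://nomad-development.company.com/v1/client/fs/logs/example"))
```

Requests labelled `direct-private` correspond to the failing graph: the browser talks to a certificate signed by the private CA, which end-user devices do not trust.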

Reproduction steps

  • not specific, just noticed as
  • also happens on all environments,
    • nomad-development.company.com (v1.9.0 on servers, v1.8.3 on clients)
    • nomad-staging.company.com (v1.8.3)
    • nomad-production.company.com (v1.8.3)

Expected Result

  • no certificate errors

Actual Result

GET https://10.x.y.z:4646/v1/client/fs/logs/c519c888-6c46-6d8e-2f0c-f5a17be8afc7?follow=true&offset=50000&origin=end&task=google-cadvisor&type=stderr net::ERR_CERT_AUTHORITY_INVALID

I am not sure whether some extra config is needed in this case, or whether such an option is simply not available yet?

like, consul and vault have some ui_url that can be set:

  consul {
    ui_url = "https://consul.nomad-development.company.com/ui"
  }

  vault {
    ui_url = "https://vault.nomad-development.company.com/ui"
  }

perhaps such a property also is needed in my case, where effectively, nomad-ui sits behind a proxy?
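
To make the idea concrete: a setting like that would amount to rebasing client log URLs onto the public UI origin while keeping the path and query intact. The sketch below is purely illustrative; no such Nomad option is confirmed in this thread, and `rebase_onto_ui_origin` and the `10.0.0.5` address are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def rebase_onto_ui_origin(url: str, ui_base: str) -> str:
    """Swap scheme and host:port for those of ui_base; keep path, query, fragment."""
    target = urlsplit(url)
    base = urlsplit(ui_base)
    return urlunsplit((base.scheme, base.netloc, target.path, target.query, target.fragment))


# 10.0.0.5 is a hypothetical stand-in for the redacted client address.
direct = "https://10.0.0.5:4646/v1/client/fs/logs/example?follow=true&type=stderr"
rebased = rebase_onto_ui_origin(direct, "https://nomad-development.company.com")
print(rebased)
```

The rebased URL has the same shape as the request that already works through Traefik, which is why a configurable base URL would avoid the failed direct attempt.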

(or some other config I may have overlooked? response rewriting is not exactly the direction I would prefer)

dmclf avatar Oct 22 '24 06:10 dmclf

Hi @dmclf, thanks for raising this issue. There is no further agent config for this, the way there is for consul/vault ui_urls. I suspect you've already given https://developer.hashicorp.com/nomad/tutorials/manage-clusters/reverse-proxy-ui a look since you've arrived at a nice environment behind Traefik, but that guide doesn't have any certificate-specific advice anyway.

This is to say: first time I've heard of this particular issue, but not the first time I've seen issues raised around the proxied UI (for example). This could use some further investigation and I will try to set some time to dig in soon.

philrenaud avatar Oct 22 '24 13:10 philrenaud

hi @philrenaud , I guess example issue 6413 sounds a bit like my first version with Fabio, which worked fine, but indeed has its limitations.

I can elaborate more on the environment I set up, but that won't help this specific ticket (though it may be nice to know how people set things up, or it may help others with similar setups)

dmclf avatar Oct 22 '24 14:10 dmclf