Support for AWS NLB "TCP Pings"

Open forty opened this issue 6 years ago • 16 comments

Is your feature request related to a problem? Please describe. We are using an AWS Network Load Balancer in front of Vault. The NLB terminates TLS and then connects to the Vault instances over TLS as well; the health check is an HTTPS check on /v1/sys/health. Everything works perfectly, but our Vault logs are flooded with messages of this type:

Sep 13 08:04:29 vault02 vault[10592]: 2019-09-13T08:04:29.412Z [INFO] http: TLS handshake error from 10.10.1.27:54625: EOF

After thorough investigation, our best hypothesis is that they are caused by the AWS NLB "TCP pings" described here: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html

If you add a TLS listener to your Network Load Balancer, we perform a listener connectivity test. As TLS termination also terminates a TCP connection, a new TCP connection is established between your load balancer and your targets. Therefore, you might see the TCP pings for this test sent from your load balancer to the targets that are registered with your TLS listener. You can identify these TCP pings because they have the source IP address of your Network Load Balancer and the connections do not contain data packets.

Describe the solution you'd like Silently ignore those "TCP pings" (or at least offer an option to do so), since Vault users could think something is wrong while everything is actually fine (and the messages flood the logs).

Describe alternatives you've considered Ignoring the warnings as they seem harmless

forty avatar Sep 13 '19 08:09 forty
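
A rough illustration of why a data-less probe would surface as `EOF`, assuming the "TCP ping" hypothesis above is right: the target accepts the connection, waits for the first bytes of the TLS handshake, and reads end-of-stream instead. This Python sketch uses a plain local TCP listener, not Vault or a real TLS stack, and `tcp_ping` and `first_bytes_seen_by_target` are hypothetical names chosen for the example.

```python
import socket


def tcp_ping(host: str, port: int) -> None:
    """Open a TCP connection and close it without sending any bytes."""
    with socket.create_connection((host, port), timeout=2):
        pass  # no data is written before the socket is closed


def first_bytes_seen_by_target() -> bytes:
    """Return what a listening target reads from one data-less probe."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]

        # The TCP handshake completes through the listen backlog,
        # so the probe can connect and close before accept() runs.
        tcp_ping("127.0.0.1", port)

        conn, _peer = listener.accept()
        with conn:
            conn.settimeout(2)
            # An empty result means end-of-stream: the peer closed
            # without sending the bytes a TLS server would expect.
            return conn.recv(1024)
```

A TLS server in the target's position has nothing to parse, which is consistent with the handshake failing with `EOF` rather than with a certificate or protocol error.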

I'm suspicious that there's an issue with the way the load balancer is configured to hit the Vault health endpoint. The error doesn't originate from Vault itself, but from one of Go's built-in libraries. There are many posts regarding the issue, this for example.

Would you be willing to share a couple more things?

  • Your Vault configuration
  • Your ELB healthcheck configuration for hitting Vault
  • Can you confirm you've configured certificates, if needed, as described here under "Step 3: Configure Security Settings"?

If you're still receiving that message after checking all that through, that should give us sufficient steps to reproduce the log line.

Thank you!

tyrannosaurus-becks avatar Sep 17 '19 20:09 tyrannosaurus-becks

Hello @tyrannosaurus-becks , thanks a lot for your answer. I'm happy to share whatever can help:

My Vault config:

{
  "ui": true,
  "pid_file": "/run/vault/vault.pid",
  "storage": {
    "consul": {
      "address": "unix:///var/local/consul/consul.sock"
    }
  },
  "listener": {
    "tcp": {
      "address": "0.0.0.0:8200",
      "tls_cert_file": "/etc/vault.d/server.cert",
      "tls_key_file": "/etc/vault.d/server.key"
    }
  },
  "seal": {
    "awskms": {
      "region": "eu-west-1",
      "kms_key_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
  }
}

My LB config (terraform config, which I think is worth many words ;) )

resource "aws_lb" "vault" {
  name                             = "${var.project_name}-vault-nlb"
  internal                         = true
  load_balancer_type               = "network"
  subnets                          = "${aws_subnet.main.*.id}"
  enable_cross_zone_load_balancing = true

  tags = var.tags
}

resource "aws_lb_target_group" "vault" {
  name        = "${var.project_name}-vault-nlb-tg"
  port        = 8200
  protocol    = "TLS"
  vpc_id      = "${aws_vpc.vpc.id}"
  target_type = "instance"

  health_check {
    path                = "/v1/sys/health"
    port                = "traffic-port"
    protocol            = "HTTPS"
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }

  tags = var.tags
}

resource "aws_acm_certificate" "certificate" {
  domain_name       = "${var.domain_name}"
  validation_method = "DNS"

  tags = var.tags

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_lb_listener" "vault" {
  load_balancer_arn = "${aws_lb.vault.arn}"
  port              = "443"
  protocol          = "TLS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = "${aws_acm_certificate.certificate.arn}"

  default_action {
    type             = "forward"
    target_group_arn = "${aws_lb_target_group.vault.arn}"
  }
}

resource "aws_vpc_endpoint_service" "vault" {
  acceptance_required        = true
  network_load_balancer_arns = ["${aws_lb.vault.arn}"]

  tags = merge(var.tags, {
    "Name" = "${var.project_name}_vault_vpces"
  })
}

forty avatar Sep 23 '19 12:09 forty

Wouldn't setting HealthCheckProtocol to HTTPS fix this problem?

jefferai avatar Nov 23 '19 16:11 jefferai

Isn't that already the case? (See the Terraform config above.)

forty avatar Nov 23 '19 19:11 forty

Maybe? Worth checking on AWS console probably.

jefferai avatar Nov 24 '19 00:11 jefferai

I checked in the AWS console, terraform works properly and the protocol for the healthcheck is HTTPS as configured in the tf file above. I assume the healthcheck would have failed if it was not done using HTTPS anyway (as noted, everything is working fine)

forty avatar Dec 04 '19 13:12 forty

Hi @forty, I came across this project: https://github.com/jen20/vault-health-checker. Wanted to share in case you still need a solution.

kwilczynski avatar Feb 28 '20 16:02 kwilczynski

@kwilczynski - that's what we did as well and it works nicely.

ftcjeff avatar Aug 19 '20 19:08 ftcjeff

Hi @ftcjeff, nice! Thank you for letting me know!

I am sure that @jen20 will be happy to know that his project solves this problem so nicely! It's great.

kwilczynski avatar Aug 19 '20 19:08 kwilczynski

@kwilczynski the project you mentioned states:

Unfortunately, the AWS NLB does not support HTTP health checks, instead supporting only TCP checks. While TCP checks can be pointed at a Vault server, they cannot determine the actual health of the instance, and fill the logs of the Vault server with spam related to unencrypted requests.

Ideally, the NLB will eventually support HTTP health checks and this project will become obsolete.

which is incorrect. My vault NLB is configured to do HTTPS health checks (as you can see in my TF config above), and I still have this issue.

forty avatar Aug 24 '20 08:08 forty

Running into the same issue here. It'd be nice if there was a flag or something that we could set to ignore TLS warnings, or even whitelist specific CIDR blocks or IP addresses. Or maybe make these types of warnings DEBUG level rather than INFO level.

kevingunn-wk avatar Oct 31 '20 05:10 kevingunn-wk
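
A minimal sketch of the CIDR-based filtering idea, applied outside Vault to lines that have already been written (for example in a log shipper). It is not a Vault option: `is_probe_noise` is a hypothetical helper, the regular expression is modelled only on the log line quoted at the top of this issue, and the CIDR `10.10.1.0/24` in the usage note is an assumed load balancer subnet.

```python
import ipaddress
import re

# Modelled on the line quoted in the issue:
# "... [INFO] http: TLS handshake error from 10.10.1.27:54625: EOF"
HANDSHAKE_EOF = re.compile(
    r"http: TLS handshake error from "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\d+: EOF\s*$"
)


def is_probe_noise(line: str, probe_cidrs: list[str]) -> bool:
    """True if `line` is a handshake EOF from an address inside `probe_cidrs`."""
    match = HANDSHAKE_EOF.search(line)
    if match is None:
        return False  # some other log message: always keep it
    try:
        source = ipaddress.ip_address(match.group("ip"))
    except ValueError:
        return False  # not a valid IPv4 address after all
    return any(source in ipaddress.ip_network(cidr) for cidr in probe_cidrs)
```

With the quoted log line and an assumed subnet of `10.10.1.0/24`, `is_probe_noise(line, ["10.10.1.0/24"])` is true, while handshake errors from any other source still get through, so genuine TLS problems stay visible.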

@forty It was correct at the time it was written (see the date stamp on the README!), but may no longer be.

jen20 avatar Feb 09 '21 13:02 jen20

https://github.com/golang/go/issues/26918

AlexanderYastrebov avatar Nov 19 '21 16:11 AlexanderYastrebov

This is still a problem in Ali Cloud, and SLBs there don't support HTTPS health checks. Just TCP and HTTP. Wouldn't it be possible to trap that log and move it to Debug instead of Info?

imranzunzani avatar Jul 14 '22 13:07 imranzunzani

When using NLBs, your EC2 instances need to allow the subnet CIDRs; you can't grant access by referencing the subnet ID that is attached to the NLB.

Also, the second problem: the endpoint /v1/sys/health?standbyok is what you want. HOWEVER, if your Vault is sealed it will return 503 (Vault's default status code for a sealed node), and when using an NLB your HTTP/HTTPS health checks MUST return 200. There's a parameter called "matcher" which lets you set the valid HTTP response codes; HOWEVER, that parameter is not allowed with NLBs.

Hope that helps

mrproper avatar Sep 23 '22 03:09 mrproper
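
To make the status-code point concrete, here is a small Python sketch of how the documented default responses of /v1/sys/health interact with a check that accepts only 200. The codes are Vault's documented defaults; the state labels and the helpers `health_status` and `passes_200_only_check` are hypothetical, and the 200-only rule is taken from the comment above, not verified against current NLB behaviour.

```python
# Default /v1/sys/health status codes as documented by Vault.
# The dictionary keys are labels invented for this example.
DEFAULT_CODES = {
    "active": 200,
    "standby": 429,
    "dr_secondary": 472,
    "performance_standby": 473,
    "uninitialized": 501,
    "sealed": 503,
}


def health_status(state: str, standbyok: bool = False) -> int:
    """Status code expected from a node in `state`."""
    # standbyok makes a standby answer with the active code;
    # it does not change the sealed or uninitialized codes.
    if standbyok and state == "standby":
        return DEFAULT_CODES["active"]
    return DEFAULT_CODES[state]


def passes_200_only_check(state: str, standbyok: bool = False) -> bool:
    """Whether a health check that accepts only 200 would pass."""
    return health_status(state, standbyok) == 200
```

Under these assumptions a standby node passes only when `standbyok` is set, and a sealed node fails either way, which is usually the desired outcome for a load balancer target.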

As of v1.12.0 this issue is still happening and flooding logs.

forty avatar Nov 28 '22 12:11 forty