
envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format"

Open bert2002 opened this issue 4 years ago • 25 comments

Nomad version

Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)

Operating system and Environment details

Issue

After initiating a restart of an allocation, Nomad can no longer find the Envoy sidecar image. I did a couple of restarts earlier today (same config) without running into this problem, and Consul does not report any problems.

Nomad Server logs (if appropriate)

Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.751Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=5fa76958-04fe-e6b4-ce6f-9e4647fee2b8 task=connect-proxy-app-backend reason="Restart within policy" delay=16.074283377s
Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.840Z [ERROR] client.driver_mgr.docker: failed pulling container: driver=docker image_ref=envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format"
Jan 26 09:30:08 node-3 nomad[6127]:     2021-01-26T09:30:08.842Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=c141dfaa-72c5-e589-1f7d-91628e8e501c task=connect-proxy-app-backend-service error="Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format"

Any way to find out what NOMAD_envoy_version is set to?

Cheers, bert2002

bert2002 avatar Jan 26 '21 09:01 bert2002

Hi @bert2002, from the error message I think what's happening is that it's not getting interpolated at all, as that "API error" should be bubbling up from the driver. Can you share the jobspec? It might help us figure out what's going on there.

tgross avatar Jan 26 '21 13:01 tgross

My jobspec has quite a lot of groups, so I will share a trimmed-down one.

example.nomad.txt

I had to drain the node, and the container started on another node without any problem. Now on the original node (where the problem was) containers are working again without any problem.

The only thing I can imagine is a runtime error, or that an external service (Docker Hub, etc.) was not reachable.

Any other idea?

bert2002 avatar Jan 27 '21 07:01 bert2002

@bert2002 this is the jobspec you shared, and I don't see any setting for the Envoy proxy image. You saw that error without trying to set the image via interpolation?

job "app1" {

  datacenters = ["staging"]
  type        = "service"

  reschedule {
    delay          = "10s"
    delay_function = "exponential"
    max_delay      = "120s"
    unlimited      = true
  }

  #
  # collectd
  #
  group "collectd" {
    count = 1

    network {
      mode = "bridge"
    }

    restart {
      interval = "2m"
      attempts = 8
      delay    = "15s"
      mode     = "delay"
    }

    service {
      name = "collectd"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "redis"
              local_bind_port  = 6379
            }
          }
        }
      }
    }

    task "collectd" {
      driver = "docker"
      leader = true

      config {
        image = var.docker_image_collectd

        mounts = [
          {
            type   = "bind"
            target = "/etc/collectd/collectd.conf"
            source = "/opt/app/collectd/collectd.conf"
          },
          {
            type   = "bind"
            target = "/etc/collectd/collectd.conf.d"
            source = "/opt/app/collectd/collectd.conf.d/"
          },
          {
            type   = "bind"
            target = "/usr/local/lib/collectd"
            source = "/opt/app/collectd/plugins/"
          }
        ]
      }
    }

    task "filebeat" {
      driver = "docker"

      resources {
        memory = 100
        cpu    = 50
      }

      config {
        image = var.docker_image_filebeat

        mounts = [
          {
            type   = "bind"
            target = "/usr/share/filebeat/filebeat.yml"
            source = "/opt/app/filebeat/filebeat.yml"
          }
        ]
      }
    }

  }
}

tgross avatar Jan 27 '21 13:01 tgross

You saw that error without trying to set the image via interpolation?

I ran into that same error today without setting the image (my jobspec is very similar), though I can't reproduce this either unfortunately.

lukas-w avatar Jan 30 '21 10:01 lukas-w

You saw that error without trying to set the image via interpolation?

yes that is correct and it just happened again (on the same node). I drained it again and it works fine on the other nodes. Is there any more information/logs I can provide?

bert2002 avatar Feb 03 '21 02:02 bert2002

It just happened again after I tried to restart an alloc (on a different node).

bert2002 avatar Feb 05 '21 08:02 bert2002

I updated to Nomad 1.0.3 and Consul 1.9.3, but unfortunately it is happening again, especially when restarting an allocation manually.

Feb 25, '21 16:22:05 +0800	Driver Failure	Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format

bert2002 avatar Feb 25 '21 09:02 bert2002

@tgross is it possible to set NOMAD_envoy_version myself to a specific version?

bert2002 avatar Feb 25 '21 09:02 bert2002

Hey!

I have the same problem. Could you please provide any information on how this can be solved?

empikls avatar Mar 23 '21 13:03 empikls

@empikls as a workaround I am using a fixed version. In nomad.hcl I set this meta information:

client {
  enabled = true
  meta {
    "connect.sidecar_image" = "envoyproxy/envoy:v1.16.0@sha256:9e72bbba48041223ccf79ba81754b1bd84a67c6a1db8a9dbff77ea6fc1cb04ea"
  }
}
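If you would rather pin the image per job instead of per client, Nomad's `sidecar_task` block in the jobspec can override the proxy image directly. A minimal sketch (the tag here is only an example; pick one matching your Consul version):

```hcl
service {
  name = "collectd"

  connect {
    sidecar_service {}

    sidecar_task {
      config {
        # A pinned tag avoids relying on ${NOMAD_envoy_version} interpolation.
        image = "envoyproxy/envoy:v1.16.0"
      }
    }
  }
}
```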

bert2002 avatar Mar 24 '21 00:03 bert2002

Thanks a lot @bert2002! I will try it the next time I face the problem.

empikls avatar Mar 24 '21 12:03 empikls

Had the same problem too; this bug seems flaky. The exact same deployment worked before (apart from some other problems), and I had never encountered this until today. Seems like a race condition or something.

2021-07-20T11:22:32Z  Driver            Downloading image
2021-07-20T11:22:37Z  Not Restarting    Exceeded allowed attempts 2 in interval 10m0s and mode is "fail"
2021-07-20T11:22:37Z  Driver Failure    Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:18Z  Driver            Downloading image
2021-07-20T11:22:19Z  Restarting        Task restarting in 10.268449356s
2021-07-20T11:22:18Z  Driver Failure    Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:04Z  Restarting        Task restarting in 11.565177332s

gregory112 avatar Jul 20 '21 11:07 gregory112

Same issue for me. It's a test job, so I just re-ran it and it started working again.

Recent Events:
Time                       Type            Description
2021-07-22T17:30:01-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:30:01-07:00  Driver          Downloading image
2021-07-22T17:29:50-07:00  Restarting      Task restarting in 10.714078758s
2021-07-22T17:29:50-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:50-07:00  Driver          Downloading image
2021-07-22T17:29:38-07:00  Restarting      Task restarting in 12.196588284s
2021-07-22T17:29:38-07:00  Driver Failure  Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:38-07:00  Driver          Downloading image
2021-07-22T17:29:27-07:00  Restarting      Task restarting in 10.856396313s
2021-07-22T17:29:27-07:00  Terminated      Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"

cgthayer avatar Jul 23 '21 00:07 cgthayer

+1 facing the same issue.

mr-karan avatar Oct 14 '21 08:10 mr-karan

@mr-karan (and others facing the issue), would you mind clicking the 👍 in the issue so we can better track common problems?

This issue does seem to be a bit unpredictable, so we don't have an update yet. The workaround from @bert2002 is the best option right now. You can check which Envoy version to use based on your Consul version in the docs.

lgfa29 avatar Oct 15 '21 22:10 lgfa29

I can easily reproduce this by simply restarting the sidecar task via the UI.

legege avatar Oct 25 '21 23:10 legege

@legege Can confirm, restart via UI causes the error.

meaty-popsicle avatar Nov 01 '21 15:11 meaty-popsicle

+1 I am experiencing this as well

mattolson avatar Nov 13 '21 00:11 mattolson

+1 got the same error. came out of the blue

xeroc avatar Mar 01 '22 18:03 xeroc

This issue is still present in v1.3.1. After a restart via the UI, the job fails due to an error while pulling the Envoy image.

meaty-popsicle avatar Jul 22 '22 10:07 meaty-popsicle

What is not clear to me is where the ${NOMAD_envoy_version} variable is set. The documentation says it comes from a Consul query: "The official upstream Envoy Docker image, where ${NOMAD_envoy_version} is resolved automatically by a query to Consul." Is this something I can control through Consul, then?
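From what I can tell, Nomad resolves the version by asking the local Consul agent which Envoy versions it supports, via Consul's `/v1/agent/self` endpoint. A sketch of how to inspect that yourself (the JSON payload below is only an illustrative shape with made-up versions; field names should be verified against your Consul release):

```shell
# On a live cluster, the supported versions come from the local Consul agent:
#   curl -s http://127.0.0.1:8500/v1/agent/self
# Nomad uses the newest supported entry as ${NOMAD_envoy_version}.

# Simulated response shape (versions here are illustrative, not authoritative):
cat <<'EOF' > /tmp/agent_self.json
{"xDS": {"SupportedProxies": {"envoy": ["1.18.3", "1.17.3", "1.16.4", "1.15.5"]}}}
EOF

# Extract the version Nomad would pick (the first, i.e. newest, element):
python3 -c 'import json; print(json.load(open("/tmp/agent_self.json"))["xDS"]["SupportedProxies"]["envoy"][0])'
```

So the list is controlled by the Consul agent binary itself, not something you configure in Consul; pinning the image (as above in this thread) sidesteps the lookup entirely.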

vvarga007 avatar Jul 27 '22 20:07 vvarga007

This is happening to me very, very rarely. It's not a big deal because eventually the alloc gets replaced, but it's weird to see it every few days.

devyn avatar Jul 29 '22 08:07 devyn

Still happening in v1.4.3

NOBLES5E avatar Dec 07 '22 13:12 NOBLES5E

Yup, I found the same issue in 1.4.3.
After I moved the task to a different node, Nomad updated Envoy.

grzybniak avatar Dec 16 '22 07:12 grzybniak

It is still happening, more often than in the last week or two. Is there any news regarding this?

Thnx

VladimirZD avatar Dec 19 '22 12:12 VladimirZD

This happened in v1.4.3 when an allocation was manually scaled from count = 2 to count = 3. The Envoy sidecar failed to start 3 times (with delay), and then Nomad finally replaced the allocation, which succeeded.

This issue causes service interruption because Nomad scaling actions restart existing allocations without reason. Those restarts also fail, introducing delays before the allocation is retried and replaced.

hiddewie avatar Jan 31 '23 09:01 hiddewie

The more jobs we add, the more frequently we see this. Initially it was maybe once a month, but now we have service interruptions and downtime on a weekly basis being caused by this.

seanamos avatar Feb 09 '23 06:02 seanamos

+1

We recently experienced this problem when one of our applications running in Prod crashed, and we received the error while Nomad was trying to restart the task. We had to issue a restart, and the application worked fine after that. Strangely, we cannot reproduce the error simply by restarting the sidecar task via the UI. The versions we are running are Nomad v1.3.5 and Consul v1.12.4.

ovelascoh avatar Feb 21 '23 01:02 ovelascoh

Same issue. For us it happened when manually restarting an alloc with a connect proxy sidecar

exFalso avatar Feb 21 '23 13:02 exFalso

One more occurrence of this, and same as in the previous message it happened when I manually restarted an allocation from the Nomad UI. Using Nomad 1.4.3

ivantopo avatar Apr 03 '23 11:04 ivantopo