envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format"
Nomad version
Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)
Operating system and Environment details
Issue
After initiating a restart of an allocation, Nomad cannot find the Envoy sidecar image anymore. I did a couple of restarts earlier today (same config) without running into this problem, and Consul does not report any problems.
Nomad Server logs (if appropriate)
Jan 26 09:30:08 node-3 nomad[6127]: 2021-01-26T09:30:08.751Z [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=5fa76958-04fe-e6b4-ce6f-9e4647fee2b8 task=connect-proxy-app-backend reason="Restart within policy" delay=16.074283377s
Jan 26 09:30:08 node-3 nomad[6127]: 2021-01-26T09:30:08.840Z [ERROR] client.driver_mgr.docker: failed pulling container: driver=docker image_ref=envoyproxy/envoy:v${NOMAD_envoy_version} error="API error (400): invalid tag format"
Jan 26 09:30:08 node-3 nomad[6127]: 2021-01-26T09:30:08.842Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=c141dfaa-72c5-e589-1f7d-91628e8e501c task=connect-proxy-app-backend-service error="Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format"
Any way to find out what NOMAD_envoy_version is set to?
Cheers, bert2002
Hi @bert2002, from the error message I think what's happening is that it's not getting interpolated at all, as that "API error" should be bubbling up from the driver. Can you share the jobspec? It might help us figure out what's going on there.
My jobspec has quite a lot of groups, so I will share a limited one.
I had to drain the node, and the container started on another node without any problem. Now containers are also running again without issues on the original node (where the problem was).
The only thing I can imagine is a runtime error, or an external service (Docker Hub, etc.) not being reachable.
Any other ideas?
@bert2002 this is the jobspec you shared and I don't see any setting for the Envoy proxy image. You saw that error without trying to set the image via interpolation?
job "app1" {
datacenters = ["staging"]
type = "service"
reschedule {
delay = "10s"
delay_function = "exponential"
max_delay = "120s"
unlimited = true
}
#
# collectd
#
group "collectd" {
count = 1
network {
mode = "bridge"
}
restart {
interval = "2m"
attempts = 8
delay = "15s"
mode = "delay"
}
service {
name = "collectd"
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "redis"
local_bind_port = 6379
}
}
}
}
}
task "collectd" {
driver = "docker"
leader = true
config {
image = var.docker_image_collectd
mounts = [
{
type = "bind"
target = "/etc/collectd/collectd.conf"
source = "/opt/app/collectd/collectd.conf"
},
{
type = "bind"
target = "/etc/collectd/collectd.conf.d"
source = "/opt/app/collectd/collectd.conf.d/"
},
{
type = "bind"
target = "/usr/local/lib/collectd"
source = "/opt/app/collectd/plugins/"
}
]
}
}
task "filebeat" {
driver = "docker"
resources {
memory = 100
cpu = 50
}
config {
image = var.docker_image_filebeat
mounts = [
{
type = "bind"
target = "/usr/share/filebeat/filebeat.yml"
source = "/opt/app/filebeat/filebeat.yml"
}
]
}
}
}
}
You saw that error without trying to set the image via interpolation?
I ran into that same error today without setting the image (my jobspec is very similar), though I can't reproduce this either unfortunately.
You saw that error without trying to set the image via interpolation?
yes that is correct and it just happened again (on the same node). I drained it again and it works fine on the other nodes. Is there any more information/logs I can provide?
It just happened again after I wanted to restart an alloc (on a different node).
I updated to Nomad 1.0.3 and Consul 1.9.3, but unfortunately it is happening again, especially when restarting an allocation manually.
Feb 25, '21 16:22:05 +0800 Driver Failure Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
@tgross is it possible to set NOMAD_envoy_version myself to a specific version?
Hey!
I have the same problem. Could you please provide any information on how this can be solved?
@empikls as a workaround I am using a fixed version. In nomad.hcl I set this meta information:
client {
  enabled = true

  meta {
    "connect.sidecar_image" = "envoyproxy/envoy:v1.16.0@sha256:9e72bbba48041223ccf79ba81754b1bd84a67c6a1db8a9dbff77ea6fc1cb04ea"
  }
}
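If you would rather pin the image per job instead of per client, the same idea should also work through the sidecar_task stanza inside the connect block of the jobspec. This is just a rough sketch I have not tested in this thread's setup, and the tag is only an example; pick one that matches your Consul version:

service {
  name = "collectd"

  connect {
    sidecar_service {}

    # Pin the Envoy image for this sidecar only, instead of relying on
    # the ${NOMAD_envoy_version} lookup. Example tag, adjust as needed.
    sidecar_task {
      config {
        image = "envoyproxy/envoy:v1.16.0"
      }
    }
  }
}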
Thanks a lot @bert2002! I will try it the next time I face the problem.
Had the same problem too; this bug seems flaky. My deployment worked before (apart from some other problems) and I had never encountered this until today, with the exact same deployment. Seems like a race condition or something.
2021-07-20T11:22:32Z Driver Downloading image
2021-07-20T11:22:37Z Not Restarting Exceeded allowed attempts 2 in interval 10m0s and mode is "fail"
2021-07-20T11:22:37Z Driver Failure Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:18Z Driver Downloading image
2021-07-20T11:22:19Z Restarting Task restarting in 10.268449356s
2021-07-20T11:22:18Z Driver Failure Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-20T11:22:04Z Restarting Task restarting in 11.565177332s
Same issue for me. It's a test job, so I just re-run it and it starts working again.
Recent Events:
Time Type Description
2021-07-22T17:30:01-07:00 Driver Failure Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:30:01-07:00 Driver Downloading image
2021-07-22T17:29:50-07:00 Restarting Task restarting in 10.714078758s
2021-07-22T17:29:50-07:00 Driver Failure Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:50-07:00 Driver Downloading image
2021-07-22T17:29:38-07:00 Restarting Task restarting in 12.196588284s
2021-07-22T17:29:38-07:00 Driver Failure Failed to pull `envoyproxy/envoy:v${NOMAD_envoy_version}`: API error (400): invalid tag format
2021-07-22T17:29:38-07:00 Driver Downloading image
2021-07-22T17:29:27-07:00 Restarting Task restarting in 10.856396313s
2021-07-22T17:29:27-07:00 Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
+1 facing the same issue.
@mr-karan (and others facing the issue), would you mind clicking the 👍 in the issue so we can better track common problems?
This issue does seem to be a bit unpredictable, so we don't have an update yet. The workaround from @bert2002 is the best option right now. You can check which Envoy version to use based on your Consul version in the docs.
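If you want to see which Envoy versions your local Consul agent actually advertises (which is where Nomad gets the value from), something along these lines should show it on recent Consul versions, assuming the agent's HTTP API is on the default port and jq is installed:

# Lists the Envoy versions the local Consul agent reports as supported
curl -s http://127.0.0.1:8500/v1/agent/self | jq '.xDS.SupportedProxies.envoy'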
I can easily reproduce this by simply restarting the sidecar task via the UI.
@legege Can confirm, restart via UI causes the error.
+1 I am experiencing this as well
+1 got the same error. came out of the blue
This issue is still present in v1.3.1. After a restart via the UI, the job fails due to an error while pulling the Envoy image.
What is not clear to me is where the ${NOMAD_envoy_version} variable is set. The documentation says it comes from a Consul query:
The official upstream Envoy Docker image, where ${NOMAD_envoy_version} is resolved automatically by a query to Consul.
Is this something that I can control via Consul, then?
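As far as I can tell, the variable isn't something you set yourself. The sidecar task that Nomad injects uses an image template roughly like the sketch below (a paraphrase, not copied from Nomad's source), and NOMAD_envoy_version is filled in from the Envoy versions the local Consul agent advertises:

sidecar_task {
  config {
    # Rough paraphrase of the default image template; ${NOMAD_envoy_version}
    # is resolved by asking the local Consul agent which Envoy versions it supports.
    image = "envoyproxy/envoy:v${NOMAD_envoy_version}"
  }
}

If that lookup fails or returns nothing, the literal string reaches Docker, which would explain the "invalid tag format" error in this issue.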
This is happening to me very, very rarely. It's not a big deal because eventually the alloc gets replaced, but it's weird to see it every few days.
Still happening in v1.4.3
Yup, I found the same issue in 1.4.3. After I moved the task to a different node, Nomad updated Envoy.
It is still happening, and more often in the last week or two. Is there any news regarding this? Thanks
This happened in v1.4.3, when an allocation was manually scaled from count = 2 to count = 3. The start of the Envoy sidecar failed 3 times (with delay), and then Nomad finally reallocated the job, which was successful.
This issue causes service interruption because Nomad scaling actions make existing allocations restart without reason. Those restarts also fail, and introduce delays before retrying and reallocating the allocation.
The more jobs we add, the more frequently we see this. Initially it was maybe once a month, but now it causes service interruptions and downtime on a weekly basis.
+1
We recently experienced this problem when one of our applications running in Prod crashed and we received the error while Nomad was trying to restart the task. We had to issue a restart, and the application worked fine after that. Strangely, we cannot reproduce the error just by restarting the sidecar task via the UI. The versions we are running are Nomad v1.3.5 and Consul v1.12.4.
Same issue. For us it happened when manually restarting an alloc with a Connect proxy sidecar.
One more occurrence of this; same as in the previous message, it happened when I manually restarted an allocation from the Nomad UI. Using Nomad 1.4.3.