
Tasks in allocation fail to start with last event "no message" when using bridge-mode and 'hostname = "${node.unique.name}"'

Open SCORP111 opened this issue 1 year ago • 4 comments

Nomad version

1.8.1

Operating system and Environment details

Ubuntu 20.04

Issue

Tasks in allocation fail to start with last event "no message" when using bridge-mode and 'hostname = "${node.unique.name}"'.

[screenshots of the failing allocation were attached in the original issue]

Reproduction steps

  group "test" {
    network {
      mode = "bridge"
      hostname = "${node.unique.name}"
    }

Expected Result

Tasks in the allocation can be started and the deployment succeeds.

Actual Result

Tasks fail to start with "no message" and the deployment gets stuck in a restart loop.

Job file (if appropriate)

  group "test" {
    network {
      mode = "bridge"
      hostname = "${node.unique.name}"
    }
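
For a self-contained reproduction, a minimal job along these lines should trigger the same failure; the task body is assumed here, modeled on the raw_exec examples further down the thread:

job "repro" {
  datacenters = ["datacenter"]
  type        = "service"

  group "test" {
    network {
      mode     = "bridge"
      hostname = "${node.unique.name}"
    }

    # a raw_exec task in a group that sets network.hostname is the
    # combination identified later in this thread as the trigger
    task "example" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args    = ["-c", "while true; do echo \"I'm working fine\"; sleep 2; done"]
      }
    }
  }
}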

SCORP111 · Aug 13 '24 09:08

Hi @SCORP111 and thanks for raising this issue. Could you provide more information about the job, such as the driver it is using? The Nomad client logs would also be useful here. Thanks.

jrasell · Aug 13 '24 09:08

Hi @jrasell, thanks for the quick response!

The job consists of 3 tasks and is using the following drivers:

  • prestart-task-1: raw_exec
  • prestart-task-2: docker
  • service: docker

When trying to get the logs of the tasks, I'm getting the messages below and am unable to pull the logs. The web UI doesn't show any logs either.

me@NOMADS-001:~$ nomad logs -namespace=* 55122a21
Failed to validate task: Allocation "55122a21" is running the following tasks:
  * generate-config
  * nginx-proxy
  * service

Please specify the task.
me@NOMADS-001:~$ nomad logs -namespace=* -task service 55122a21
Failed to read stdout file: error reading file: Unexpected response code: 404 (Unknown allocation "55122a21-6336-a1e6-7d4c-da4832732809")
me@NOMADS-001:~$ nomad logs -namespace=* -task generate-config 55122a21
Failed to read stdout file: error reading file: Unexpected response code: 404 (Unknown allocation "55122a21-6336-a1e6-7d4c-da4832732809")
me@NOMADS-001:~$ nomad logs -namespace=* -task nginx-proxy 55122a21
Failed to read stdout file: error reading file: Unexpected response code: 404 (Unknown allocation "55122a21-6336-a1e6-7d4c-da4832732809")

SCORP111 · Aug 13 '24 09:08

Hey @SCORP111, can you provide some more details? Does the job fail to start? Can you show nomad job status $jobname? Perhaps nomad alloc status $allocID? nomad eval list? It's kinda hard to understand what's happening from this alone.

pkazmierczak · Aug 14 '24 11:08

Hey,

sorry for being a little unclear in my description. I think I've narrowed down the problem: it seems to be the combination of using the raw_exec driver and setting hostname in the group-level network block.

This is working fine:

job "test" {
  datacenters = ["datacenter"]
  type        = "service"

  group "raw_exec" {
    count = 1

    restart {
      attempts = 3
      interval = "2m"
      delay    = "15s"
      mode     = "fail"
    }

    network {
      mode = "bridge"
    }

    task "example" {
      driver = "raw_exec"
      config {
        command = "/bin/sh"
        args    = ["-c", "while true; do echo \"I'm working fine\"; sleep 2; done"]
      }
    }
  }
}


If I now add, for example, hostname = "test", the allocation fails to start:

job "test" {
  datacenters = ["datacenter"]
  type        = "service"

  group "raw_exec" {
    count = 1

    restart {
      attempts = 3
      interval = "2m"
      delay    = "15s"
      mode     = "fail"
    }

    network {
      mode = "bridge"
      hostname = "test"
    }

    task "example" {
      driver = "raw_exec"
      config {
        command = "/bin/sh"
        args    = ["-c", "while true; do echo \"I'm working fine\"; sleep 2; done"]
      }
    }
  }
}


Okay, after running nomad alloc status it seems that hostname with raw_exec is simply not supported: Client Description = "Unable to add allocation due to error: failed to configure network manager: hostname is not currently supported on driver raw_exec"

Oh and it seems it also says so in the documentation: https://developer.hashicorp.com/nomad/docs/job-specification/network#hostname - ...currently only supported using the Docker driver...
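
For contrast, the documented-supported shape is a group-level hostname where every task in the group uses the Docker driver; a minimal sketch, with the image and task body assumed:

job "hostname-ok" {
  datacenters = ["datacenter"]
  type        = "service"

  group "docker-only" {
    network {
      mode     = "bridge"
      hostname = "${node.unique.name}"
    }

    # with no raw_exec task in the group, the group-level hostname
    # applies inside the shared network namespace, as documented
    task "service" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "while true; do echo \"I'm working fine\"; sleep 2; done"]
      }
    }
  }
}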

Is there any workaround to keep setting the hostname for the Docker tasks when using bridge mode while also having a prestart task that uses raw_exec?

Anyway, sorry for the confusion!

Kind regards

SCORP111 · Aug 14 '24 15:08

Doing a little issue triage cleanup and saw this one.

Oh and it seems it also says so in the documentation

Right, because to set the hostname we need to give the task an /etc/hostname file that's been written elsewhere and then bind-mounted into that location. That can only work for task drivers that have a mount namespace, like docker.

Is there any workaround to keep setting the hostname for docker-drivers, when using bridge mode and also having a prestart-task thats using raw_exec?

Because networks are defined at the group level, all the tasks share a network namespace, and we can't give the tasks different hostnames without causing a mess. But if you don't want them in the same network namespace, you can override the Docker networking configuration with network_mode and hostname in the task configuration. That looks like this:

job "example" {

  group "group" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "docker" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]

        # moves this task out of the shared group network namespace and
        # onto Docker's own bridge, where hostname is supported
        network_mode = "bridge"
        hostname     = "example.local"
        ports        = ["www"]
      }
    }

    task "raw" {
      driver = "raw_exec"
      config {
        command = "/bin/sh"
        args    = ["-c", "while true; do echo \"I'm working fine\"; sleep 2; done"]
      }
    }

  }

}
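
Once the allocation is healthy, you should be able to verify the override with nomad alloc exec -task docker <alloc-id> hostname, which ought to print example.local.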

Otherwise it looks like we've got this issue resolved, so I'm going to close it out.

tgross · Nov 08 '24 16:11

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] · Mar 09 '25 02:03