nomad

allow DHV requests to be both per_alloc and sticky

Open inahga opened this issue 3 months ago • 9 comments

Nomad version

Output from nomad version

Nomad v1.10.4
BuildDate 2025-08-12T20:48:32Z
Revision 62b195aaa535b2159d215eaf89e6f4a455d6f686

Operating system and Environment details

Ubuntu 24.04 on QEMU/KVM.

Issue

When using dynamic host volumes, you can set per_alloc = true and the job deploys successfully.

This contradicts the documentation, which states:

Use per_alloc only with CSI volumes and sticky only with dynamic host volumes.

If this is the case, a host volume with per_alloc = true should fail job validation.

Reproduction steps

Create a host volume

name      = "example[0]"
type      = "host"
plugin_id = "mkdir"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

Deploy a spec

job "example" {
  type      = "service"
  node_pool = "ifn"

  group "ubuntu" {
    count = 1

    volume "data" {
      type   = "host"
      source = "example"
      per_alloc = true
    }

    task "ubuntu" {
      driver = "podman"

      config {
        image = "docker.io/library/ubuntu:noble"
        command = "/usr/bin/sleep"
        args = ["infinity"]
      }

      volume_mount {
        volume      = "data"
        destination = "${NOMAD_ALLOC_DIR}/data"
      }
    }
  }
}

Expected Result

Actually, this behavior is totally fine! I like it this way, since it solves the problem of allocating multiple host volumes to a replicated service (e.g. mysql, etcd).

I'm just not sure if this was the intended behavior behind per_alloc, if the documentation is wrong, and/or whether using per_alloc in this way has unforeseen consequences and risks.

Can the maintainers tell me whether this is a broken usage of per_alloc? If so, I'd actually like us to fix the problems that prevent it from being used for host volumes (happy to help). If not, we should update the documentation to state that using per_alloc this way is OK.

Actual Result

It deploys fine (hopefully a good thing!)

inahga avatar Nov 21 '25 21:11 inahga

Edit: ignore everything I wrote originally here, as I was misreading your post. Fixed version below: 😊

Hi @inahga! You're right, this should fail validation. I'll mark this for roadmapping.

tgross avatar Nov 21 '25 21:11 tgross

More broadly, what can we do to make per_alloc work for dynamic host volumes? What about it breaks when applied to DHVs and not CSI volumes?

I think its behavior is desirable for DHVs. In fact, it is probably a blocker for us using DHVs and running stateful workloads on Nomad. The example scenario I outlined in the issue is actually what I want to happen, and it ostensibly works, but it's relying on undocumented and maybe subtly broken behavior.

inahga avatar Nov 21 '25 22:11 inahga

And, to head off an XY problem: I want to use DHVs to run stateful workloads that do replication at the application layer, e.g. let's just say mysql. So having host-attached volumes is totally fine.

The problem with a naive approach (i.e. without per_alloc) is that I can't make a mysql group where each replica has its own volume.

The best I can do is a dynamic "group" block, like so:

variables {
  volume_index = ["0", "1", "2", "3"]
}

job "mysql" {
  type      = "service"

  dynamic "group" {
    for_each = var.volume_index
    labels   = ["mysql${group.value}"]

    content {
      volume "mysql" {
        type   = "host"
        source = "mysql${group.value}"
        sticky = true
      }
      task "mysql" {
        // ...
        volume_mount {
          volume      = "mysql"
          destination = "${NOMAD_ALLOC_DIR}/data"
        }
      }
    }
  }

  update {
    max_parallel     = 1
    auto_revert      = true
    healthy_deadline = "9m"
    min_healthy_time = "30s"
  }
}

This works, but not well. Updates don't really work in this scenario: if I submit a new version of the job, each independent group is restarted at the same time, causing downtime.

The other alternative is to define a separate job for each volume, but then I'm on my own for scheduling updates across them. This defeats one of the key uses of Nomad.

I think this saga has been played out before for CSI volumes. See https://github.com/hashicorp/nomad/issues/7877

inahga avatar Nov 21 '25 22:11 inahga

More broadly, what can we do to make per_alloc work for dynamic host volumes? What about it breaks when applied to DHVs and not CSI volumes?

The per_alloc flag requires that each volume have its own name (which was a bit jankily implemented by requiring the [$NUMBER] suffix). Whereas with host volumes, the name is shared with each node's volume having a unique ID. I suppose there's nothing in particular that prevents us from implementing the exact same logic for host volumes, but that workflow for CSI was just ugly so we didn't intend to reproduce it.
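To make the [$NUMBER] suffix convention concrete, here's a rough sketch (not Nomad's actual implementation) of how a per_alloc volume source maps to indexed volume names, one per allocation in the group:

```python
# Rough sketch, not Nomad code: with per_alloc = true, the volume source
# is suffixed with the allocation index, so each allocation in the group
# claims its own indexed volume name (e.g. "example[0]", "example[1]", ...).
def per_alloc_source(source: str, alloc_index: int) -> str:
    return f"{source}[{alloc_index}]"

# A group with count = 3 and source = "example" would claim these names:
print([per_alloc_source("example", i) for i in range(3)])
```

This is why the reproduction above registers a volume named `example[0]` while the jobspec's volume block uses `source = "example"`.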

For stateful workloads where the identity of a volume matters, the envisioned workflow is that you'd use a volume definition like:

volume "mysql" {
  type        = "host"
  source      = "mysql-data"
  access_mode = "single-node-single-writer"
  sticky      = true
}

Then each allocation of the MySQL cluster ends up on a different node (as expected) and there's a persistent volume claim on that data for that allocation. If you replace an allocation in the cluster during a job upgrade, it'll go back to that same volume. Adding per_alloc would add an extra step for the volume author to add the numeric suffix to the names of the volumes.
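A toy model of that claim behavior, to illustrate the idea (an assumption about the semantics described above, not Nomad's scheduler code):

```python
# Toy model of sticky = true: the first placement claims a free volume,
# and a replacement allocation with the same name is routed back to the
# volume it previously claimed. Names and IDs here are made up.
claims: dict[str, str] = {}  # allocation name -> claimed volume ID

def place(alloc: str, free_volumes: list[str]) -> str:
    """Return the volume this allocation should use, honoring a prior claim."""
    if alloc in claims:
        return claims[alloc]              # replacement alloc: reuse its volume
    vol = free_volumes.pop(0)             # first placement: claim a free volume
    claims[alloc] = vol
    return vol

free = ["vol-uuid-1", "vol-uuid-2", "vol-uuid-3"]
first = place("mysql[0]", free)           # initial placement claims a volume
replacement = place("mysql[0]", free)     # job upgrade: same claim is reused
print(first == replacement)  # True
```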

So I don't think I'd be opposed to adding per_alloc support, but it does feel unnecessary. Maybe I'm missing something here?

tgross avatar Dec 01 '25 18:12 tgross

but that workflow for CSI was just ugly

Indeed, if there's a cleaner solution I'm all for it.

So I don't think I'd be opposed to adding per_alloc support, but it does feel unnecessary. Maybe I'm missing something here?

Your snippet works perfectly if the application has a 1:1 mapping between allocations and nodes.

However, if multiple allocations can be resident on the same node, this is where I'm stuck. My exact use case is horizontal sharding of MySQL (Vitess), where a single machine is expected to host many small MySQL replicas.

I also imagine this can apply where the backing storage is shared (e.g. NFS), so a replica can be safely rescheduled on a node that is already hosting another replica.

inahga avatar Dec 01 '25 19:12 inahga

However, if multiple allocations can be resident on the same node, this is where I'm stuck. My exact use case is horizontal sharding of MySQL (Vitess), where a single machine is expected to host many small MySQL replicas.

Ah, of course. That makes sense. And thinking about the wording in the docs again I think this is a case where we're encouraging a pattern without mandating it. I think you've made a good case here for why we should clarify the docs and not fail validation.

And as you've no doubt noted, we can't run with both per_alloc and sticky=true:

$ nomad job run ./example.nomad.hcl
Error submitting job: Unexpected response code: 500 (1 error occurred:
        * Task group group validation failed: 1 error occurred:
        * Task group volume validation for db failed: 1 error occurred:
        * volume cannot be per_alloc and sticky at the same time)

I just went back to the original internal design doc around sticky=true and can't find the reasoning behind that, only that the features were intentionally exclusive. Let me go back to the team and chat with them about the "why?" of that. I could see it being helpful to have both in the use case you've described here.

tgross avatar Dec 02 '25 15:12 tgross

Ok, I had a chat with my colleague @pkazmierczak and here's where we landed. The reason that per_alloc wasn't originally intended for DHV is that in CSI the volume.source refers to the volume ID and not the volume name. The ID is unique per namespace, whereas the volume name is not. In DHV, the volume.source refers to the volume name, because we needed that for backwards compatibility with static host volumes and because otherwise you'd need to refer to the UUID in the jobspec, which wouldn't work for group.count > 1.

But DHV does support per_alloc explicitly in the scheduler code. The problem is that it imposes a requirement on you, the user, to ensure that the volume name is unique across the cluster. For example, in your original jobspec, if you have two volumes named example[2] on different nodes, there's nothing making the association to a specific volume "sticky" to one or the other.
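The collision hazard can be sketched like this (an assumption about name-based matching, not Nomad internals):

```python
# Sketch of the name-collision problem: per_alloc matches volumes by
# *name*, so two volumes with the same name on different nodes are both
# feasible candidates, and nothing pins the allocation to one of them
# across reschedules. IDs and node names here are made up.
volumes = [
    {"id": "uuid-a", "name": "example[2]", "node": "node-1"},
    {"id": "uuid-b", "name": "example[2]", "node": "node-2"},
]

def feasible(vols, wanted_name):
    """Return every volume whose name matches the per_alloc-derived name."""
    return [v for v in vols if v["name"] == wanted_name]

print(len(feasible(volumes, "example[2]")))  # 2 -- ambiguous placement
```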

That being said, there's nothing we can see that makes it impossible to have both per_alloc and sticky on the same DHV, other than the validation we currently have. It would be nice to support this, because it would make per_alloc safer to use without worrying about volume name collisions. We want to make sure we've verified that's actually safe before removing that check, though.

So here's what we're going to do:

  • We're going to update the documentation to spell out the implicit requirements for using per_alloc with DHV, and to make clear that per_alloc is not required.
  • We're going to re-title this issue (again!) to "allow per_alloc and sticky on the same volume request"

Thanks for your patience here @inahga! It really helps us to be able to talk through these use cases!

tgross avatar Dec 02 '25 16:12 tgross

Docs PR: https://github.com/hashicorp/web-unified-docs/pull/1426

tgross avatar Dec 02 '25 16:12 tgross

That sounds good. Thanks for working through this!

inahga avatar Dec 02 '25 16:12 inahga