nomad icon indicating copy to clipboard operation
nomad copied to clipboard

Add NOMAD_WORKLOAD to CNI args

Open apollo13 opened this issue 1 year ago • 8 comments
trafficstars

Fixes #23830

apollo13 avatar Oct 29 '24 13:10 apollo13

Heyhey, thanks for the PR!

One reason we structured the new(ish) cni{} block the way we did, like

cni {
  args = {"key": "val", ...}
}

was to leave the door open for something like this alongside the explicit user-provided args:

cni {
  nomad_args = true
}

which would magically transform into some standard args that CNI plugin authors could depend on (like requested in #23830!)

I'm more inclined to provide an opt-in mechanism like that. What do you think?

Although I haven't written many CNI plugins myself (a couple), I also would prefer separate args instead of a JSON blob, and I think following the same convention as environment variables would be nice too, for consistency.

Following that ^, this task could be accomplished with jobspec mutation at registration time, which would turn cni{ nomad_args = true} into

cni {
  args = {
    "NOMAD_NAMESPACE": "${NOMAD_NAMESPACE}",
    "NOMAD_JOB_ID": "${NOMAD_JOB_ID}",
    ... etc ...
  }
}

and that would get interpolated just as user-provided args do, so wouldn't need changes all the way down in allocrunner.

How does that strike you @apollo13?

gulducat avatar Oct 30 '24 18:10 gulducat

I think @gulducat is on the right track here, with one suggestion:

Following that ^, this task could be accomplished with jobspec mutation at registration time

We probably don't need/want to do that mutation at registration time, because we have the values available on the client as part of the taskenv interpolation. So we can transform the args right in the cni hook here.

tgross avatar Oct 30 '24 19:10 tgross

In that case, the new nomad_args (is that a good name?) bool would need to be passed from the jobspec through to the allocrunner cni hook. The PR that introduced cni-args-in-the-jobspec #23538 shows all the plumbing for reference.

Alternatively / additionally, the opt-in toggle could be in client config, which would apply to all allocs that land on the node. The plumbing for that would be different from the jobspec approach.

gulducat avatar Oct 30 '24 19:10 gulducat

Alternatively / additionally, the opt-in toggle could be in client config, which would apply to all allocs that land on the node. The plumbing for that would be different from the jobspec approach.

That's a good point. The cluster admin is nominally the owner of things like what CNI plugins are available. Maybe just make it opt-out on the client?

tgross avatar Oct 30 '24 20:10 tgross

Although I haven't written many CNI plugins myself (a couple), I also would prefer separate args instead of a JSON blob, and I think following the same convention as environment variables would be nice too, for consistency.

Works for me, I haven't written any yet I just wanted some PR to get the discussion started. I am fine with separate env variables instead.

That's a good point. The cluster admin is nominally the owner of things like what CNI plugins are available. Maybe just make it opt-out on the client?

Do we win much by making it possible to opt-out at all? I am asking because the more options the more to test and given that we set IgnoreUnknown anyways I cannot imagine any CNI plugin having problems with NOMAD_* vars.

apollo13 avatar Oct 30 '24 20:10 apollo13

Pushed a commit with env variables. Are there any others we should add? Task imo makes no sense since the networking information is shared by all tasks in a group.

apollo13 avatar Oct 31 '24 08:10 apollo13

Aside from "plugins may not properly follow the spec" I can think of only one Nomad reason to make this opt-out-able (or even opt-in). Job IDs (and group labels) can have ; in them, which would break the CNI_ARGS spec. This is a valid job!

job "semi;colon" {
  type = "batch"
  group "et;tu;groupname" {
    network {
      mode = "bridge"
    }
    task "t" {
      driver = "docker"
      config {
        image = "hello-world"
      }
    }
  }
}

$ nomad run semicolon.nomad.hcl
...
$ nomad status semi
ID            = semi;colon
...
Allocations
ID        Node ID   Task Group       Version  Desired  Status
1f795fc0  f11e544f  et;tu;groupname  0        run      complete

That job would suddenly break if ID/group were sent in CNI_ARGS (bridge CNI would barf; see comment from cni_args PR: https://github.com/hashicorp/nomad/pull/23538#discussion_r1676196194).

Maybe we shouldn't need to protect users from this tomfoolery, and it may be exceedingly rare (or never happen!), but it could cause a bad day for someone. I would love to be able to say "just change the job ID" but who knows what wacky conventions people have wrapped up in their IDs, so it would be nice to have some kind of opt-out mechanism, either in jobspec or client config. In this weird example, the "error" is in the spec, only affects this one job, so opting out would make sense to include there. But people may have plugins that handle it poorly, which would be a client/node concern...

I'm on the fence. Do y'all have any more thoughts here?


As for other args to include, I think this is a nice set to start with, and folks can request more later if the need arises.

gulducat avatar Oct 31 '24 16:10 gulducat

Job IDs (and group labels) can have ; in them, which would break the CNI_ARGS spec. This is a valid job!

Sigh, and according to the spec there seems to be no way to escape that. We could escape it in json though ;)

Maybe we shouldn't need to protect users from this tomfoolery, and it may be exceedingly rare (or never happen!), but it could cause a bad day for someone

Agreed. Sadly the non-restrictiveness of jobnames has already caused quite a few problems elsewhere.

so opting out would make sense to include there.

But then we'd need it on the job-level again and not on the client I fear.

Btw, this is completely ugly but mostly a matter of documentation. We could encode the semicolon using \x… or similar -- this way we'd have an extra parse step in the plugins but it would be safe for everything.

apollo13 avatar Oct 31 '24 19:10 apollo13

We kicked this around internally, and maybe this is a nice simple solution that puts the onus on job authors to not use semicolons, instead of plugin authors to parse them: Only set the CNI arg if the value does not contain a ";". Then the code is a simple if condition in the Nomad client, and nothing extra required on the plugin side.

How does that sound to you @apollo13 ?

gulducat avatar Nov 05 '24 20:11 gulducat

And what is the plugin supposed to do then? Throw it's hands up in the air or just panic and not assign an IP to the container :dagger:

Joking aside, I think that would be an even harder sell. If a job author uses ";" in the name and the plugin does something different because it has no name (ie network policies can't be applied) then who is supposed to figure out what went wrong and how? Sounds like a debugging nightmare.

apollo13 avatar Nov 05 '24 20:11 apollo13

If it could degrade gracefully, depending on the plugin, then maybe "do something different", but I would exit with a non-zero exit code and output a message with whatever required info is missing (job ID and/or group name). We'd have documentation specifying these limits, maybe log WARN in Nomad agent logs that we're skipping the value, and if the CNI plugin makes a fuss and fails then that gets surfaced too.

e.g. the existing bridge plugin fails if it encounters a spec-breaking ; and that shows up in Nomad client agent logs:

[ERROR] allocrunner/alloc_runner.go:383: client.alloc_runner: prerun failed: alloc_id=e42a7676-31f6-ca3c-2d4d-9945e82b3759 error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): ARGS: invalid pair "colon""

and it also shows up as a task event in alloc status:

Recent Events: Time Type Description 2024-11-05T14:57:29-05:00 Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): ARGS: invalid pair "colon"

I'll grant you the log is rather verbose for a short message from the plugin, but it's pretty discoverable, eh?

gulducat avatar Nov 05 '24 21:11 gulducat

Works for me, just wanted to be sure that this really what we want. That said, I personally don't have any use for jobnames containing a semicolon and would probably welcome a stricter validation there (yeah backwards compat is not going to allow for that -- though I wonder if it might be possible to make it stricter over multiple releases).

On Tue, Nov 5, 2024, at 22:19, Daniel Bennett wrote:

If it could degrade gracefully, depending on the plugin, then maybe "do something different", but I would exit with a non-zero exit code and output a message with whatever required info is missing (job ID and/or group name). We'd have documentation specifying these limits, maybe log WARN in Nomad agent logs that we're skipping the value, and if the CNI plugin makes a fuss and fails then that gets surfaced too.

e.g. the existing bridge plugin fails if it encounters a spec-breaking ; and that shows up in Nomad client agent logs:

[ERROR] allocrunner/alloc_runner.go:383: client.alloc_runner: prerun failed: alloc_id=e42a7676-31f6-ca3c-2d4d-9945e82b3759 error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): ARGS: invalid pair "colon""

and it also shows up as a task event in alloc status:

Recent Events: Time Type Description 2024-11-05T14:57:29-05:00 Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): ARGS: invalid pair "colon"

I'll grant you the log is rather verbose for a short message from the plugin, but it's pretty discoverable, eh?

— Reply to this email directly, view it on GitHub https://github.com/hashicorp/nomad/pull/24319#issuecomment-2458170021, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAT5C3ZWNDG2HOXDCBBE5LZ7EY5NAVCNFSM6AAAAABQZ3C7HOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINJYGE3TAMBSGE. You are receiving this because you were mentioned.Message ID: @.***>

apollo13 avatar Nov 05 '24 21:11 apollo13

Yeah I think the safe thing for today that will also work for 99.9% (totally guessing at that number) of jobs is to:

  • skip values that contain ";"
  • log that we did so
  • document this behavior (bottom of networking/cni doc should be good, I think)
  • test it

and otherwise always provide these defaults (except job id instead of name).

Bonus discoverability option would communicate directly to users earlier on, too:

  • issue a warning at job submission time if any network.mode = "cni/*" and job id/group name(s) contain ";"

That all work for you @apollo13 ? You up for implementing these changes? :grin:

gulducat avatar Nov 06 '24 19:11 gulducat

Works for me yeah, should be able to implement it as well. No promises on the time-frame though since I am swamped with other work currently and don't want to commit to anything…

apollo13 avatar Nov 06 '24 19:11 apollo13

@gulducat Found some time after all (weekends and such ;)). Feel free to adjust the PR as needed, especially if it is less work than playing back & forth with me.

apollo13 avatar Nov 10 '24 19:11 apollo13

Your changes look good imo, I especially like that we don't have duplicate the env names now.

apollo13 avatar Nov 12 '24 18:11 apollo13

title & description update, let me know if anything else is missing.

apollo13 avatar Nov 18 '24 21:11 apollo13

Thanks @apollo13!

schmichael avatar Nov 19 '24 20:11 schmichael

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Mar 20 '25 02:03 github-actions[bot]