terraform-provider-nomad icon indicating copy to clipboard operation
terraform-provider-nomad copied to clipboard

Terraform fails to notice when a Nomad job has changed

Open hashibot opened this issue 7 years ago • 40 comments

This issue was originally opened by @blalor as hashicorp/terraform#14038. It was migrated here as part of the provider split. The original body of the issue is below.


Terraform Version

Terraform v0.9.3

Affected Resource(s)

  • nomad_job

Terraform Configuration Files

provider "nomad" {
    address = "http://localhost:4646"
}

resource "nomad_job" "example" {
    jobspec = "${file("${path.module}/example.nomad")}"
}

Debug Output

Log for 2nd terraform apply: apply.log.txt

Console output:

$ TF_LOG=TRACE TF_LOG_PATH=apply.log.txt terraform apply
nomad_job.example: Refreshing state... (ID: example)

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Expected Behavior

Terraform should have noticed that the job had changed in the cluster and updated it.

Actual Behavior

Nuttin', honey.

Steps to Reproduce

  1. nomad agent -dev

  2. terraform apply

  3. nomad status example shows one allocation for task group grp

  4. change task group count:

     curl -sfS localhost:4646/v1/job/example | \
         jq '.TaskGroups[0].Count = 2 | {"Job": .}' | \
         curl -is -X PUT -d@- localhost:4646/v1/job/example
    
  5. nomad status example shows two allocations for example.grp

  6. terraform plan shows no change

  7. terraform apply makes no change

  8. nomad status example still shows two allocations

hashibot avatar Jun 13 '17 20:06 hashibot

@paddycarver / @grubernaut / @radeksimko, is this a verified issue/bug? I'm seeing a slightly different issue, but not sure if it's the same as this. If a job has been stopped, or is not running (but the spec has not changed), then the TF nomad provider should update nomad. It might be best to always have the job sent to nomad, and let nomad work out issues with differences in the spec? This issue makes the provider unusable for the most basic use cases.

ketzacoatl avatar Feb 20 '18 19:02 ketzacoatl

The underlying issue here also produces the following behavior:

  • run a job with the terraform provider
  • nomad stop that job manually
  • re-apply the terraform plan, the job is still stopped

ketzacoatl avatar Feb 22 '18 09:02 ketzacoatl

Hi @ketzacoatl! I can't say for sure whether this is a verified bug, nor can I explain the behaviour. I'll try to look into this soon and come back with a bit more information, and a fix if necessary. Apologies for the delays on this.

paddycarver avatar Feb 22 '18 21:02 paddycarver

hi @paddycarver, thanks for taking a look! were you able to confirm the behavior we see in practice?

ketzacoatl avatar Mar 23 '18 16:03 ketzacoatl

My observation is the same as @ketzacoatl.

It would be just more operator friendly if users could interact via Terraform, rather than nomad status

My versions are Terraform 0.11.5 Nomad (built from master) (0.8.0-dev)

shantanugadgil avatar Mar 31 '18 04:03 shantanugadgil

Any updates on this @paddycarver?

Also, a question: Does the Terraform provider always submit the job to nomad, or does it decide whether or not it should?

ketzacoatl avatar Apr 26 '18 10:04 ketzacoatl

@paddycarver I took a quick look inside the codebase, and it seems the culprit is:

  • resourceJobRead is not implemented
  • there is no metadata to discern changes other than job ID within the schema
  • de-then-registering only occurs when ID itself changes

#15 by @apparentlymart adds more metadata, but modify_index wouldn’t pick up changes from the provider. I think “Solution 1” below is a nice way of dealing with it.

Solution 1: add status computed metadata to the schema, plus #15 other items

  • Upon read, check with job.Stopped() whether it should be re-registered
  • Notify through metadata if stopped that recreation is necessary

Pros

  • Smarter information for plan
  • Will always restart a stopped job according to configuration

Cons

  • plan now has to read more information before, possibly slowing it down
  • More changes to the codebase

Solution 2: Always de-register and re-register

  • Refactor resourceJobRegister so that it always de-then-re-register, regardless of configuration change or not.
  • Always mark resource job specs as tainted

Pros

  • Easy to add
  • Never leads to stale jobs not being picked up
  • Retains insanely fast plan with Nomad jobs

Cons

  • Users probably do not expect that to happen if they did not change their configurations
  • Ugly plans with unnecessary information

kerscher avatar May 03 '18 12:05 kerscher

@paddycarver, I'd love to help resolve this problem, as IMO it prevents serious use of Terraform to manage nomad, do you have any guidance or recommendations on how you would like to address the problem?

cc @katbyte / @radeksimko

ketzacoatl avatar Jun 28 '18 16:06 ketzacoatl

@mitchellh I'd be very grateful for your feedback/guidance here.

ketzacoatl avatar Jul 20 '18 23:07 ketzacoatl

@cgbaker, With renewed development efforts here, what are your thoughts on this issue?

ketzacoatl avatar Dec 06 '18 06:12 ketzacoatl

Hi @ketzacoatl, I wholeheartedly agree that this is something that must be addressed. This shortcoming makes the Nomad provider just about useless for Day 2 operations.

As noted in #15 , there are a few different options for addressing this. The solution partially hinges on whether we want to try to find a general solution that doesn't require modifying and re-releasing the provider every time Nomad releases a new version. On the other hand, with the Nomad product team taking ownership of this TF provider, we can potentially address such a workflow a little better than before. And it may be prudent to find a temporary solution to this issue while we work on a generic/unversioned provider.

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Thank you for your patience and persistence on this issue.

cgbaker avatar Dec 06 '18 17:12 cgbaker

And it may be prudent to find a temporary solution to this issue while we work on a generic/unversioned provider.

That would be great, WRT being able to continue using the provider while future improvements get worked out.

ketzacoatl avatar Dec 11 '18 21:12 ketzacoatl

Nomad 0.8.x API support is available in the Nomad v1.3.0 that released today. I will look at finding a longer-term solution for this in upcoming versions; having said that, even if we continue version the Nomad providers, we pledge to be much more responsive in updating the Nomad provider going forward.

cgbaker avatar Dec 20 '18 21:12 cgbaker

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Hello, just checking back up on this - have there been any improvements related to this?

ketzacoatl avatar Apr 04 '19 05:04 ketzacoatl

No update as of now, we've been focusing on core Nomad work, finishing the 0.9.0 release.

On Thu, Apr 4, 2019 at 1:02 AM ketzacoatl [email protected] wrote:

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Hello, just checking back up on this - have there been any improvements related to this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/terraform-providers/terraform-provider-nomad/issues/1#issuecomment-479750800, or mute the thread https://github.com/notifications/unsubscribe-auth/ABmPT9wkf1S_J33gI6wrFz0StYLsaQTuks5vdYdQgaJpZM4N5Ctd .

cgbaker avatar Apr 04 '19 17:04 cgbaker

Fair enough, thanks for the update!

ketzacoatl avatar Apr 11 '19 03:04 ketzacoatl

Update: this was not resolved in the 1.4.0 release of the provider, but we're still tracking this.

cgbaker avatar Jun 05 '19 22:06 cgbaker

I am stoked to see this on the 1.5 milestone, rock on!

ketzacoatl avatar Sep 13 '19 20:09 ketzacoatl

It have been a year since the last comment. Any update on this?

liemle3893 avatar Jul 14 '20 09:07 liemle3893

@cgbaker WRT the roadmap, are there refactors or internal changes that block fixing this properly?

ketzacoatl avatar Jul 14 '20 14:07 ketzacoatl

we were waiting for the update 2.0 plugin SDK to see if it helped deal with this. we're actively looking at it now; we want this issue dealt with in the next few months as part of the Nomad 1.0 milestone.

cgbaker avatar Jul 14 '20 19:07 cgbaker

Hi everyone,

This issue has been going for a while, but we are still actively tracking it. I also think we are overdue for a more thorough update, so here is a description of the problem and what we've tried so far (and why it didn't work).

Why this is happening

This issue happens because the job resource doesn’t have separate attributes for each possible field in the jobspec. Instead, it receives a jobspec as a string (or a path to file that is read into a string), and parses some of the fields into computed attributes.

Since all attributes are computed, Terraform is not able to automatically detect changes, so the provider needs a customized diff function to properly communicate to Terraform if anything has changed. Unfortunately Terraform doesn’t provide the current value stored in the state for computed fields to the diff function, so it can't actually detect changes.

What we've tried so far

Here are some of approaches we've tried and why they didn't work.

Declare all jobspec fields as attributes

Having the jobspec as the single attribute forces us to do Terraform’s job of computing the diff, which, as described above, we can’t do properly. Having the full jobspec described as the resource schema would be more in-line with the way Terraform works.

But even though both Nomad and Terraform use HCL, the structures supported by each of them are different, thus creating a new syntax for defining jobs that is not compatible with the nomad job run command nor the Nomad API.

For example, in Nomad a group is defined as:

job "example" {
  group "cache" {
    ...
  }
}

In Terraform this job would be declared as:

resource "nomad_job_new" "example" {
  name = "example"

  group {
    name = "cache"
  }
}

While having the entire jobspec as the single input for the resource is not ideal, it does provide a nice user experience, where the job is defined only once and can be reused in many different places.

You can check the code for this attempt here.

Having jobspec fields defined as optional attributes

If we keep the jobspec attribute but also have the other fields allowed in the job as optional attributes, we could keep the nice user experience of being able to reuse job files but also define some of the values as proper Terraform attributes.

Unfortunately this doesn’t work well because if an attribute is not marked as computed the provider is not able to set a value for it, so anything that is defined in the jobspec but not in the Terraform resource would not be tracked.

Having jobspec fields defined as optional and computed attributes

Terraform does allow attributes to be optional and computed at the same time. But this falls back to the original issue where the current state value for computed attributes are not available to customized diff functions.

I hope this explanation was helpful, and we will provide more updates as we go. In the meantime feel free to sends us any questions, comments or ideas.

lgfa29 avatar Jul 28 '20 16:07 lgfa29

Thanks for the update @lgfa29, it certainly gives clarity on why this is not trivial.

Just curious if you have a current working idea of how this should be addressed or if it's still being explored?

Legogris avatar Aug 03 '20 09:08 Legogris

Yes, we are still actively looking for a solution. I will make sure to post new updates, even for failed attempts.

lgfa29 avatar Aug 06 '20 21:08 lgfa29

Hi, I've not yet tried it but why not leverage the diff abilities from Nomad to make the diff? Either by exposing it in the Nomad client or adding an endpoint that would return the diff between the new job and the current one (but not actually submitting it).

remilapeyre avatar Aug 06 '20 21:08 remilapeyre

hi @remilapeyre , we already do that for CustomizeDiff: https://github.com/terraform-providers/terraform-provider-nomad/blob/master/nomad/resource_job.go#L518

the issue is that the current Terraform plugin SDK doesn't care about all those computed fields that we get from the Nomad API... we would need to update the jobspec. there are some options there, where we use it as a sentinel, which we're looking into. and we've had some discussion with the Terraform plugin SDK team. We are pursuing a few different options, one of which may just be a new, structured nomad_job resource which doesn't have these problems.

cgbaker avatar Aug 12 '20 19:08 cgbaker

@lgfa29 I like the progress in that PR, and I would love to see that. I have zero qualms as a user with cutting off a major version and breaking a lot of compatibility. We already do this today with the Vault provider where some of our modules depend on older versions and others on newer versions.

I am a BIG! +1 on

resource "nomad_job_new" "example" {
  name = "example"

  group {
    name = "cache"
  }
}

I can see how I could modularize this already and produce a terraform module to be consumed upstream from my team with a ton of defaults (e.g.: add sidecars automatically into folks tasks)

Thank you @cgbaker and @lgfa29 for the update. This explains a lot, I had wondered why Nomad's diff features weren't being used here and I did not understand the code itself super well.

chuckyz avatar Aug 21 '20 22:08 chuckyz

Thanks for the feedback @chuckyz. Creating a new resource seems to be the best approach we have right now, so we will move forward with that (but with a better name than nomad_job_new 😄).

lgfa29 avatar Aug 24 '20 15:08 lgfa29

Thanks for the feedback @chuckyz. Creating a new resource seems to be the best approach we have right now, so we will move forward with that (but with a better name than nomad_job_new smile).

Would it make sense to incorporate it in #149?

Legogris avatar Oct 20 '20 13:10 Legogris

Thanks for the feedback @chuckyz. Creating a new resource seems to be the best approach we have right now, so we will move forward with that (but with a better name than nomad_job_new smile).

Would it make sense to incorporate it in #149?

Yup. #149 is being developed exactly to solve this issue 👍

lgfa29 avatar Oct 29 '20 00:10 lgfa29