terraform-provider-external icon indicating copy to clipboard operation
terraform-provider-external copied to clipboard

Long running external program always fails

Open seventieskid opened this issue 3 years ago • 3 comments

Terraform CLI and Provider Versions

Terraform v1.3.1
on darwin_amd64
+ provider registry.terraform.io/hashicorp/external v2.2.2

Terraform Configuration

variable "sleep_mins" {
  type = string
  default = "0"
}

data "external" "test" {
  program = ["go", "run", "main.go"]

  query = {
    sleepMins = var.sleep_mins
  }
}

output "nap_duration" {
  value = format("Slept for %s minutes", data.external.test.result["sleepMins"])
}

Expected Behavior

Terraform (and external program) run successfully.

Returns:-

Outputs:

nap_duration = "Slept for 30 minutes"

Actual Behavior

  • When the invoked external program (main.go) runs for 30 mins or 60 mins, it returns the below error 100% of the time.
│ Error: External Program Execution Failed
│ 
│   with data.external.test,
│   on data.tf line 2, in data "external" "test":
│    2:   program = ["go", "run", "main.go"]
│ 
│ The data source received an unexpected error while attempting to execute the program.
│ 
│ The program was executed, however it returned no additional error messaging.
│ 
│ Program: /usr/local/go/bin/go
│ State: signal: killed
  • For execution durations at 0mins, 10mins, 15mins, 20mins, it runs successfully 100% of the time.

Steps to Reproduce

  1. terraform init
  2. terraform apply -var sleeps_mins=30

How much impact is this issue causing?

The only way around the problem is to drop back to v2.1.0 of the external provider. That's fine, and low impact for now.

Logs

https://gist.github.com/seventieskid/5db0865d84d0796e6893f32560b9997a

Additional Information

TF trace showing error (30 minute sleep in external program): https://gist.github.com/seventieskid/5db0865d84d0796e6893f32560b9997a#file-tf_30mins_fail-log

TF trace showing success (no sleep in external program):-

https://gist.github.com/seventieskid/406baf458ff50dcb9adb6ac3ff870435#file-tf_nosleep-pass-log

External program is here:-

https://gist.github.com/seventieskid/440cd4f949030f0c17dddb0c2061c5ea#file-main-go

Use Case

We have an external program (main.go) that invokes a tool that takes 50 mins to complete. So we must wait for it to complete. Terraform needs to interpret the output of that tool, hence we've opted for the TF external provider.

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

seventieskid avatar Oct 17 '22 14:10 seventieskid

Hi @seventieskid 👋 Thank you for raising this and sorry you ran into trouble here.

The additional use case context is helpful here since it seems you need the functionality of this provider to run a command, rather than using something like the time_sleep resource to just introduce a waiting period into your Terraform configuration. 👍

One thing to double check on your end is to ensure that the operating system is not killing the process due to resource constraints, such as running out of memory. For example, Linux distributions tend to use the oom-killer process for handling these situation, which will leave system logs about processes that are killed in this manner.

That said, I'm guessing it is an implementation detail of os/exec.Command where it surfaces that the process was killed when the context.Context deadline is reached, rather than displaying the normal context: deadline exceeded error in that situation. The external data source is currently implemented using terraform-plugin-sdk and its schema.Resource type ReadContext field. This implicitly introduces a 20 minute data source timeout by default. The implementation can be switched to the ReadWithTimeout field to remove that 20 minute timeout. I can submit that change in a few minutes to remove that potential issue.

bflad avatar Oct 17 '22 16:10 bflad

@bflad - thanks for the quick reply. Yea, the time_zleep resource is not of use for my use case. I just threw a sleep into my cut-down example to provide a trival example of the problem.

Your explanation about context timing out is completely consistent with the behaviour I'm seeing and one of the few things that did change between the working version 2.1.0 and 2.2.2.

I guess the new ReadWithTimeout would need an additional configurable optional parameter on the provider ?

seventieskid avatar Oct 17 '22 18:10 seventieskid

I can confirm that I see similar killed process behaviors locally on macOS:

terraform {
  required_providers {
    external = {
      source  = "hashicorp/external"
      version = "2.2.2"
    }
  }
}

data "external" "test" {
  program = ["sleep", "1201"]
}
$ terraform apply
...
│ Error: External Program Execution Failed
│ 
│   with data.external.test,
│   on main.tf line 11, in data "external" "test":
│   11:   program = ["sleep", "1201"]
│ 
│ The data source received an unexpected error while attempting to execute the program.
│ 
│ The program was executed, however it returned no additional error messaging.
│ 
│ Program: /bin/sleep
│ State: signal: killed

I have submitted #165 with the fix and a covering test. I don't believe the provider implementation needs an additional flag whether to enable the prior 20 minute timeout behavior, since it is fixing a confusing implementation detail. If practitioners do want to raise timeout errors based on a configurable timeout, then that can be submitted as a separate feature request.

bflad avatar Oct 17 '22 20:10 bflad

Version 2.2.3 was released today with this fix. Thank you again for the report.

bflad avatar Nov 09 '22 20:11 bflad

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar May 23 '24 09:05 github-actions[bot]