terraform-provider-docker
Image build fails if it takes longer than 20 minutes
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Terraform (and docker Provider) Version
Terraform v1.0.3
on darwin_amd64
+ provider registry.terraform.io/kreuzwerker/docker v2.14.0
Affected Resource(s)
- docker_image
- docker_registry_image
Expected Behaviour
The image is built by Terraform.
Actual Behaviour
...
module.foo.docker_registry_image.this["pytorch-gpu"]: Still creating... [19m50s elapsed]
╷
│ Error: Error building docker image: context deadline exceeded
│
│ with module.foo.docker_registry_image.this["pytorch-gpu"],
│ on ../main.tf line 153, in resource "docker_registry_image" "this":
│ 153: resource "docker_registry_image" "this" {
│
╵
Steps to Reproduce
Try to build a Docker image in Terraform for the following Dockerfile:
FROM busybox:latest
RUN sleep 1201
Important Factoids
The timeout also covers uploading the image.
Building a Docker image for a data science environment usually means downloading or compiling big software packages (PyTorch + CUDA in my case) that result in 2-6 GB images. Building and uploading such an image sometimes takes longer than 20 minutes.
The workaround is to build it with docker build, docker tag and docker push, and hope that Docker replies quickly on the next terraform apply. That's not always the case for me, but it's likely due to my setup.
Also, I haven't found a way to watch the progress of creating a docker_registry_image, so I set up a Unix socket "proxy" with socat -d -v -d TCP-L:2375,fork UNIX:/var/run/docker.sock and pointed the Terraform Docker provider at tcp://localhost:2375. Is there a better way?
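For anyone who wants to try the same socat setup, the provider side might look roughly like this. It is a minimal sketch that assumes the socat command above is already running and listening on port 2375; `host` is the provider's standard argument for the Docker daemon address.

```hcl
# Sketch: send the provider's Docker API traffic through the local socat
# proxy so the requests and build output show up in socat's verbose log.
# Assumes socat is already forwarding TCP 2375 to /var/run/docker.sock.
provider "docker" {
  host = "tcp://localhost:2375"
}
```

The proxy only makes the traffic visible; it does not change the 20-minute limit.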
I could reproduce a similar error with the docker_image resource.
$ terraform version
Terraform v1.0.3
on darwin_amd64
+ provider registry.terraform.io/kreuzwerker/docker v2.14.0
docker version
$ docker version
Client:
Cloud integration: 1.0.17
Version: 20.10.7
API version: 1.41
Go version: go1.16.4
Git commit: f0df350
Built: Wed Jun 2 11:56:22 2021
OS/Arch: darwin/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
Built: Wed Jun 2 11:54:58 2021
OS/Arch: linux/amd64
Experimental: true
containerd:
Version: 1.4.6
GitCommit: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0
main.tf
resource "docker_image" "zoo" {
  name = "zoo"
  build {
    path = "."
  }
}

terraform {
  required_providers {
    docker = {
      source  = "kreuzwerker/docker"
      version = "2.14.0"
    }
  }
}

provider "docker" {
}
Dockerfile
FROM busybox:latest
RUN sleep 1201
$ TF_LOG=debug terraform apply -auto-approve
docker_image.zoo: Still creating... [19m40s elapsed]
docker_image.zoo: Still creating... [19m50s elapsed]
2021-08-02T20:31:08.545+0900 [INFO] provider.terraform-provider-docker_v2.14.0: 2021/08/02 20:31:08 [DEBUG] Step 1/2 : FROM busybox:latest
latest: Pulling from library/busybox
b71f96345d44: Pulling fs layer
b71f96345d44: Download complete
b71f96345d44: Pull complete
Digest: sha256:0f354ec1728d9ff32edcd7d1b8bbdfc798277ad36120dc3dc683be44524c8b60
Status: Downloaded newer image for busybox:latest
---> 69593048aa3a
Step 2/2 : RUN sleep 1201
---> Running in 761974235ec3: timestamp=2021-08-02T20:31:08.545+0900
╷
│ Error: Unable to read Docker image into resource: unable to list Docker images: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/images/json": context deadline exceeded
│
│ with docker_image.zoo,
│ on main.tf line 1, in resource "docker_image" "zoo":
│ 1: resource "docker_image" "zoo" {
│
╵
2021-08-02T20:31:08.575+0900 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2021-08-02T20:31:08.577+0900 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/kreuzwerker/docker/2.14.0/darwin_amd64/terraform-provider-docker_v2.14.0 pid=31909
2021-08-02T20:31:08.577+0900 [DEBUG] provider: plugin exited
Debug log: https://gist.github.com/suzuki-shunsuke/1e56152acad81333cbab0b47bc6fa92b
https://github.com/kreuzwerker/terraform-provider-docker/blob/934bc2935f6cf0890a222c6c8acbbca7cd56739b/internal/provider/resource_docker_image_funcs.go#L38-L40
https://github.com/kreuzwerker/terraform-provider-docker/blob/934bc2935f6cf0890a222c6c8acbbca7cd56739b/internal/provider/resource_docker_image_funcs.go#L150-L153
But I can't find the timeout setting.
The timeout itself comes from https://github.com/hashicorp/terraform-plugin-sdk/blob/112e2164c381d80e8ada3170dac9a8a5db01079a/helper/schema/resource_data.go#L409-L415.
We might need to support a separate timeouts block: https://www.terraform.io/docs/language/resources/syntax.html#operation-timeouts
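To make the proposal concrete, here is roughly what that could look like from the user's side. This is only a sketch of the standard Terraform operation-timeouts syntax applied to the reproduction config above. Provider v2.14.0 does not accept this block, so it is hypothetical until the resource declares timeout support, and the "60m" value is an arbitrary example.

```hcl
resource "docker_image" "zoo" {
  name = "zoo"
  build {
    path = "."
  }

  # Hypothetical: not accepted by provider v2.14.0. It would only work
  # once the resource opts in to configurable timeouts.
  timeouts {
    create = "60m" # example value; anything above the 20-minute default
  }
}
```

The same block would be needed on docker_registry_image, since the original report hit the limit there.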
FYI,
The workaround is to build it with docker build, docker tag and docker push, and hope that Docker replies quickly on the next terraform apply. That's not always the case for me, but it's likely due to my setup.
I figured out that docker build . uses BuildKit whereas this provider doesn't, which is probably why they didn't share build caches.
export DOCKER_BUILDKIT=0 solved it for me.
This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.
If you don't want this issue to be closed, please set the label pinned.
This seems like a trivial change. I haven't contributed to this repo but if no one is looking into the issue, I might try.
Oh, I forgot about this one. I will have a look next week.
Could this be reopened?
Any plans to specify a custom timeout larger than 20 minutes?
Is this possible to fix?
I could use a fix as well. I see a PR is waiting.
I need this too.
We need it too. My docker build downloads many pip packages, takes more than 20 minutes, and fails under Terraform. This can be fixed by implementing customizable timeouts: https://developer.hashicorp.com/terraform/plugin/sdkv2/resources/retries-and-customizable-timeouts