
Multiple S3 artifact downloads fail

freef4ll opened this issue 2 years ago · 3 comments

Nomad version

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

Ubuntu 22.04

Issue

The job contains multiple artifacts that are fetched over an S3 interface (a local MinIO install). When multiple instances of the job are executed, it becomes apparent that the download quite often simply fails. With rescheduling and restarts intentionally disabled, the issue becomes very frequent.

The issue doesn't reproduce when downloading over HTTP. The cause may be that go-getter expects a clean destination directory - https://github.com/hashicorp/go-getter/blob/main/get_s3.go#L95-L106 ?
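
To illustrate the suspicion (a minimal sketch of the hypothesis only, not actual go-getter code): if two concurrent downloads share the same destination directory and each clears it before writing, the later clear removes files the earlier download already fetched.

# Hypothetical illustration of the suspected race; both artifact blocks share "local/"
mkdir -p local
touch local/conf-aa                     # first download finishes and writes its file
rm -rf local/* && touch local/bin-aa    # second download "cleans" the directory before writing
ls local/                               # only bin-aa remains; conf-aa has been deleted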

Reproduction steps

  1. Set up a local S3 MinIO or LocalStack (not tried)
  2. Launch in a loop:
while true; do nomad stop downloader; nomad run download-error.nomad; date; sleep 3; done
  3. See that not all jobs have been able to start up (see the status commands below)
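
The failed allocations from step 3 can be checked with the usual Nomad CLI commands (a sketch; the job name matches the job file below, and the allocation ID placeholder comes from the first command's output):

nomad job status downloader          # lists allocations; failed ones never reach "running"
nomad alloc status <alloc-id>        # shows the "Failed Artifact Download" task event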

Expected Result

Actual Result

Job file (if appropriate)

variables {
  config     = "s3::http://local-minio:9000/test/config"
  release    = "s3::http://local-minio:9000/test/release"
  aws_key    = "key"
  aws_secret = "secret"
}

job "downloader" {
  datacenters = ["dc1"]
  group "downloader" {
    count = 3
    restart {
      attempts = "0"
      mode     = "fail"
    }
    reschedule {
      attempts  = "0"
      interval  = "3m"
      delay     = "30s"
      unlimited = false
    }
    task "artifact" {
      driver = "docker"
      config {
        image = "hashicorp/http-echo:alpine"
        args  = ["-text", "Hello World!"]
      }
      artifact {
        source = var.config
        #      destination = "local/config"
        options {
          aws_access_key_id     = var.aws_key
          aws_access_key_secret = var.aws_secret
        }
      }

      artifact {
        source = var.release
        #      destination = "local/release"
        options {
          aws_access_key_id     = var.aws_key
          aws_access_key_secret = var.aws_secret
        }
      }
      resources {
        memory = 10
        cpu    = 10
      }
    }
  }
  update {
    max_parallel      = "3"
    min_healthy_time  = "5s"
    healthy_deadline  = "45s"
    progress_deadline = "0"
    health_check      = "task_states"
  }
  meta {
    run_uuid = "${uuidv4()}"
  }
#  constraint {
#    attribute = "${node.unique.name}"
#    value     = "my-node"
#  }
}

Sample job to bring up a MinIO instance:

# mc alias set local-minio http://local-minio:9000 minioadmin minioadmin
# mc mb local-minio/test
# mc admin user add local-minio key secret
# mc admin policy attach local-minio readwrite --user key

job "minio" {
  datacenters = ["dc1"]
  type        = "service"

  group "minio" {
    network {
      port "minio-s3" {
        static = 9000
      }
      port "minio-admin" {
        static = 9001
      }
    }

    task "minio-task" {
      driver = "docker"
      config {
        image         = "minio/minio:latest"
        args          = ["server", "/data", "--console-address", ":9001"]
        ports         = ["minio-s3", "minio-admin"]
        volumes       = ["minio-data:/data"]
        volume_driver = "local"
      }

      env {
        MINIO_ROOT_USER     = "minioadmin"
        MINIO_ROOT_PASSWORD = "minioadmin"
      }

      resources {
        cpu    = 50
        memory = 900
      }

      service {
        name = "minio-storage"
        tags = ["minio", "urlprefix-/minio", "urlprefix-:9000 proto=tcp"]
        port = "minio-s3"
        check {
          type     = "tcp"
          interval = "60s"
          timeout  = "5s"
        }
      }
    }
  }
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }
}

Generate some random small files to fetch and upload to the appropriate bucket:

mkdir -p config; cd config; head -c 100M /dev/urandom | split -b 5M - conf-; cd -
mkdir -p release; cd release; head -c 200M /dev/urandom | split -b 15M - bin-; cd -
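
The upload step itself isn't shown above; assuming the mc alias and bucket created for the MinIO job, something along these lines would place the files where the artifact URLs expect them (paths are assumptions):

mc cp --recursive config/ local-minio/test/config/
mc cp --recursive release/ local-minio/test/release/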

Nomad Server logs (if appropriate)

The 2-3 downloads are happening at the same time, racing with each other, and appear to be deleting files that they did not download:

2023-09-01T12:32:30.531Z [DEBUG] client.alloc_runner.task_runner.task_hook.artifacts: downloading artifact: alloc_id=5c71bfbd-f558-2258-ce40-cd5922d96114 task=artifact artifact=s3::http://minio:9000/smoketest/test/release aid=aNUpzJ5Jm3rHA8YDB078IjiERLKDtN2JESdPxC7KqZk
2023-09-01T12:32:30.531Z [DEBUG] client.artifact: get: source=s3::http://minio:9000/smoketest/test/release destination=local/
2023-09-01T12:32:30.568Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79bc7b23-2acc-9d54-e031-aadddcd5907e task=artifact type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
2023-09-01T12:32:30.581Z [DEBUG] client.artifact: sub-process: OUTPUT="artifact download was a success"
2023-09-01T12:32:30.581Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=78a6ae90-018d-3916-494c-3651b4550bd0 task=artifact type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
2023-09-01T12:32:30.589Z [DEBUG] client.alloc_runner.task_runner.task_hook.artifacts: downloading artifact: alloc_id=79bc7b23-2acc-9d54-e031-aadddcd5907e task=artifact artifact=s3::http://minio:9000/smoketest/test/config aid=HXHv8/JexVchiUDSg7IMJtsY5MvujRXmfT5BC79JSDY
2023-09-01T12:32:30.589Z [DEBUG] client.artifact: get: source=s3::http://minio:9000/smoketest/test/config destination=local/
2023-09-01T12:32:30.589Z [DEBUG] client.alloc_runner.task_runner.task_hook.artifacts: downloading artifact: alloc_id=79bc7b23-2acc-9d54-e031-aadddcd5907e task=artifact artifact=s3::http://minio:9000/smoketest/test/release aid=aNUpzJ5Jm3rHA8YDB078IjiERLKDtN2JESdPxC7KqZk
2023-09-01T12:32:30.590Z [DEBUG] client.artifact: get: source=s3::http://minio:9000/smoketest/test/release destination=local/
2023-09-01T12:32:30.597Z [DEBUG] client.alloc_runner.task_runner.task_hook.artifacts: downloading artifact: alloc_id=78a6ae90-018d-3916-494c-3651b4550bd0 task=artifact artifact=s3::http://minio:9000/smoketest/test/config aid=HXHv8/JexVchiUDSg7IMJtsY5MvujRXmfT5BC79JSDY
2023-09-01T12:32:30.597Z [DEBUG] client.artifact: get: source=s3::http://minio:9000/smoketest/test/config destination=local/
2023-09-01T12:32:30.597Z [DEBUG] client.alloc_runner.task_runner.task_hook.artifacts: downloading artifact: alloc_id=78a6ae90-018d-3916-494c-3651b4550bd0 task=artifact artifact=s3::http://minio:9000/smoketest/test/release aid=aNUpzJ5Jm3rHA8YDB078IjiERLKDtN2JESdPxC7KqZk
2023-09-01T12:32:30.598Z [DEBUG] client.artifact: get: source=s3::http://minio:9000/smoketest/test/release destination=local/
2023-09-01T12:32:30.699Z [DEBUG] client.artifact: sub-process: OUTPUT="artifact download was a success"
2023-09-01T12:32:30.702Z [DEBUG] client.artifact: sub-process: OUTPUT="artifact download was a success"
2023-09-01T12:32:31.198Z [ERROR] client.artifact: sub-process: OUTPUT="failed to download artifact: error downloading 'http://minio:9000/smoketest/test/release?aws_access_key_id=key&aws_access_key_secret=secret': chmod /var/nomad/alloc/79bc7b23-2acc-9d54-e031-aadddcd5907e/artifact/local/bin-aa: no such file or directory"
2023-09-01T12:32:31.198Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79bc7b23-2acc-9d54-e031-aadddcd5907e task=artifact type="Failed Artifact Download" msg="failed to download artifact \"s3::http://minio:9000/smoketest/test/release\": getter subprocess failed: exit status 1" failed=false
2023-09-01T12:32:31.254Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=79bc7b23-2acc-9d54-e031-aadddcd5907e task=artifact error="prestart hook \"artifacts\" failed: failed to download artifact \"s3::http://minio:9000/smoketest/test/release\": getter subprocess failed: exit status 1"
2023-09-01T12:32:31.254Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=79bc7b23-2acc-9d54-e031-aadddcd5907e task=artifact reason="Policy allows no restarts"
2023-09-01T12:32:31.254Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79bc7b23-2acc-9d54-e031-aadddcd5907e task=artifact type="Not Restarting" msg="Policy allows no restarts" failed=true

Nomad Client logs (if appropriate)

Failed Artifact Download | failed to download artifact "s3::http://minio:9000/smoketest/test/release": getter subprocess failed: exit status 1

freef4ll · Sep 01 '23 12:09

Hi @freef4ll! I was able to reproduce this as you've shown here. Weirdly, I can only reproduce it when there are both multiple allocations and multiple artifact blocks. Here are the client logs I got:

Client logs:
   2023-09-08T15:45:03.725-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=90de3bc5-4c94-343d-b1f0-92b3061bc856 task=artifact type=Received msg="Task received by client" failed=false
   2023-09-08T15:45:03.725-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=90de3bc5-4c94-343d-b1f0-92b3061bc856 task=artifact type="Task Setup" msg="Building Task Directory" failed=false
   2023-09-08T15:45:03.725-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a8c51288-5d8b-491e-7743-d961cb9acc7f task=artifact type=Received msg="Task received by client" failed=false
   2023-09-08T15:45:03.725-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1e3c11ca-e67d-9658-17fe-b312d69ef822 task=artifact type=Received msg="Task received by client" failed=false
   2023-09-08T15:45:03.725-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a8c51288-5d8b-491e-7743-d961cb9acc7f task=artifact type="Task Setup" msg="Building Task Directory" failed=false
   2023-09-08T15:45:03.725-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1e3c11ca-e67d-9658-17fe-b312d69ef822 task=artifact type="Task Setup" msg="Building Task Directory" failed=false
   2023-09-08T15:45:03.750-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=90de3bc5-4c94-343d-b1f0-92b3061bc856 task=artifact type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
   2023-09-08T15:45:03.751-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1e3c11ca-e67d-9658-17fe-b312d69ef822 task=artifact type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
   2023-09-08T15:45:03.757-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a8c51288-5d8b-491e-7743-d961cb9acc7f task=artifact type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
   2023-09-08T15:45:03.799-0400 [ERROR] client.artifact: sub-process: OUTPUT="failed to download artifact: error downloading 'http://192.168.1.221:9000/test/release?aws_access_key_id=key&aws_access_key_secret=password1': chmod /run/nomad/dev/data/alloc/a8c51288-5d8b-491e-7743-d961cb9acc7f/artifact/local/bin-aa: no such file or directory"
   2023-09-08T15:45:03.801-0400 [ERROR] client.artifact: sub-process: OUTPUT="failed to download artifact: error downloading 'http://192.168.1.221:9000/test/release?aws_access_key_id=key&aws_access_key_secret=password1': chmod /run/nomad/dev/data/alloc/90de3bc5-4c94-343d-b1f0-92b3061bc856/artifact/local/bin-aa: no such file or directory"
   2023-09-08T15:45:03.840-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=90de3bc5-4c94-343d-b1f0-92b3061bc856 task=artifact type="Failed Artifact Download" msg="failed to download artifact \"s3::http://192.168.1.221:9000/test/release\": getter subprocess failed: exit status 1" failed=false
   2023-09-08T15:45:03.841-0400 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=90de3bc5-4c94-343d-b1f0-92b3061bc856 task=artifact error="prestart hook \"artifacts\" failed: failed to download artifact \"s3::http://192.168.1.221:9000/test/release\": getter subprocess failed: exit status 1"
   2023-09-08T15:45:03.841-0400 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=90de3bc5-4c94-343d-b1f0-92b3061bc856 task=artifact reason="Policy allows no restarts"
   2023-09-08T15:45:03.841-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=90de3bc5-4c94-343d-b1f0-92b3061bc856 task=artifact type="Not Restarting" msg="Policy allows no restarts" failed=true
   2023-09-08T15:45:03.851-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a8c51288-5d8b-491e-7743-d961cb9acc7f task=artifact type="Failed Artifact Download" msg="failed to download artifact \"s3::http://192.168.1.221:9000/test/release\": getter subprocess failed: exit status 1" failed=false
   2023-09-08T15:45:03.851-0400 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=a8c51288-5d8b-491e-7743-d961cb9acc7f task=artifact error="prestart hook \"artifacts\" failed: failed to download artifact \"s3::http://192.168.1.221:9000/test/release\": getter subprocess failed: exit status 1"
   2023-09-08T15:45:03.851-0400 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=a8c51288-5d8b-491e-7743-d961cb9acc7f task=artifact reason="Policy allows no restarts"
   2023-09-08T15:45:03.851-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a8c51288-5d8b-491e-7743-d961cb9acc7f task=artifact type="Not Restarting" msg="Policy allows no restarts" failed=true
   2023-09-08T15:45:03.991-0400 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=bb155e07ef7924c276e405c82d6f91e3a4416d55b9645e072d72e99db9e6d8ec
   2023-09-08T15:45:04.210-0400 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=bb155e07ef7924c276e405c82d6f91e3a4416d55b9645e072d72e99db9e6d8ec
   2023-09-08T15:45:04.229-0400 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1e3c11ca-e67d-9658-17fe-b312d69ef822 task=artifact type=Started msg="Task started by client" failed=false

Allocations are sandboxed and their downloads go into entirely separate allocation directories, so there's no way that go-getter could be interfering across allocations. But I suspect what's happening here is that when there are multiple allocations, MinIO gets slowed down enough that whatever timing effect we're hitting between artifact blocks of the same allocation becomes apparent.
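
Roughly, based on the paths in the logs above, each allocation's artifact downloads land in a separate sandbox, while the two artifact blocks of one allocation share the same local/ directory (allocation IDs are placeholders):

/var/nomad/alloc/<alloc-id-1>/artifact/local/   # both artifact blocks of alloc 1 write here
/var/nomad/alloc/<alloc-id-2>/artifact/local/   # both artifact blocks of alloc 2 write here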

This definitely feels like a race condition to me. Fortunately, I was able to work around it by ensuring that Nomad creates the directory instead of go-getter, by changing the artifact blocks as follows:

artifact {
  source      = var.config
  destination = "local/config"
  mode        = "dir"
  options {
    aws_access_key_id     = var.aws_key
    aws_access_key_secret = var.aws_secret
  }
}

artifact {
  source      = var.release
  destination = "local/release"
  mode        = "dir"
  options {
    aws_access_key_id     = var.aws_key
    aws_access_key_secret = var.aws_secret
  }
}

That gets me the following directories in the allocations, which look right to me:

$ nomad alloc fs de64 artifact/local
Mode        Size   Modified Time              Name
drwxr-xr-x  240 B  2023-09-08T15:55:55-04:00  config/
drwxr-xr-x  180 B  2023-09-08T15:55:55-04:00  release/

$ nomad alloc fs de64 artifact/local/config
Mode        Size     Modified Time              Name
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-aa
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-ab
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-ac
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-ad
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-ae
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-af
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-ag
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-ah
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-ai
-rw-rw-rw-  5.0 MiB  2023-09-08T15:55:55-04:00  conf-aj

$ nomad alloc fs de64 artifact/local/release
Mode        Size    Modified Time              Name
-rw-rw-rw-  15 MiB  2023-09-08T15:55:55-04:00  bin-aa
-rw-rw-rw-  15 MiB  2023-09-08T15:55:55-04:00  bin-ab
-rw-rw-rw-  15 MiB  2023-09-08T15:55:55-04:00  bin-ac
-rw-rw-rw-  15 MiB  2023-09-08T15:55:55-04:00  bin-ad
-rw-rw-rw-  15 MiB  2023-09-08T15:55:55-04:00  bin-ae
-rw-rw-rw-  15 MiB  2023-09-08T15:55:55-04:00  bin-af
-rw-rw-rw-  10 MiB  2023-09-08T15:55:55-04:00  bin-ag
$

So that should work around the issue for you. In the meantime, I'll mark this issue for roadmapping.

tgross · Sep 08 '23 19:09

Thanks for the detailed analysis @tgross !

Unfortunately, having Nomad create the directory as suggested still produces artifact download errors, albeit perhaps not as frequently.

freef4ll · Sep 11 '23 07:09

This error could be due to the old version of go-getter that Nomad is using. Nomad uses go-getter 1.7.0, which does static checking of the S3 path by splitting on ".". A newer version of go-getter (2.20) should fix the issue, so if Nomad used the latest version of go-getter, the issue would likely be fixed.

MinhDBui · Mar 26 '24 12:03