
Nomad does not clean up old periodic tasks

Open TheSpbra1n opened this issue 4 years ago • 14 comments

Nomad version

Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)

Operating system and Environment details

Codename: xenial

Issue

I set JobGCThreshold=48h in the config:

root@nomad-01:/home/user# curl -s  'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
      "JobGCThreshold": "48h"
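
(For reference, this value corresponds to the job_gc_threshold setting in the server stanza of the agent configuration; a minimal sketch, with everything around that one line being illustrative:)

```hcl
# Server agent config sketch; only job_gc_threshold = "48h" is taken
# from this report, the rest is illustrative.
server {
  enabled = true

  # How long a job must be terminal and unmodified before the server
  # garbage-collects it, along with its evaluations and allocations.
  job_gc_threshold = "48h"
}
```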

But in "nomad status" I still see a task that finished 6 days ago:

root@nomad-01:/home/user# nomad alloc status 118cf459-8aa2-fbf2-6e61-77ed211cb531
ID                  = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID             = 7ee96439
Name                = correct_suspended_tickets/periodic-1619002800.correct_suspended_tickets[0]
Node ID             = 263fdfd7
Node Name           = docker-15
Job ID              = correct_suspended_tickets/periodic-1619002800
Job Version         = 9
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 6d6h ago
Modified            = 2h42m ago

Task "correct_suspended_tickets" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  512 MiB  300 MiB

Task Events:
Started At     = 2021-04-21T11:01:42Z
Finished At    = 2021-04-21T11:01:44Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-04-27T17:35:21+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-27T17:35:18+03:00  Received    Task received by client
2021-04-26T21:20:15+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-26T21:20:12+03:00  Received    Task received by client
2021-04-26T01:04:06+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-26T01:04:04+03:00  Received    Task received by client
2021-04-24T08:20:08+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-24T08:20:00+03:00  Received    Task received by client
2021-04-22T18:38:15+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-21T14:01:44+03:00  Terminated  Exit Code: 0
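
A side note for reading these IDs: the periodic-1619002800 suffix in the Job ID is the scheduled launch time as a Unix timestamp. A small sketch to decode it (GNU date assumed):

```shell
# Decode the launch time embedded in a periodic child job ID.
job_id="correct_suspended_tickets/periodic-1619002800"
ts="${job_id##*periodic-}"
# GNU date; prints the scheduled launch time in UTC.
date -u -d "@${ts}" +"%Y-%m-%dT%H:%M:%SZ"
```

which lines up with the Started At timestamp above, modulo scheduling delay.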

Nomad Client logs (if appropriate)

In the client logs I can see that GC ran, but the allocation is actually still present:

root@docker-15:/home/user# journalctl -u nomad --since "12:00" | grep 118cf459
Apr 27 17:35:18 docker-15 nomad[2838]:     2021-04-27T17:35:18.624+0300 [INFO]  client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.710+0300 [INFO]  client.gc: garbage collecting allocation: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531 reason="number of allocations (251) is over the limit (200)"
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.716+0300 [INFO]  client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.721+0300 [DEBUG] client.gc: alloc garbage collected: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
root@docker-15:/home/user# date
Tue Apr 27 20:41:41 MSK 2021
root@docker-15:/home/user# nomad alloc status 118cf459
ID                  = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID             = 7ee96439
...

TheSpbra1n avatar Apr 27 '21 17:04 TheSpbra1n

Hi @TheSpbra1n! This one is tricky!

The log message you're seeing is about GC on the client node, where we clean up the allocation directory on the client. The nomad alloc status is querying the server's view of the world. That JobGCThreshold config value controls that server state.
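
(To make that distinction concrete: the client-side GC in those logs is driven by settings in the client stanza, separately from the server's job_gc_threshold; a sketch with illustrative values, not a recommendation:)

```hcl
# Client agent config sketch (values illustrative).
client {
  enabled = true

  # Client-local cleanup of terminal alloc directories. The limit in
  # the "over the limit (200)" log line maps to gc_max_allocs; this
  # cleanup does not remove the allocation from the servers' state.
  gc_interval   = "1m"
  gc_max_allocs = 200
}
```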

But what I'm seeing in your nomad alloc status output is that this periodic job is getting placed every few hours, and that it has Desired Status = run. So it looks like the server expects it to be running, which is why it's not GC'ing it. Can you share a redacted version of the job spec? That might help us narrow down the behavior you're seeing.

tgross avatar Apr 27 '21 18:04 tgross

Thanks for the answer. Here is the spec:

# job name
job "correct_suspended_tickets" {
  region      = "global"
  datacenters = ["dc1", "dc2"]

  type = "batch"

  periodic {
    cron             = "00,30 * * * * *"
    time_zone        = "Europe/Moscow"
    prohibit_overlap = true
  }

  constraint {
    attribute = "${node.class}"
    value     = "nomad1"
  }

  meta {
    # Application version
    APP_VERSION  = "0.0.1"
    START_SCRIPT = "xxx"
  }

  group "correct_suspended_tickets" {
    count = 1

    task "correct_suspended_tickets" {
      kill_signal = "SIGTERM"
      driver = "docker"

      config {
        image        = "docker-registry/container:${NOMAD_META_APP_VERSION}"
        advertise_ipv6_address = true
        network_mode = "bridge"
        command = "${NOMAD_META_START_SCRIPT}"
      }

      resources {
        memory = 512 # MB
      }
    }
  }
}

TheSpbra1n avatar Apr 28 '21 12:04 TheSpbra1n

Thanks for the jobspec @TheSpbra1n. That spec looks fine to me, which leads me back to wondering why Desired Status is still run. The code that does job GC has some special casing around the status for periodic jobs (ref). We did fix a bug in the setting of status for periodic jobs in https://github.com/hashicorp/nomad/pull/10145 and I'm wondering if that could be at play.

If you do a nomad job status correct_suspended_tickets, does the Children Job Summary section have the correct counts? If not, you may have run into the bug described in #10145. That patch will ship in the upcoming Nomad 1.1.0 beta (very soon), and we're planning on backporting it to Nomad 1.0.5 as well.

tgross avatar Apr 28 '21 18:04 tgross

After nomad system gc, I see:

root@nomad-01:/home/user# nomad system gc
root@nomad-01:/home/user# nomad job status correct_suspended_tickets
ID                   = correct_suspended_tickets
Name                 = correct_suspended_tickets
Submit Date          = 2021-04-05T19:17:04+03:00
Type                 = batch
Priority             = 50
Datacenters          = dc1,dc2
Namespace            = default
Status               = running
Periodic             = true
Parameterized        = false
Next Periodic Launch = 2021-04-29T20:30:00+03:00 (4m16s from now)

Children Job Summary
Pending  Running  Dead
4328     0        29485

No instances of periodic job found
root@nomad-01:/home/user#

I think the "Pending" count has a strange value.

TheSpbra1n avatar Apr 29 '21 17:04 TheSpbra1n

Will wait for 1.0.5, thanks!

TheSpbra1n avatar Apr 29 '21 17:04 TheSpbra1n

Thanks for that job status output @TheSpbra1n. That definitely looks suspiciously like the #10145 bug. I'm going to mark this issue for verifying once 1.0.5 ships. Thanks again!

tgross avatar Apr 29 '21 20:04 tgross

Hi @TheSpbra1n! Have you had a chance to see if 1.0.5+ fixes the issue for you?

tgross avatar Jun 03 '21 14:06 tgross

Doing some issue cleanup and I'm going to close this one out. If you do get a chance to test with Nomad 1.0.5+ or later and it doesn't fix the problem, please feel free to reopen.

tgross avatar Jul 07 '21 12:07 tgross

Thanks! We have planned an update, but haven't done it yet.

TheSpbra1n avatar Jul 07 '21 13:07 TheSpbra1n

Hi @tgross! We upgraded Nomad to 1.1.2:

# nomad version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

But the problem remains:

# curl -s  'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
      "JobGCThreshold": "48h",
# nomad status correct_suspended_tickets/periodic-1627318800
ID            = correct_suspended_tickets/periodic-1627318800
Name          = correct_suspended_tickets/periodic-1627318800
Submit Date   = 2021-07-26T20:00:00+03:00
Type          = batch
Priority      = 50
Datacenters   = dc1,dc2
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
correct_suspended_tickets  0       0         0        0       1         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created    Modified
08a5c3d8  32e62217  correct_suspended_tickets  9        run      complete  2d22h ago  1d4h ago

Alloc status:

# nomad alloc status 08a5c3d8
ID                  = 08a5c3d8-f25d-0b56-7b67-03212862552c
Eval ID             = 438f32ce
Name                = correct_suspended_tickets/periodic-1627318800.correct_suspended_tickets[0]
Node ID             = 32e62217
Node Name           = docker-02
Job ID              = correct_suspended_tickets/periodic-1627318800
Job Version         = 9
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 2d22h ago
Modified            = 1d4h ago

Task "correct_suspended_tickets" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  512 MiB  300 MiB

Task Events:
Started At     = 2021-07-26T17:01:16Z
Finished At    = 2021-07-26T17:01:18Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-07-26T20:01:18+03:00  Terminated  Exit Code: 0
2021-07-26T20:01:16+03:00  Started     Task started by client
2021-07-26T20:00:01+03:00  Driver      Downloading image
2021-07-26T20:00:01+03:00  Task Setup  Building Task Directory
2021-07-26T20:00:01+03:00  Received    Task received by client
#

Children job count:

# nomad status correct_suspended_tickets | grep -A3 'Children Job Summary'
Children Job Summary
Pending  Running  Dead
8325     -3       33830
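
A negative Running count looks like the summary counters drifting: decrements applied without matching increments, which is the kind of drift #10145 targeted. Schematically, assuming (as an illustration only) that a completion event gets applied twice:

```shell
# Schematic only: a doubly-applied decrement drives a counter negative.
running=0
running=$((running + 1))  # child job starts
running=$((running - 1))  # completion applied
running=$((running - 1))  # same completion applied again (assumed bug)
echo "$running"           # prints -1
```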

TheSpbra1n avatar Jul 29 '21 15:07 TheSpbra1n

https://github.com/hashicorp/nomad/issues/11017

TheSpbra1n avatar Aug 09 '21 14:08 TheSpbra1n

Re-opening since the problem persists in v1.1.2

lgfa29 avatar Aug 19 '21 17:08 lgfa29

This issue still seems reproducible in v1.4.2.

kemko avatar Nov 28 '22 16:11 kemko

The issue is still present in Nomad v1.9.7. We've been running this version for over 6 months, but the problem only appeared a few weeks ago. The massive number of allocations makes the Nomad UI extremely slow: it fetches ~10 MB allocation responses from the various Nomad clients. I've tried stopping and purging the jobs, running nomad system gc, and restarting the Nomad servers one after another.

Is there a workaround, or something else that can be done to clean up the allocations?
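
The only other thing I can think of trying (untested here so far) is asking the servers to recompute the per-job summary counters, which are what seem to be off throughout this thread:

```shell
# Untested suggestion: recompute job summaries, then trigger a GC pass.
nomad system reconcile summaries
nomad system gc
```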

maxramqvist avatar Nov 13 '25 21:11 maxramqvist

And still there in 1.11.1.

maxramqvist avatar Jan 08 '26 10:01 maxramqvist