Nomad does not clean up old periodic tasks
Nomad version
Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)
Operating system and Environment details
Codename: xenial
Issue
I set JobGCThreshold = 48h in the config:
root@nomad-01:/home/user# curl -s 'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
"JobGCThreshold": "48h"
But in "nomad status" I see a task that ended 6 days ago
root@nomad-01:/home/user# nomad alloc status 118cf459-8aa2-fbf2-6e61-77ed211cb531
ID = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID = 7ee96439
Name = correct_suspended_tickets/periodic-1619002800.correct_suspended_tickets[0]
Node ID = 263fdfd7
Node Name = docker-15
Job ID = correct_suspended_tickets/periodic-1619002800
Job Version = 9
Client Status = complete
Client Description = All tasks have completed
Desired Status = run
Desired Description = <none>
Created = 6d6h ago
Modified = 2h42m ago
Task "correct_suspended_tickets" is "dead"
Task Resources
CPU Memory Disk Addresses
100 MHz 512 MiB 300 MiB
Task Events:
Started At = 2021-04-21T11:01:42Z
Finished At = 2021-04-21T11:01:44Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2021-04-27T17:35:21+03:00 Killing Sent interrupt. Waiting 5s before force killing
2021-04-27T17:35:18+03:00 Received Task received by client
2021-04-26T21:20:15+03:00 Killing Sent interrupt. Waiting 5s before force killing
2021-04-26T21:20:12+03:00 Received Task received by client
2021-04-26T01:04:06+03:00 Killing Sent interrupt. Waiting 5s before force killing
2021-04-26T01:04:04+03:00 Received Task received by client
2021-04-24T08:20:08+03:00 Killing Sent interrupt. Waiting 5s before force killing
2021-04-24T08:20:00+03:00 Received Task received by client
2021-04-22T18:38:15+03:00 Killing Sent interrupt. Waiting 5s before force killing
2021-04-21T14:01:44+03:00 Terminated Exit Code: 0
Nomad Client logs (if appropriate)
In the client logs I see that GC ran, but the allocation is still present:
root@docker-15:/home/user# journalctl -u nomad --since "12:00" | grep 118cf459
Apr 27 17:35:18 docker-15 nomad[2838]: 2021-04-27T17:35:18.624+0300 [INFO] client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]: 2021-04-27T17:35:21.710+0300 [INFO] client.gc: garbage collecting allocation: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531 reason="number of allocations (251) is over the limit (200)"
Apr 27 17:35:21 docker-15 nomad[2838]: 2021-04-27T17:35:21.716+0300 [INFO] client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]: 2021-04-27T17:35:21.721+0300 [DEBUG] client.gc: alloc garbage collected: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
root@docker-15:/home/user# date
Tue Apr 27 20:41:41 MSK 2021
root@docker-15:/home/user# nomad alloc status 118cf459
ID = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID = 7ee96439
...
Hi @TheSpbra1n! This one is tricky!
The log message you're seeing is about GC on the client node, where we clean up the allocation directory on the client. The nomad alloc status is querying the server's view of the world. That JobGCThreshold config value controls that server state.
But what I'm seeing in your nomad alloc status output is that this periodic job is getting placed every few hours, and that it has Desired Status = run. So it looks like the server expects it to be running, which is why it's not GC'ing it. Can you share a redacted version of the job spec? That might help us narrow down the behavior you're seeing.
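For readers hitting the same confusion: the client-side GC in those log lines is tuned by separate client options, not by JobGCThreshold. A sketch (key names per the Nomad client configuration docs; values chosen to match the "over the limit (200)" message in the logs above):

```hcl
client {
  enabled       = true
  # The client garbage-collects terminal allocation directories once the
  # local allocation count exceeds this limit, independently of the
  # server-side job_gc_threshold.
  gc_max_allocs = 200
  # How often the client checks whether allocation GC should run.
  gc_interval   = "1m"
}
```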
Thanks for the answer. The spec:
# job name
job "correct_suspended_tickets" {
  region      = "global"
  datacenters = ["dc1", "dc2"]
  type        = "batch"

  periodic {
    cron             = "00,30 * * * * *"
    time_zone        = "Europe/Moscow"
    prohibit_overlap = true
  }

  constraint {
    attribute = "${node.class}"
    value     = "nomad1"
  }

  meta {
    # Application version
    APP_VERSION  = "0.0.1"
    START_SCRIPT = "xxx"
  }

  group "correct_suspended_tickets" {
    count = 1

    task "correct_suspended_tickets" {
      kill_signal = "SIGTERM"
      driver      = "docker"

      config {
        image                  = "docker-registry/container:${NOMAD_META_APP_VERSION}"
        advertise_ipv6_address = true
        network_mode           = "bridge"
        command                = "${NOMAD_META_START_SCRIPT}"
      }

      resources {
        memory = 512 # MB
      }
    }
  }
}
Thanks for the jobspec @TheSpbra1n. That spec looks fine to me, which leads me back to wondering why the Desired Status = run still. The code that does job GC has some special casing around the status for periodic jobs (ref). We did fix a bug in the setting of status for periodic jobs in https://github.com/hashicorp/nomad/pull/10145 and I'm wondering if that could be at play.
If you do a nomad job status correct_suspended_tickets, does the Children Job Summary section have the correct counts? If not, you may have run into the bug described in #10145. That patch will ship in the upcoming Nomad 1.1.0 beta (very soon), and we're planning on backporting it to Nomad 1.0.5 as well.
After nomad system gc, I see:
root@nomad-01:/home/user# nomad system gc
root@nomad-01:/home/user# nomad job status correct_suspended_tickets
ID = correct_suspended_tickets
Name = correct_suspended_tickets
Submit Date = 2021-04-05T19:17:04+03:00
Type = batch
Priority = 50
Datacenters = dc1,dc2
Namespace = default
Status = running
Periodic = true
Parameterized = false
Next Periodic Launch = 2021-04-29T20:30:00+03:00 (4m16s from now)
Children Job Summary
Pending Running Dead
4328 0 29485
No instances of periodic job found
root@nomad-01:/home/user#
I think the "Pending" count has a strange value.
Will wait for 1.0.5, thanks!
Thanks for that job status output @TheSpbra1n. That definitely looks suspiciously like the #10145 bug. I'm going to mark this issue for verifying once 1.0.5 ships. Thanks again!
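As a side note for readers, a toy model (not Nomad's actual code) of how a transition-counted children summary can drift like the output above: the counters are adjusted per status transition rather than recomputed from the actual child jobs, so a transition that is missed or applied twice never reconciles, and a doubled decrement can even drive a count negative.

```python
# Toy sketch (illustrative only, not Nomad's implementation) of a
# transition-based children-job summary counter.

class ChildrenSummary:
    def __init__(self):
        self.pending = 0
        self.running = 0
        self.dead = 0

    def transition(self, old, new):
        # Decrement the old bucket (if any) and increment the new one.
        if old is not None:
            setattr(self, old, getattr(self, old) - 1)
        setattr(self, new, getattr(self, new) + 1)

s = ChildrenSummary()

# Normal lifecycle of one child job: counts stay consistent.
s.transition(None, "pending")
s.transition("pending", "running")
s.transition("running", "dead")
assert (s.pending, s.running, s.dead) == (0, 0, 1)

# Buggy path: the same "running -> dead" decrement fires twice, so the
# counters drift permanently -- "running" goes negative, like the -3 above.
s.transition(None, "pending")
s.transition("pending", "running")
s.transition("running", "dead")
s.transition("running", "dead")  # duplicate event
print(s.pending, s.running, s.dead)  # prints: 0 -1 3
```

Because nothing ever recomputes the summary from scratch, one bad transition is enough for the drift to persist across restarts and `nomad system gc` runs.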
Hi @TheSpbra1n! Have you had a chance to see if 1.0.5+ fixes the issue for you?
Doing some issue cleanup and I'm going to close this one out. If you do get a chance to test with Nomad 1.0.5+ or later and it doesn't fix the problem, please feel free to reopen.
Thanks! We have an update planned, but haven't done it yet.
Hi @tgross! We upgraded Nomad to 1.1.2:
# nomad version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)
But the problem remains:
# curl -s 'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
"JobGCThreshold": "48h",
# nomad status correct_suspended_tickets/periodic-1627318800
ID = correct_suspended_tickets/periodic-1627318800
Name = correct_suspended_tickets/periodic-1627318800
Submit Date = 2021-07-26T20:00:00+03:00
Type = batch
Priority = 50
Datacenters = dc1,dc2
Namespace = default
Status = dead
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
correct_suspended_tickets 0 0 0 0 1 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
08a5c3d8 32e62217 correct_suspended_tickets 9 run complete 2d22h ago 1d4h ago
Alloc status:
# nomad alloc status 08a5c3d8
ID = 08a5c3d8-f25d-0b56-7b67-03212862552c
Eval ID = 438f32ce
Name = correct_suspended_tickets/periodic-1627318800.correct_suspended_tickets[0]
Node ID = 32e62217
Node Name = docker-02
Job ID = correct_suspended_tickets/periodic-1627318800
Job Version = 9
Client Status = complete
Client Description = All tasks have completed
Desired Status = run
Desired Description = <none>
Created = 2d22h ago
Modified = 1d4h ago
Task "correct_suspended_tickets" is "dead"
Task Resources
CPU Memory Disk Addresses
100 MHz 512 MiB 300 MiB
Task Events:
Started At = 2021-07-26T17:01:16Z
Finished At = 2021-07-26T17:01:18Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2021-07-26T20:01:18+03:00 Terminated Exit Code: 0
2021-07-26T20:01:16+03:00 Started Task started by client
2021-07-26T20:00:01+03:00 Driver Downloading image
2021-07-26T20:00:01+03:00 Task Setup Building Task Directory
2021-07-26T20:00:01+03:00 Received Task received by client
#
Children job count:
# nomad status correct_suspended_tickets | grep -A3 'Children Job Summary'
Children Job Summary
Pending Running Dead
8325 -3 33830
https://github.com/hashicorp/nomad/issues/11017
Re-opening since the problem persists in v1.1.2
This issue still seems reproducible in v1.4.2.
Issue still there in Nomad v1.9.7. We've been running this version for over 6 months but the issue seems to have appeared just a few weeks ago.
The massive number of allocations makes the Nomad UI extremely slow: the UI fetches ~10 MB allocation responses from the different Nomad clients.
I've tried stopping and purging the jobs, running nomad system gc, and restarting the Nomad servers one after another.
Is there a workaround, or anything else that can be done to clean up the allocations?
And still there in 1.11.1.