worker.service_sched: processing eval panicked scheduler - please report this as a bug

Open dpotapov opened this issue 11 months ago • 2 comments

Nomad version

Nomad v1.7.5 BuildDate 2024-02-13T15:10:13Z Revision 5f5d4646198d09b8f4f6cb90fb5d50b53fa328b8

Operating system and Environment details

RHEL 9.3

Issue

Evaluations for the job are failing:

{
  "priority": 50,
  "type": "service",
  "triggeredBy": "failed-follow-up",
  "status": "failed",
  "statusDescription": "evaluation reached delivery limit (3)",
  "failedTGAllocs": [],
  "previousEval": "34ab318b-a04d-a62b-48cc-604e265e4573",
  "nextEval": "15f7b8cf-8091-5abf-f3ed-28517af63b7a",
  "blockedEval": null,
  "modifyIndex": 39197810,
  "modifyTime": "2024-03-18T15:41:53.513Z",
  "createIndex": 39197798,
  "createTime": "2024-03-18T15:38:30.948Z",
  "waitUntil": null,
  "namespace": "default",
  "plainJobId": "exec-job",
  "relatedEvals": [
    "15f7b8cf-8091-5abf-f3ed-28517af63b7a",
    "34ab318b-a04d-a62b-48cc-604e265e4573",
    "70e22606-6d2c-b44f-8062-b3a7b5f7ca69",
    "6fa29ce2-9e16-2420-039e-7b5f8a4cd466",
    "9f1bfb6c-4984-9b6d-384e-2defd5f1a574",
    "7d362010-c877-74c9-56fc-b7b842688409",
    "cc10963b-9de1-8d6b-ec1c-eaabf0f3497a",
    "6503ee4a-5b86-d742-4597-cea96f18582e"
  ],
  "job": "[\"exec-job\",\"default\"]",
  "node": null
}

Reproduction steps

Nomad cluster was updated to 1.7.5

Expected Result

Jobs are evaluated and running

Actual Result

Jobs are never started

Job file (if appropriate)

Pretty much any job won't start.

Nomad Server logs (if appropriate)

    2024-03-18T15:25:14.668Z [ERROR] worker.service_sched: processing eval panicked scheduler - please report this as a bug!: eval_id=9f1bfb6c-4984-9b6d-384e-2defd5f1a574 job_id=exec-job namespace=default worker_id=0c9215c7-515a-eb81-7b10-a11f8abda944 eval_id=9f1bfb6c-4984-9b6d-384e-2defd5f1a574 error="runtime error: invalid memory address or nil pointer dereference"
  stack_trace=
  | goroutine 83 [running]:
  | runtime/debug.Stack()
  | \truntime/debug/stack.go:24 +0x5e
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process.func1()
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:153 +0x58
  | panic({0x2a88140?, 0x4f5ea50?})
  | \truntime/panic.go:914 +0x21f
  | github.com/hashicorp/nomad/client/lib/numalib.(*Topology).UsableCores(...)
  | \tgithub.com/hashicorp/nomad/client/lib/numalib/topology.go:258
  | github.com/hashicorp/nomad/nomad/structs.(*NodeResources).Comparable(0xc001108c80)
  | \tgithub.com/hashicorp/nomad/nomad/structs/structs.go:3185 +0xcc
  | github.com/hashicorp/nomad/scheduler.(*Preemptor).SetNode(0xc0029c48f0, 0xc00cc18000)
  | \tgithub.com/hashicorp/nomad/scheduler/preemption.go:139 +0x36
  | github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc00c176a80)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:274 +0x74d
  | github.com/hashicorp/nomad/scheduler.(*JobAntiAffinityIterator).Next(0xc00b367bd0)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:624 +0x6b
  | github.com/hashicorp/nomad/scheduler.(*NodeReschedulingPenaltyIterator).Next(0xc00e4384e0)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:685 +0x28
  | github.com/hashicorp/nomad/scheduler.(*NodeAffinityIterator).Next(0xc00b367c20)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:757 +0x30
  | github.com/hashicorp/nomad/scheduler.(*SpreadIterator).Next(0xc00c176af0)
  | \tgithub.com/hashicorp/nomad/scheduler/spread.go:131 +0x33
  | github.com/hashicorp/nomad/scheduler.(*PreemptionScoringIterator).Next(0xc02e7cace0)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:852 +0x28
  | github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc02e7cad20)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:816 +0x28
  | github.com/hashicorp/nomad/scheduler.(*LimitIterator).nextOption(0xc008a79aa0)
  | \tgithub.com/hashicorp/nomad/scheduler/select.go:63 +0x24
  | github.com/hashicorp/nomad/scheduler.(*LimitIterator).Next(0xc008a79aa0)
  | \tgithub.com/hashicorp/nomad/scheduler/select.go:42 +0x26
  | github.com/hashicorp/nomad/scheduler.(*MaxScoreIterator).Next(0xc00e438570)
  | \tgithub.com/hashicorp/nomad/scheduler/select.go:105 +0x3e
  | github.com/hashicorp/nomad/scheduler.(*GenericStack).Select(0xc0262d92b0, 0xc00c062b40, 0xc0029c5530)
  | \tgithub.com/hashicorp/nomad/scheduler/stack.go:192 +0xe8f
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).selectNextOption(0xc00985c000, 0x38264a0?, 0xc0029c5530)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:898 +0x2d
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computePlacements(0xc00985c000, {0x526ef20, 0x0, 0x0}, {0xc00b5c5740, 0x1, 0x1}, 0x0?)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:602 +0xa47
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc00985c000)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:469 +0x14da
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc00985c000)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:289 +0x49a
  | github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc0029c5d20, 0xc0029c5d10)
  | \tgithub.com/hashicorp/nomad/scheduler/util.go:96 +0x49
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc00985c000, 0xc01c1d7680)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:188 +0x55f
  | github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc008e70ee0, 0xc0110d1e60, 0xc01c1d7680, {0xc02005da10, 0x24})
  | \tgithub.com/hashicorp/nomad/nomad/worker.go:634 +0x353
  | github.com/hashicorp/nomad/nomad.(*Worker).run(0xc008e70ee0, 0x12a05f200)
  | \tgithub.com/hashicorp/nomad/nomad/worker.go:463 +0x5a5
  | created by github.com/hashicorp/nomad/nomad.(*Worker).Start in goroutine 1
  | \tgithub.com/hashicorp/nomad/nomad/worker.go:162 +0x59

Nomad Client logs (if appropriate)

N/A

dpotapov, Mar 18 '24 15:03

@dpotapov what version of Nomad are you upgrading from?

And can you describe more about the runtime environment (like are you running clients in a VM? or what architecture? etc.)

shoenig, Mar 18 '24 15:03

From v1.1.4. Servers and clients are amd64 VMs.

I guess updating the Nomad version on the clients should help...

dpotapov, Mar 18 '24 15:03
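
For readers following the stack trace above: the top frames show `(*NodeResources).Comparable` calling `(*Topology).UsableCores`, which then fails with a nil pointer dereference. The sketch below reproduces that failure pattern in isolation. It is not Nomad's code: the `Topology` and `NodeResources` types, their fields, and the `SafeUsableCores` and `tryUsableCores` helpers are hypothetical stand-ins, and the idea that a node registered by an older client carries no topology data is an assumption based on the guess in this thread, not something the thread confirms.

```go
package main

import "fmt"

// Topology is a hypothetical stand-in for a node's CPU topology.
// It does NOT mirror Nomad's numalib.Topology.
type Topology struct {
	Cores []int
}

// UsableCores reads a field through its receiver, so calling it on a
// nil *Topology panics with a nil pointer dereference.
func (t *Topology) UsableCores() []int {
	return t.Cores
}

// SafeUsableCores is a hypothetical guarded variant: a nil receiver
// yields no cores instead of a panic.
func (t *Topology) SafeUsableCores() []int {
	if t == nil {
		return nil
	}
	return t.Cores
}

// NodeResources is a hypothetical stand-in for per-node resources.
// Assumption: a node that never reported topology leaves this pointer nil.
type NodeResources struct {
	Topology *Topology
}

// tryUsableCores calls the unguarded method and converts a panic into
// an error, loosely like the recover seen in the scheduler's Process frame.
func tryUsableCores(n *NodeResources) (cores []int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return n.Topology.UsableCores(), nil
}

func main() {
	node := &NodeResources{} // Topology was never populated

	// Unguarded call: panics inside UsableCores, recovered as an error.
	_, err := tryUsableCores(node)
	fmt.Println("unguarded:", err)

	// Guarded call: no panic, zero usable cores.
	fmt.Println("guarded core count:", len(node.Topology.SafeUsableCores()))
}
```

A nil check like the one in `SafeUsableCores` only illustrates how such a panic can be avoided. Whether the actual fix belongs in the scheduler, in the node data, or in upgrading the clients is a question for the maintainers.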