
Task getting killed with OOM error is marked as complete

Open • vikramsg opened this issue 8 months ago • 3 comments

Nomad version

Nomad v1.5.2

Operating system and Environment details

Running on AWS.

Issue

We have various batch jobs running on Nomad, which runs on EC2 instances. We are now connecting Airflow to Nomad, so we don't want Nomad to handle restarts and reschedules. For this to work, we need to know accurately whether a job completed or failed.

This mostly works, but when a task is killed with an OOM error, Nomad marks the job as complete. [Screenshot 2024-06-21 at 16 57 18]

Expected Result

  1. If a job fails because Nomad killed it, it should not be marked as complete.
  2. Alternatively, how do we determine whether it was killed due to OOM?
  3. Also, even though the reschedule and restart blocks have attempts set to 0, Nomad still tries to run the job again:
    reschedule {
      attempts  = 0
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }
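
For point 2, one possible approach is to inspect the task events on the allocation rather than relying on the job status alone. The sketch below is only an illustration, not a confirmed fix: it assumes the allocation JSON (for example from `GET /v1/allocation/:alloc_id`) carries `TaskStates` with per-task `Events`, and that the task driver records an `oom_killed` entry in an event's `Details`. The function name `oom_killed_tasks` is hypothetical, and the exit code 137 check is a heuristic that can also match other SIGKILL cases.

```python
from typing import Any, Dict, List


def oom_killed_tasks(allocation: Dict[str, Any]) -> List[str]:
    """Return the names of tasks whose events suggest an OOM kill.

    `allocation` is the decoded JSON of a single Nomad allocation.
    """
    suspects: List[str] = []
    task_states = allocation.get("TaskStates") or {}
    for task_name, state in task_states.items():
        for event in state.get("Events") or []:
            details = event.get("Details") or {}
            # Assumption: the driver reports OOM kills through an
            # "oom_killed" detail with the string value "true".
            flagged = str(details.get("oom_killed", "")).lower() == "true"
            # Heuristic fallback: 137 = 128 + SIGKILL (9), the usual exit
            # code after the kernel OOM killer ends a process. Other
            # SIGKILLs produce the same code, so treat this as a hint.
            exit_code = str(details.get("exit_code", event.get("ExitCode", "")))
            if flagged or exit_code == "137":
                suspects.append(task_name)
                break  # one matching event per task is enough
    return suspects
```

An Airflow task could call this on the allocations of a finished job and raise an error if the returned list is non-empty, so that an OOM kill counts as a failure even when Nomad reports the job as complete.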

Actual Result

Nomad marks the job as complete and restarts the job.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

vikramsg, Jun 21 '24 15:06