The task has "retries": 0 but still retries.

Open jinzhao1994 opened this issue 9 years ago • 6 comments

I added a very simple task for testing. This is the JSON I posted:

{
  "retries": 0,
  "disable": false,
  "command": "cd /var/log/tiger; pwd; hostname; date; exit -1",
  "name": "test1",
  "schedule": "R/2016-03-28T14:58:30.0+08:00/PT2M",
  "description": "description test"
}

I expected it to fail, wait 2 minutes, then run again. In fact, it retries almost once a second.
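To make the expectation concrete, here is a minimal Python sketch that works out when runs should start for the schedule string in the job above. It is not Chronos code: `parse_duration` and `expected_runs` are hypothetical helpers, the duration parser handles only a small subset of ISO 8601, and treating a number after `R` as the total number of runs is an assumption.

```python
import re
from datetime import datetime, timedelta

# Matches a small subset of ISO 8601 durations, e.g. PT2M, PT60S, P1D.
DURATION_RE = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)


def parse_duration(text):
    """Convert a simple ISO 8601 duration into a timedelta."""
    match = DURATION_RE.match(text)
    if match is None or text in ("P", "PT"):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def expected_runs(schedule, limit):
    """Return up to `limit` start times for an 'R[n]/start/period' schedule."""
    repeat, start, period = schedule.split("/")
    # "R" alone repeats forever; "R3" is treated here as three runs (assumption).
    count = int(repeat[1:]) if len(repeat) > 1 else None
    first = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S.%f%z")
    step = parse_duration(period)
    total = limit if count is None else min(limit, count)
    return [first + i * step for i in range(total)]


runs = expected_runs("R/2016-03-28T14:58:30.0+08:00/PT2M", 3)
for run in runs:
    print(run.isoformat())
```

With `"retries": 0`, consecutive start times should be two minutes apart, whether or not the previous run failed.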

However, some of my other tasks, also with retries set to 0, work as I expect.

Mesos version: 0.28.0. Chronos version: 2.4.0.

jinzhao1994 avatar Mar 28 '16 11:03 jinzhao1994

You scheduled the task to run every 2 minutes.

Califax avatar Mar 29 '16 22:03 Califax

@Califax Yes, but the OP reports that the job runs "almost once a second" on failure, even though retries is set to 0.

ddossot avatar Mar 31 '16 15:03 ddossot

I don't know when it happens. I tried to reproduce it this afternoon, but it is working correctly right now.

jinzhao1994 avatar Mar 31 '16 16:03 jinzhao1994

Sorry, I missed the fact that it retries almost once a second on failure. I have not seen this behavior unless the job had a shorter repeat interval than the look-ahead horizon and its retries were spawning multiple copies of the job. Given that it runs every 2 minutes and has 0 retries, this is something else. Let us know if you are able to reproduce it.

Califax avatar Apr 01 '16 15:04 Califax

My QA team has been able to reproduce the issue by doing the following:

  1. Created a Synchronous job with:

    Retry Count: 0
    Repeat interval: 2 mins
    Repeat Count: 3

  2. Set the Start time as 2016-07-06 16:10 America/Los_Angeles

  3. Job started running at scheduled time and got SUCCESS

  4. Immediately disabled (not deleted) the Job (Before the second iteration starts) and changed it so future runs will fail. Then enabled it again.

  5. Job immediately runs again and fails (as expected). But it was supposed to run at 16:12, since the Repeat interval was set to 2 mins. After this, it ran many more times (50), even though the repeat count was just 3 and the retry count was 0.

It seems that at step 5, the job is retried every second until the next horizon minute is reached (step 4 took ~10 seconds to perform).
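As a rough sanity check of that reading, the arithmetic can be sketched in Python. The function name is hypothetical, and both inputs are assumptions taken from the approximate figures above: the job was re-enabled about 10 seconds into the minute, and it was retried once per second until the minute ended.

```python
def runs_until_next_minute(seconds_into_minute, retry_interval_s=1):
    """Count retries that fit between now and the next minute boundary."""
    remaining = 60 - seconds_into_minute
    return remaining // retry_interval_s


# Assumed inputs: re-enabled ~10 s into the minute, one retry per second.
print(runs_until_next_minute(10))  # 50
```

The result is in line with the roughly 50 runs we observed, which is why we suspect the horizon rather than the retry count.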

dandew avatar Jul 07 '16 18:07 dandew

I've not been able to get retries to work as expected. I have "retries": 2, but I either get:

  1. No retries at all on failure
  2. Infinite retries on failure

Here's a job which was seeing infinite retries:

[
  {
    "name": "my-job",
    "command": "/usr/local/deploy/bin/run_job 'do-that-thing'",
    "shell": true,
    "epsilon": "PT60S",
    "executor": "",
    "executorFlags": "",
    "retries": 2,
    "owner": "[email protected]",
    "ownerName": "",
    "description": "",
    "async": false,
    "successCount": 0,
    "errorCount": 20,
    "lastSuccess": "",
    "lastError": "2017-11-07T22:46:09.345Z",
    "cpus": 0.1,
    "disk": 256,
    "mem": 2048,
    "disabled": false,
    "softError": false,
    "dataProcessingJobType": false,
    "errorsSinceLastSuccess": 20,
    "uris": [
      "file:///etc/mesos/.dockercfg"
    ],
    "environmentVariables": [],
    "arguments": [],
    "highPriority": false,
    "runAsUser": "root",
    "container": {
      "type": "docker",
      "image": "quay.io/my-containers/my-container:tag",
      "network": "BRIDGE",
      "volumes": [],
      "forcePullImage": false
    },
    "constraints": [],
    "schedule": "R//P1D",
    "scheduleTimeZone": "America/Los_Angeles"
  }
]

deanmorin avatar Nov 07 '17 23:11 deanmorin