
[BUG] Failing Databricks tasks are not retried

Open rambrus opened this issue 2 years ago • 4 comments

Describe the bug

I created a Flyte task that runs on Databricks, with retries set to 3 like this:

@task(task_config=databricks_conf, retries=3)
def db_bad_luck_task():
    raise ValueError("Bad luck, this one failed")

Unfortunately, this task is never retried; only a single attempt is made.

Just to double-check, I created a plain Python task running on K8s with a similar config (note: no Databricks task_config here):

@task(retries=3)
def db_bad_luck_task():
    raise ValueError("Bad luck, this one failed")

In this case, the task is attempted 3 times. (This is the behavior we expect for the Databricks task as well.)

I assume the plugin implementation needs to define which LifeCycleState / ResultState values are retryable. Is that correct?

Expected behavior

Flyte will attempt to run the Databricks task 3 times.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

rambrus avatar Apr 10 '24 14:04 rambrus

Hey @rambrus! 😄 Long time no see in the issue wilderness! How's it going in the coding jungle?

It appears that the issue you're encountering with task retries not working as expected for Databricks tasks in Flyte might be related to how the Databricks plugin or Databricks itself handles job failures, rather than an issue with Flyte's retry mechanism. Flyte's retry functionality is designed to work at a higher level and should automatically retry tasks based on the defined retry policy, without requiring specific handling within the plugin code.

If retries are functioning correctly for Kubernetes tasks but not for Databricks tasks, this suggests a potential discrepancy in how task failures are recognized or handled between the two environments. It's also possible that there's an issue with the version of the Databricks plugin you're using. Checking the failure messages or logs when a task fails on Databricks could provide more insight into why retries aren't being triggered as expected.

For further investigation, it would be beneficial to know the specific version of the Databricks plugin in use, as there may have been updates or fixes in newer versions that address this issue. Additionally, reviewing the TaskMetadata class in the Flyte codebase and the Databricks plugin code could offer more insights into how retries are handled and implemented for Databricks tasks.


dosubot[bot] avatar Apr 10 '24 14:04 dosubot[bot]

@rambrus that's because we always return a non-recoverable failure here.

This is also expected behavior: when you raise a ValueError in a regular task, Flyte won't retry it either.

@task(retries=3)
def t1():
    raise ValueError("Bad luck, this one failed")
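The distinction described here can be sketched in plain Python, without flytekit. This is an illustration only: `RecoverableError` below is a hypothetical stand-in for a retryable failure type (flytekit exposes a similar concept as `FlyteRecoverableException`), and `run_with_retries` is not Flyte's actual retry loop.

```python
class RecoverableError(Exception):
    """Hypothetical stand-in for a retryable failure; name is illustrative."""


def run_with_retries(fn, retries):
    """Call fn, retrying up to `retries` extra times, but only on
    recoverable errors; any other exception fails immediately."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn()
        except RecoverableError:
            if attempts > retries:
                raise  # retry budget exhausted
        except Exception:
            raise  # non-recoverable (e.g. ValueError): no retry
```

Under this model, a function raising ValueError fails on its first attempt regardless of the retry setting, while one raising RecoverableError is attempted up to retries + 1 times.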

pingsutw avatar Apr 11 '24 21:04 pingsutw

I did some investigation on the Databricks side and found that if the Databricks cluster fails to start for any reason, the Get a single job run API returns this state:

...
    "state": {
        "life_cycle_state": "INTERNAL_ERROR",
        "result_state": "FAILED",
        "state_message": "<Error details>",
        "user_cancelled_or_timedout": false
    },
...

In fact, Databricks' built-in orchestration framework, Databricks Workflows, classifies life_cycle_state: INTERNAL_ERROR as a retryable error (see here), so it would probably make sense for Flyte to behave the same way.
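The proposed mapping can be sketched as a small classifier over the job-run state object shown above. This is a minimal illustration, not the plugin's actual code; a real implementation would cover the remaining life-cycle states (PENDING, RUNNING, TERMINATING, SKIPPED, ...) as well.

```python
def classify_run_state(state: dict) -> str:
    """Map a Databricks job-run `state` object to a coarse outcome.

    Sketch only: state names come from the Jobs API response above,
    but the outcome labels here are illustrative.
    """
    lcs = state.get("life_cycle_state")
    if lcs == "INTERNAL_ERROR":
        return "retryable_failure"  # e.g. the cluster failed to start
    if lcs == "TERMINATED":
        if state.get("result_state") == "SUCCESS":
            return "success"
        return "permanent_failure"  # user code raised, etc.
    return "running"  # still in flight
```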

I'm proposing this change in plugins/webapi/databricks/plugin.go:

	case "INTERNAL_ERROR":
		// strconv.Itoa yields "500"; note that string(rune(500)) would
		// produce the Unicode character U+01F4, not the status code.
		return core.PhaseInfoRetryableFailure(strconv.Itoa(http.StatusInternalServerError), message, taskInfo), nil
	}

@pingsutw How does that sound to you?

rambrus avatar Apr 19 '24 08:04 rambrus

https://github.com/flyteorg/flyte/pull/5277

pingsutw avatar Apr 23 '24 16:04 pingsutw