Dagger.jl

Thunk retry options and other supervisors

Open chris-b1 opened this issue 4 years ago • 1 comments

This may or may not make sense at the Dagger level, but for consideration, here is an example that copies the Prefect keywords:

import Dates
import Dagger

function flaky_function()
    # accesses some network resource that could fail
end

res = Dagger.@spawn max_retries=3 retry_delay=Dates.Minute(1) flaky_function()

chris-b1, Dec 22 '21 20:12

This is something we should have somewhere, but not necessarily in Dagger's core, since we may want more "supervisory actions" than just retries and delay-based retry. For example, we might want to trigger a retry based on an active signal (such as an error asynchronously delivered via a library API). Or we might want to retry with backoff, or do a more complicated set of failure recovery steps that depends on the state of multiple thunks.

Instead of building this in directly, this functionality could be implemented with a supervisor thunk which launches and monitors flaky_function:

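function supervisor(f, args...)
    # Handle to the scheduler (not used below)
    h = Dagger.Sch.sch_handle()
    max_attempts = 3
    for i in 1:max_attempts
        try
            # Launch f as its own thunk and wait for its result
            return fetch(Dagger.@spawn f(args...))
        catch
            if i == max_attempts
                # Out of attempts: propagate the last failure
                rethrow()
            else
                @debug "Failed to execute $f on iteration $i, retrying in 1 second..."
                sleep(1)
            end
        end
    end
end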

function flaky_function(x, y, z)
    if rand() < 0.5
        return x + y + z
    else
        error("Transient error")
    end
end

fetch(Dagger.@spawn supervisor(flaky_function, 1, 2, 3))
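
As a rough sketch of the "retry with backoff" variant mentioned above, the same pattern could lean on Julia's built-in `retry` and `ExponentialBackOff` from Base. The name `backoff_supervisor` and its keyword arguments are hypothetical, chosen only to echo the `max_retries` keyword from the original request; nothing here is an existing Dagger API. The example calls `f` directly so it runs without Dagger; inside a supervisor thunk the inner call would presumably be `fetch(Dagger.@spawn f(args...))` instead.

```julia
# Hypothetical helper (not part of Dagger): retry `f(args...)` with exponential backoff.
function backoff_supervisor(f, args...; max_retries::Int=3, first_delay::Real=1.0, factor::Real=2.0)
    # ExponentialBackOff yields `max_retries` delays, so `f` runs at most
    # `max_retries + 1` times. Jitter is disabled to keep delays predictable.
    delays = ExponentialBackOff(n=max_retries, first_delay=first_delay,
                                max_delay=60.0, factor=factor, jitter=0.0)
    # `retry` wraps the closure; once the delays are exhausted, the final
    # attempt runs without a try/catch, so its error propagates to the caller.
    # Assumption: under Dagger, the closure body could be `fetch(Dagger.@spawn f(args...))`.
    return retry(() -> f(args...); delays=delays)()
end

# Usage: a function that fails twice, then succeeds on the third call.
calls = Ref(0)
function sometimes_flaky(x)
    calls[] += 1
    calls[] < 3 && error("Transient error")
    return 2x
end

result = backoff_supervisor(sometimes_flaky, 21; max_retries=3, first_delay=0.01)
```

Using Base's `retry` keeps the supervisor short, and its `check` keyword could be used to retry only on specific error types. More elaborate recovery that depends on the state of several thunks would still need a hand-written loop like the one above.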

We could put such supervisor functions into their own package (which could be a subpackage of this repo), perhaps DaggerSupervisors.jl?

jpsamaroo, Dec 27 '21 16:12