Towards Tracking Server API for auto-resume of previously failed app executions
Is your feature request related to a problem? Please describe.
I'd like to have a helper script for quick prototyping (faster iterations, ad-hoc hypothesis testing) of multi-step Burr FSM-based workflows that contain a few heavy (time- and compute-consuming) steps/actions. It should accelerate the following use case:
- developer works on Burr workflow A->B->C, where B takes, say, 20 minutes.
- developer runs a 'main' script with the Burr app inside, waits 20 minutes, and sees a failure on step C
- developer fixes the bug in C and re-runs the script on the same inputs
- developer waits 20 minutes again for B to be recomputed... and then it finally goes to C, and then... it depends 🤷
- instead, developer wants to resume execution from C, since A and B did not change, and inputs did not change.
It is very similar to what is shown in this notebook: https://github.com/DAGWorks-Inc/burr/blob/main/examples/multi-modal-chatbot/burr_demo.ipynb (see the usage of initialize_from and with_identifiers), but without the need to deal with application_id and sequence_ids manually.
From one viewpoint it is kind of a caching problem (one of the 2 oldest, right?), but we have a Burr tracking server that solves this problem 😎 From another viewpoint, I am not sure that it should be part of Burr "core", since it is more about serving, or like a "high-level" Burr application script dealing with a bunch of Burr-interfaced services like the tracking server.
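To make the caching viewpoint concrete: "same project, same inputs" could be reduced to a single lookup key. This is only a sketch; inputs_fingerprint is a hypothetical helper (not part of Burr), and it assumes the inputs are JSON-serializable.

```python
import hashlib
import json


def inputs_fingerprint(project: str, inputs: dict) -> str:
    """Return a stable hex digest identifying (project, inputs).

    Hypothetical helper, not a Burr API. Assumes `inputs` is JSON-serializable.
    """
    # sort_keys makes the digest independent of dict insertion order.
    payload = json.dumps(
        {"project": project, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs with equal fingerprints would be candidates for resuming one from the other; whether the tracking server should compute such a key itself is an open question.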
So that's why I am looking for missing tracking server API operations to make the following possible: a script (kinda Burr app/graph runner) that
- imports a Burr graph definition from some project module.
  2a. It checks for the script flag --no-resume. If it is present, then it just runs the Burr app for the given inputs, entrypoint and halt config, passing them through from the script arguments to the Burr app builder.
  2b. If --no-resume is not present (the default), then it connects to a Burr tracking server instance whose URL is given in the BURR_TRACKING_SERVER_URL env variable.
- It takes a Burr project name from the BURR_PROJECT_NAME env variable.
- It uses the tracking server API to fetch the latest trace for the configured project name and the same inputs.
- If a trace is found and it is in a failed state, it tries to resume execution, initializing state from this trace using the state right before the failure.
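The decision logic of such a runner could be sketched roughly as below. Everything here is an assumption made for illustration: Trace, Plan, plan_run, the --input KEY=VALUE flag and the fetch_latest callable are hypothetical names, and the actual call to the tracking server is left abstract because that API is exactly what this issue asks for.

```python
import argparse
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class Trace:
    """Hypothetical summary of a tracked run; field names are assumptions."""
    app_id: str
    failed: bool
    last_good_sequence_id: int  # last step that completed before the failure


@dataclass(frozen=True)
class Plan:
    """What the runner decided: start fresh, or resume from (app_id, sequence_id)."""
    resume: bool
    app_id: Optional[str] = None
    sequence_id: Optional[int] = None


# Hypothetical lookup: (server_url, project, inputs) -> latest matching trace or None.
FetchLatest = Callable[[str, str, Mapping[str, str]], Optional[Trace]]


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run a Burr graph, resuming after a failure")
    parser.add_argument("--no-resume", action="store_true", help="always start a fresh run")
    parser.add_argument("--input", action="append", default=[], metavar="KEY=VALUE")
    return parser.parse_args(argv)


def plan_run(argv: Sequence[str], env: Mapping[str, str], fetch_latest: FetchLatest) -> Plan:
    args = parse_args(argv)
    if args.no_resume:
        # Fresh run: the tracking server is not contacted at all.
        return Plan(resume=False)
    # Each --input must look like KEY=VALUE; anything else raises ValueError.
    inputs = dict(item.split("=", 1) for item in args.input)
    trace = fetch_latest(env["BURR_TRACKING_SERVER_URL"], env["BURR_PROJECT_NAME"], inputs)
    if trace is None or not trace.failed:
        # Nothing to resume from: no earlier run, or the latest one did not fail.
        return Plan(resume=False)
    return Plan(resume=True, app_id=trace.app_id, sequence_id=trace.last_good_sequence_id)
```

A resume plan would then be handed to the Burr app builder, presumably through initialize_from and with_identifiers as in the linked notebook, while a non-resume plan just builds and runs the app as usual.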
Describe the solution you'd like
A documented API of the Burr tracking server(s), with the minimal set of operations required to make the aforementioned script happen.
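As a starting point for discussion, the minimal surface might be as small as two read operations. The sketch below is not an existing Burr API: TrackingServerAPI, RunSummary, latest_run, state_before_failure and the status values are all hypothetical, and InMemoryTrackingServer is only a toy stand-in to show the intended semantics.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping, Optional, Protocol, Tuple


@dataclass(frozen=True)
class RunSummary:
    app_id: str
    status: str  # assumed values: "completed", "failed", "running"
    inputs: Mapping[str, Any]


class TrackingServerAPI(Protocol):
    """Hypothetical operations; names and signatures are assumptions."""

    def latest_run(self, project: str, inputs: Mapping[str, Any]) -> Optional[RunSummary]:
        """Most recent run of `project` that was started with exactly `inputs`."""
        ...

    def state_before_failure(self, project: str, app_id: str) -> Optional[Dict[str, Any]]:
        """State after the last successful step of a failed run, else None."""
        ...


class InMemoryTrackingServer:
    """Toy stand-in that satisfies the protocol without a real server."""

    def __init__(self) -> None:
        # project -> list of (run, states after each successful step), oldest first
        self._runs: Dict[str, List[Tuple[RunSummary, List[Dict[str, Any]]]]] = {}

    def record(self, project: str, run: RunSummary, states: List[Dict[str, Any]]) -> None:
        self._runs.setdefault(project, []).append((run, list(states)))

    def latest_run(self, project: str, inputs: Mapping[str, Any]) -> Optional[RunSummary]:
        # Walk newest to oldest and return the first run with matching inputs.
        for run, _ in reversed(self._runs.get(project, [])):
            if dict(run.inputs) == dict(inputs):
                return run
        return None

    def state_before_failure(self, project: str, app_id: str) -> Optional[Dict[str, Any]]:
        for run, states in reversed(self._runs.get(project, [])):
            if run.app_id == app_id and run.status == "failed" and states:
                return dict(states[-1])
        return None
```

Keeping both operations read-only would let the runner script stay outside Burr "core" and talk to the server purely as a client.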
Describe alternatives you've considered
TODO
Additional context
None