Towards Tracking Server API for auto-resume of previously failed app executions

Open rgareev opened this issue 1 year ago • 0 comments

Is your feature request related to a problem? Please describe.

I'd like to have a helper script for quick prototyping (faster iterations / ad-hoc hypothesis testing) of multi-step Burr FSM-based workflows that contain a few heavy (time- and compute-consuming) steps/actions. It should accelerate the following use case:

1. A developer works on a Burr workflow A->B->C, where B takes, say, 20 minutes.
2. The developer runs a 'main' script with the Burr app inside, waits 20 minutes, and sees a failure on step C.
3. The developer fixes the bug in C and re-runs the script on the same inputs.
4. The developer waits another 20 minutes for B to be recomputed... and then it finally gets to C, and then... it depends 🤷
5. Instead, the developer wants to resume execution from C, since A and B did not change and the inputs did not change.
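
To make the desired behavior concrete, here is a toy sketch in plain Python (no Burr involved). The `run` helper, the in-memory `checkpoints` dict and the step functions are all made up for illustration. It assumes checkpoints are keyed by step name only, so it is valid only while the inputs and the earlier steps stay unchanged, which is exactly the situation described above.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Step = Tuple[str, Callable[[State], State]]


def run(steps: List[Step], state: State, checkpoints: Dict[str, State]) -> State:
    """Run steps in order, saving a copy of the state after each success.

    A step that already has a checkpoint is skipped and its saved state is
    reused. Assumption: inputs and earlier steps did not change.
    """
    for name, fn in steps:
        if name in checkpoints:
            state = dict(checkpoints[name])
            continue
        state = fn(state)
        checkpoints[name] = dict(state)
    return state


calls: List[str] = []  # records which steps actually executed


def step_a(s: State) -> State:
    calls.append("A")
    return {**s, "a": s["x"] + 1}


def step_b(s: State) -> State:  # stands in for the 20-minute step
    calls.append("B")
    return {**s, "b": s["a"] * 10}


def broken_c(s: State) -> State:
    calls.append("C")
    raise RuntimeError("bug in C")


def fixed_c(s: State) -> State:
    calls.append("C")
    return {**s, "c": s["a"] + s["b"]}


checkpoints: Dict[str, State] = {}

# First attempt: A and B succeed and are checkpointed, C fails.
try:
    run([("A", step_a), ("B", step_b), ("C", broken_c)], {"x": 1}, checkpoints)
except RuntimeError:
    pass

# Second attempt with C fixed: A and B are skipped, only C runs.
result = run([("A", step_a), ("B", step_b), ("C", fixed_c)], {"x": 1}, checkpoints)
```

In the real setup the `checkpoints` dict would be replaced by whatever the tracking server has already recorded for the failed run.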

It is very similar to what is shown in this notebook: https://github.com/DAGWorks-Inc/burr/blob/main/examples/multi-modal-chatbot/burr_demo.ipynb (see the usage of initialize_from and with_identifiers), but without the need to manually deal with application_id and sequence_ids.

From one point of view it is kind of a caching problem (one of the 2 oldest, right?), but we have a Burr tracking server that solves this problem 😎 From another point of view, I am not sure that it should be part of Burr "core", since it is more about serving: something like a "high-level" Burr application script dealing with a bunch of Burr-interfaced services such as the tracking server.

So that's why I am looking for the missing tracking server API operations that would make the following possible: a script (kind of a Burr app/graph runner) that:

1. Imports a Burr graph definition from some project module.
2a. Checks for the script flag --no-resume. If it is present, the script just runs the Burr app for the given inputs, entrypoint and halt config, passing them through from the script arguments to the Burr app builder.
2b. If --no-resume is not present (the default), the script connects to a Burr tracking server instance at the URL given in the BURR_TRACKING_SERVER_URL env variable.
3. Takes the Burr project name from the BURR_PROJECT_NAME env variable.
4. Uses the tracking server API to fetch the latest trace for the configured project name and the same inputs.
5. If such a trace is found and it is in a failed state, tries to resume execution, initializing state from that trace using the state right before the failure.
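
A rough sketch of the decision logic in the steps above, to show which operation is actually missing. Everything here is hypothetical: `TraceSummary`, `RunPlan`, `inputs_key`, `plan_run` and the `fetch_latest` callable are placeholders invented for this issue, not existing Burr or tracking server APIs. `fetch_latest` stands in for the requested "latest trace for project + inputs" operation; a real implementation would talk to the server at BURR_TRACKING_SERVER_URL.

```python
import hashlib
import json
import os
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Optional


@dataclass
class TraceSummary:
    # Hypothetical record a tracking server lookup might return; not a Burr type.
    app_id: str
    failed: bool
    last_ok_sequence_id: Optional[int]  # last step that completed before the failure


@dataclass
class RunPlan:
    resume: bool
    app_id: Optional[str] = None
    sequence_id: Optional[int] = None


def inputs_key(inputs: Dict[str, Any]) -> str:
    """Stable fingerprint used to decide whether two runs had 'the same inputs'."""
    payload = json.dumps(inputs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# (project name, inputs key) -> latest trace or None. This is the missing API.
FetchLatest = Callable[[str, str], Optional[TraceSummary]]


def plan_run(
    inputs: Dict[str, Any],
    no_resume: bool,
    fetch_latest: FetchLatest,
    env: Optional[Mapping[str, str]] = None,
) -> RunPlan:
    """Decide between a fresh run and a resume from a previously failed run."""
    env = os.environ if env is None else env
    if no_resume:
        # Step 2a: the flag disables any lookup, just run from scratch.
        return RunPlan(resume=False)
    project = env["BURR_PROJECT_NAME"]  # step 3
    trace = fetch_latest(project, inputs_key(inputs))  # step 4
    if trace is None or not trace.failed or trace.last_ok_sequence_id is None:
        # Nothing to resume from: no trace, a successful trace, or no completed step.
        return RunPlan(resume=False)
    # Step 5: resume from the state recorded right before the failure.
    return RunPlan(resume=True, app_id=trace.app_id, sequence_id=trace.last_ok_sequence_id)
```

The resulting app_id and sequence_id are the values that currently have to be handled manually, as in the linked notebook; with the lookup operation available, the runner script could pass them on to the app builder itself.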

Describe the solution you'd like

A documented API for the Burr tracking server(s), with the minimal set of operations required to make the aforementioned script possible.

Describe alternatives you've considered

TODO

Additional context

None

Nov 28 '24 22:11 rgareev