Hamilton UI - Configuration menu for clean up and tracked execution infos
Updated issue:
- A user cannot delete data easily from the Hamilton UI without going to the database manually.
Proposed solution:
- Expose a server endpoint and a UI view to delete projects and runs; expose SDK functionality to do it programmatically.
Alternatives:
- #1232 helps mitigate the problem by enabling more configuration-driven options to determine what is or is not stored.
---- Original Issue: ----
Is your feature request related to a problem? Please describe.
Currently, the Hamilton Tracker/UI saves a lot of information for every dataflow run, such as success info, node parameters (inputs), node results (outputs), execution times, ...
For ETL dataflows that run often and process a lot of data during every run, this results in a lot of data stored in the Postgres/SQLite db (which might be a problem), and the RUNS section in the UI becomes unresponsive.
Personally, I am most interested in the success info, execution times and logs.
Describe the solution you'd like
- Configuration of regular cleanups for the tracker data of each run. Ideally, this can be done individually for every data type (success info, inputs, outputs, logs, ...)
- Configuration of which kind of data should be stored by the tracker.
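To make the request above more concrete, here is a rough sketch of what a per-data-type configuration could look like. Everything in it is hypothetical: `RetentionPolicy`, `TrackerCleanupConfig`, and the data-type names are made up for illustration and are not part of the Hamilton SDK or UI.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Dict, Optional

# Hypothetical data-type names taken from the list above; not Hamilton identifiers.
DATA_TYPES = ("success_info", "inputs", "outputs", "logs")


@dataclass
class RetentionPolicy:
    """Hypothetical per-data-type setting: whether to store it, and for how long."""
    store: bool = True
    keep_for: Optional[timedelta] = None  # None means "never clean up"


@dataclass
class TrackerCleanupConfig:
    """Hypothetical container mapping each data type to its policy."""
    policies: Dict[str, RetentionPolicy] = field(default_factory=dict)

    def policy_for(self, data_type: str) -> RetentionPolicy:
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type}")
        # Data types without an explicit policy are stored and kept forever.
        return self.policies.get(data_type, RetentionPolicy())


config = TrackerCleanupConfig(
    policies={
        "inputs": RetentionPolicy(store=False),                  # do not store at all
        "outputs": RetentionPolicy(keep_for=timedelta(days=7)),  # clean up weekly
        "logs": RetentionPolicy(keep_for=timedelta(days=30)),
    }
)
```

With something like this, success info would fall back to the default (stored, never cleaned up), while inputs would not be stored in the first place.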
Thank you for opening the issue. Following the discussion on Slack, I know @skrawcz reproduced the issue yesterday. We're working on a fix!
Tagging this related issue: #921
CC @legout https://github.com/DAGWorks-Inc/hamilton/pull/1232 is a start. Let me know if you have comments. Should be able to build off of this and add more over time...
(closing for now; feel free to reopen!)
I'm going to reopen and modify the issue to focus on exposing endpoints in the server and UI to delete data.
@skrawcz This slack thread is related
https://hamilton-opensource.slack.com/archives/C03M33QB4M8/p1739280812916689
Yeah -- should be easy enough to delete all db items below a certain date. Extra requirements:
- Make idempotent -- if it fails, we should be able to run again
- (ideally) -- task-based -- likely going to be blocking for now, as we don't have task-based capabilities and adding them is a bit more complex, but will see if Django offers something out of the box
- (ideally) -- track prior calls/cleanups -- again, not for the first release but we can easily have a UI page + an endpoint for prior cleanup jobs
To add -- the NodeRunAttribute object stores most of the data, but you'd want to get rid of DAG runs, node runs, and attributes: https://github.com/DAGWorks-Inc/hamilton/blob/1057a40db370d235fce647d65872221365da6bfa/ui/backend/server/trackingserver_run_tracking/models.py
Maybe even templates...
Should be easiest to have an endpoint, could also have a script that you run.
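For reference, a minimal sketch of the "delete everything older than a date" script idea, written against plain `sqlite3` rather than the Django models. The schema is a simplified assumption (the table and column names `dag_run`, `node_run`, `node_run_attribute`, `dag_run_id`, and `created_at` are hypothetical stand-ins, not the real tables from `models.py`), and `delete_runs_before` is an illustrative name. Child rows are removed before the DAG runs inside one transaction, so a failed call rolls back and a repeated call simply finds nothing left to delete.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical, simplified schema; the real models are in trackingserver_run_tracking/models.py.
SCHEMA = """
CREATE TABLE dag_run (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL);
CREATE TABLE node_run (id INTEGER PRIMARY KEY, dag_run_id INTEGER NOT NULL);
CREATE TABLE node_run_attribute (id INTEGER PRIMARY KEY, dag_run_id INTEGER NOT NULL, payload TEXT);
"""


def delete_runs_before(conn: sqlite3.Connection, cutoff: datetime) -> dict:
    """Delete DAG runs created before `cutoff`, plus their node runs and attributes.

    Safe to call repeatedly: a second call with the same cutoff deletes nothing.
    """
    # Timestamps are stored as UTC ISO-8601 strings, so string comparison orders them correctly.
    cutoff_iso = cutoff.isoformat()
    old_runs = "SELECT id FROM dag_run WHERE created_at < ?"
    deleted = {}
    with conn:  # single transaction: commit on success, roll back on any error
        for table in ("node_run_attribute", "node_run"):  # children first
            cur = conn.execute(
                f"DELETE FROM {table} WHERE dag_run_id IN ({old_runs})", (cutoff_iso,)
            )
            deleted[table] = cur.rowcount
        cur = conn.execute("DELETE FROM dag_run WHERE created_at < ?", (cutoff_iso,))
        deleted["dag_run"] = cur.rowcount
    return deleted


# Small demo with an in-memory database: one old run, one recent run.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
now = datetime(2025, 2, 11, tzinfo=timezone.utc)
conn.execute("INSERT INTO dag_run VALUES (1, ?)", ((now - timedelta(days=60)).isoformat(),))
conn.execute("INSERT INTO dag_run VALUES (2, ?)", ((now - timedelta(days=1)).isoformat(),))
conn.executemany("INSERT INTO node_run VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])
conn.executemany("INSERT INTO node_run_attribute VALUES (?, ?, ?)", [(1, 1, "{}"), (2, 2, "{}")])
conn.commit()

first = delete_runs_before(conn, now - timedelta(days=30))
second = delete_runs_before(conn, now - timedelta(days=30))  # re-run is a no-op
```

An endpoint could wrap the same logic; for large databases the deletes would probably need to be batched so a single call does not block for too long.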