Clear intermediate data during pipeline execution
When executing a pipeline, it would be nice to be able to opt out of storing intermediate data. This is especially the case when specifying the storage backend to use, as the amount of available space may be limited.
This will have implications for the ability to restart individual jobs in a pipeline.
It might be easier to just clear the pipeline data after a successful execution. Essentially, this should find all jobs that are not "Input" nodes and then delete all of their associated data nodes.
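A minimal sketch of that cleanup, assuming a WebGME plugin context where the core API and META nodes are available, that jobs are children of the execution node, and that intermediate results are instances of a `Data` meta node nested under each job (these structural details are assumptions, not a description of the actual deepforge implementation):

```javascript
// Rough sketch only: remove intermediate data nodes after a successful run.
// Assumes `core` is the WebGME core API, `META.Data` is the meta node for
// pipeline data, and non-"Input" jobs hold the data we want to discard.
async function clearIntermediateData(core, META, execNode) {
    const jobs = await core.loadChildren(execNode);

    for (const job of jobs) {
        if (core.getAttribute(job, 'name') === 'Input') {
            continue;  // keep data produced by Input jobs
        }

        // Find every data node nested under the job and delete it
        const nodes = await core.loadSubTree(job);
        nodes
            .filter(node => core.isTypeOf(node, META.Data))
            .forEach(node => core.deleteNode(node));
    }
}
```

Note that this only removes the model nodes; actually freeing the stored copies would presumably have to go through the corresponding storage adapter, which ties into the authentication question below.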
This is a bit more involved:
- non-debug pipelines should probably have their data deleted on completion. There is probably no need to store intermediate results since users are not able to edit and re-run portions of the pipeline.
- debug pipelines should probably only delete their intermediate data on:
  - restart
  - deletion

Additionally, data should probably be deleted when the associated artifact is deleted. However, this raises another question about how to manage tokens/backend authentication, as we certainly don't want to prompt the user for storage config info on each deletion.
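Putting the non-debug/debug behavior above into a small helper, purely for illustration (the function name and event strings are hypothetical, not existing deepforge APIs):

```javascript
// Decide whether an execution's intermediate data should be cleared.
// `isDebug` marks a debug pipeline; `event` is the lifecycle trigger.
function shouldClearIntermediateData(isDebug, event) {
    if (!isDebug) {
        // non-debug runs: nothing can be edited/re-run, so clear on completion
        return event === 'complete';
    }
    // debug runs: keep data for inspecting/re-running individual jobs,
    // clearing only when the execution is restarted or deleted
    return event === 'restart' || event === 'delete';
}
```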
A better way to handle storage configuration could be to introduce a new concept (maybe "integrations"?) which stores the config info for the associated components (storage, compute, etc.) and is saved for the given user.
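For example, an "integration" might be little more than a per-user record keyed by component; the shape below is purely speculative (the backend identifiers and fields are made up for illustration, not an existing deepforge schema):

```javascript
// Hypothetical per-user "integration" record.
const exampleIntegration = {
    owner: 'someUser',
    storage: {
        backend: 'sciserver-files',              // assumed backend id
        config: {token: '<saved token>', volume: 'scratch'}
    },
    compute: {
        backend: 'sciserver-compute',            // assumed backend id
        config: {token: '<saved token>'}
    }
};
```

Deletion hooks (e.g. when an artifact is removed) could then look up the saved config instead of prompting the user each time.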
On a related note (though it probably shouldn't be part of the same issue), it would be worth considering using temporary/scratch storage when supported by the storage adapter (such as "Temporary" for SciServer Files)...