
Clear intermediate data during pipeline execution

Open · brollb opened this issue on Oct 16, 2019 · 3 comments

When executing a pipeline, it would be nice to be able to opt out of storing intermediate data. This is especially important when the user specifies the storage backend to use, as the available space may be limited.

This will have implications for the ability to restart individual jobs in a pipeline.

brollb avatar Oct 16 '19 14:10 brollb

It might be easier to just clear the pipeline data after a successful execution. Essentially, this would find all jobs that are not "Input" nodes, then delete all of their associated data nodes.
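A minimal sketch of that cleanup, assuming a WebGME-style core API (`loadChildren`, `getMetaType`, `getAttribute`, and `deleteNode` exist in WebGME's core); the meta type names used below ("Job", "Input", "Data") are illustrative rather than confirmed against the DeepForge meta-model:

```typescript
// Delete the data nodes of every non-Input job after a successful run.
async function clearPipelineData(core: any, pipeline: any): Promise<void> {
    const typeOf = (node: any): string =>
        core.getAttribute(core.getMetaType(node), 'name');

    const jobs = (await core.loadChildren(pipeline))
        .filter((node: any) => typeOf(node) === 'Job');

    for (const job of jobs) {
        const children = await core.loadChildren(job);

        // Jobs wrapping an "Input" operation hold user-provided data; keep them.
        if (children.some((node: any) => typeOf(node) === 'Input')) {
            continue;
        }

        // Remove the data nodes; deleting the artifact contents from the
        // storage backend would still need a separate, authenticated call.
        children
            .filter((node: any) => typeOf(node) === 'Data')
            .forEach((node: any) => core.deleteNode(node));
    }
}
```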

brollb avatar Oct 23 '19 13:10 brollb

This is a bit more involved:

  • Non-debug pipelines should probably have their data deleted on completion. There is no real need to store intermediate results, since users are not able to edit and re-run portions of the pipeline.
  • Debug pipelines should probably only delete their intermediate data on (see the sketch after this list):
    • restart
    • deletion

Additionally, data should probably be deleted when the associated artifact is deleted. However, this raises another question about how to manage tokens/backend authentication, as we certainly don't want to prompt the user for storage config info on each deletion.
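A sketch of that deletion policy; the `isDebug` flag and the event names are hypothetical stand-ins for however DeepForge ends up modeling executions:

```typescript
type PipelineEvent = 'completion' | 'restart' | 'deletion';

function shouldClearIntermediateData(
    pipeline: { isDebug: boolean },
    event: PipelineEvent,
): boolean {
    if (!pipeline.isDebug) {
        // Non-debug runs cannot be partially re-run, so keep nothing.
        return event === 'completion';
    }
    // Debug runs keep intermediate data so individual jobs can be restarted;
    // only clear it when the execution is restarted or deleted outright.
    return event === 'restart' || event === 'deletion';
}
```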

A better way to handle storage configuration could be to introduce a new concept (maybe "integrations"?) which stores the config info for the associated components (storage, compute, etc.) and is saved per user.
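One possible shape for such a concept; every name here is hypothetical:

```typescript
// Per-user records holding the config (including tokens) for each backend,
// so deletions can authenticate without prompting the user every time.
interface Integration {
    component: 'storage' | 'compute';
    backend: string;                    // e.g. 'sciserver-files', 's3'
    config: Record<string, string>;     // tokens, endpoints, etc.
}

interface UserIntegrations {
    userId: string;
    integrations: Integration[];
}

// Look up stored credentials instead of prompting on each deletion.
function getStorageConfig(
    user: UserIntegrations,
    backend: string,
): Record<string, string> | undefined {
    return user.integrations.find(
        (i) => i.component === 'storage' && i.backend === backend,
    )?.config;
}
```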

brollb avatar Nov 14 '19 23:11 brollb

On a related note (though it probably shouldn't be part of the same issue), it would be worth considering using temporary/scratch storage when the storage adapter supports it (such as "Temporary" for SciServer Files)...
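A sketch of how an adapter might advertise such scratch space; the interface and the `scratchDir` field are assumptions, not the actual DeepForge storage adapter API:

```typescript
// Hypothetical adapter surface; only the scratch-space handling is the point.
interface StorageAdapter {
    putFile(path: string, data: Uint8Array): Promise<string>; // returns an address
    deleteFile(address: string): Promise<void>;
    scratchDir?: string; // e.g. the "Temporary" volume on SciServer Files
}

// Prefer scratch storage for intermediate data when available, so cleanup
// can be delegated to the backend's own expiration policy.
async function storeIntermediate(
    adapter: StorageAdapter,
    name: string,
    data: Uint8Array,
): Promise<string> {
    const dir = adapter.scratchDir ?? 'artifacts';
    return adapter.putFile(`${dir}/${name}`, data);
}
```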

brollb avatar Apr 10 '20 22:04 brollb