
Approach to dynamically (on the fly) provide system data from runs

Open SilinPavel opened this issue 11 months ago • 0 comments

Background: In some cases it can be very beneficial to have a mechanism for a run to provide data about itself, i.e. to periodically sync specific data files from the run instance to a central location.

For example, in Nextflow there is a trace.txt file for each run, which can be a source of very helpful information such as task statuses, resource consumption, etc.

For Cloud-Pipeline it would be very helpful to have a unified mechanism for runs to provide such information on the fly.

Let's implement the following approach:

  • new System Preference launch.run.sync.data:
{
  "syncTimeout": dd, # timeout in seconds, used to configure CP_SYNC_TO_STORAGE_TIMEOUT_SEC
  "data": {
    "<data-type>": {
      "storagePathPrefix": "<path-prefix>" # path prefix under which the data is stored, e.g. with storagePathPrefix = "s3://bucket/prefix" the data will be stored under the "s3://bucket/prefix/<run-id>" path
    }
  }
}
  • usage of the sync_to_storage functionality inside a run to sync this data
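
To illustrate the path rule from the preference description, here is a minimal Python sketch of how a run could resolve its per-data-type destinations. The helper name resolve_sync_targets, the "nextflow-trace" data type, the timeout value 60 and the run id 12345 are hypothetical and only serve the illustration; how the result is actually passed to sync_to_storage is not specified in this issue and is left out.

```python
import json

# Hypothetical example value of the launch.run.sync.data preference.
# "nextflow-trace" and the timeout of 60 seconds are illustrative only.
PREFERENCE_JSON = """
{
  "syncTimeout": 60,
  "data": {
    "nextflow-trace": {
      "storagePathPrefix": "s3://bucket/prefix"
    }
  }
}
"""


def resolve_sync_targets(preference, run_id):
    """Map each data type to "<storagePathPrefix>/<run-id>"."""
    targets = {}
    for data_type, settings in preference.get("data", {}).items():
        # Drop a trailing slash so the joined path has exactly one separator.
        prefix = settings["storagePathPrefix"].rstrip("/")
        targets[data_type] = f"{prefix}/{run_id}"
    return targets


preference = json.loads(PREFERENCE_JSON)
# Value intended for CP_SYNC_TO_STORAGE_TIMEOUT_SEC, per the preference description.
sync_timeout_sec = preference["syncTimeout"]
targets = resolve_sync_targets(preference, run_id=12345)
print(sync_timeout_sec, targets)
```

Keeping the run id as the last path component means several runs can share one prefix per data type without overwriting each other's files.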

The resulting scheme would look like this: [image: Untitled Diagram drawio]

SilinPavel, Dec 19 '24 12:12