cloud-pipeline Approach to dynamically (on the fly) provide system data from runs

Open SilinPavel opened this issue 11 months ago • 0 comments

Background: In some cases it can be very beneficial to have a mechanism for a run to provide data about itself (periodically syncing specific data files from the run instance to a central location).

For example, in Nextflow there is a trace.txt file for each Nextflow run, which can be a source of very helpful information such as task statuses, resource consumption, etc.

For Cloud-Pipeline it would be very helpful to have a unified mechanism for runs to provide such information on the fly.

Let's implement the following approach:

new System Preference launch.run.sync.data:

{
  "syncTimeout": dd, # timeout in seconds, used to configure CP_SYNC_TO_STORAGE_TIMEOUT_SEC
  "data": {
    "<data-type>": {
      "storagePathPrefix": "<path-prefix>" # path prefix under which the data is stored, e.g. storagePathPrefix = "s3://bucket/prefix" -> this data will be stored under the "s3://bucket/prefix/<run-id>" path
    }
  }
}

usage of the sync_to_storage functionality inside a run to sync this data
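
As a rough illustration of how the two pieces above might fit together, here is a minimal Python sketch that parses a launch.run.sync.data value and derives, for a given run, the timeout and the destination path for each data type. The names SyncTarget and resolve_sync_targets, the "nextflow-trace" data type, and the value 60 are hypothetical and used only for this example. The sketch does not call sync_to_storage itself, since its interface is not described here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncTarget:
    """Hypothetical holder for one data type and its per-run destination."""
    data_type: str
    destination: str  # "<storagePathPrefix>/<run-id>"


def resolve_sync_targets(preference_json, run_id):
    """Parse the preference value and return (timeout_sec, targets) for a run.

    Assumption: the preference is stored as plain JSON (no inline comments).
    """
    pref = json.loads(preference_json)
    timeout_sec = int(pref["syncTimeout"])
    targets = []
    for data_type, cfg in pref.get("data", {}).items():
        # Drop a trailing slash so the joined path has exactly one separator.
        prefix = cfg["storagePathPrefix"].rstrip("/")
        targets.append(SyncTarget(data_type, f"{prefix}/{run_id}"))
    return timeout_sec, targets


# Example preference value; the data type name and timeout are made up.
example = json.dumps({
    "syncTimeout": 60,
    "data": {
        "nextflow-trace": {"storagePathPrefix": "s3://bucket/prefix"}
    },
})

timeout_sec, targets = resolve_sync_targets(example, run_id=12345)

# The timeout would then be exposed to the run as an environment variable.
env = {"CP_SYNC_TO_STORAGE_TIMEOUT_SEC": str(timeout_sec)}
```

With this input, the single target for run 12345 resolves to the "s3://bucket/prefix/12345" path, matching the convention described in the preference comment.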

The resulting schema would look like this: [diagram: Untitled Diagram drawio]

Dec 19 '24 12:12 SilinPavel
