Approach to dynamically (on the fly) provide system data from runs
Background
In some cases it can be very beneficial to have a mechanism for runs to provide data about themselves (periodically sync specific data files from the run instance to some central location).
For example, in Nextflow there is a trace.txt file for each Nextflow run, which can be a source of very helpful information such as task statuses, resource consumption, etc.
For Cloud-Pipeline it would be very helpful to have a unified mechanism for runs to provide such information on the fly.
Let's implement the following approach:
- a new System Preference `launch.run.sync.data` (a filled-in example is shown after this list):
```
{
  "syncTimeout": dd, # timeout in seconds, used to configure CP_SYNC_TO_STORAGE_TIMEOUT_SEC
  "data": {
    "<data-type>": {
      "storagePathPrefix": <path-prefix> # path prefix to store the data under, e.g. storagePathPrefix = "s3://bucket/prefix" -> this data will be stored under the "s3://bucket/prefix/<run-id>" path
    }
  }
}
```
- usage of the `sync_to_storage` functionality inside a run to sync this data (a sketch of such a sync loop is shown after this list)
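
For illustration, a filled-in value of `launch.run.sync.data` for the Nextflow trace.txt case mentioned above might look as follows; the `nextflow-trace` key, the 60-second timeout and the bucket path are example values only:

```
{
  "syncTimeout": 60,
  "data": {
    "nextflow-trace": {
      "storagePathPrefix": "s3://bucket/prefix"
    }
  }
}
```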
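Below is a minimal sketch of what the in-run sync loop could look like, assuming the timeout is exposed to the run as `CP_SYNC_TO_STORAGE_TIMEOUT_SEC` (as named above) and the transfer itself is delegated to the `pipe storage cp` CLI available on run instances; the `CP_SYNC_TO_STORAGE_PREFIX` variable name, the `RUN_ID` fallback and the 60-second default are assumptions for illustration, not part of the proposal:

```python
import os
import subprocess
import time

# Sync interval, taken from CP_SYNC_TO_STORAGE_TIMEOUT_SEC (named in the proposal above);
# the fallback of 60 seconds is an arbitrary example value.
timeout_sec = int(os.environ.get('CP_SYNC_TO_STORAGE_TIMEOUT_SEC', '60'))

# RUN_ID is assumed to be exposed to the run environment; the prefix variable name
# below is hypothetical -- it stands for the configured storagePathPrefix.
run_id = os.environ.get('RUN_ID', '0')
storage_prefix = os.environ.get('CP_SYNC_TO_STORAGE_PREFIX', 's3://bucket/prefix')

local_file = 'trace.txt'  # e.g. the Nextflow trace file mentioned above
destination = '{}/{}/{}'.format(storage_prefix, run_id, local_file)

while True:
    if os.path.exists(local_file):
        # Push the current snapshot of the file to the central storage location;
        # the actual transfer is delegated to the pipe CLI here.
        subprocess.call(['pipe', 'storage', 'cp', local_file, destination])
    time.sleep(timeout_sec)
```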
The resulting schema would look like the following: