S3 objects lifecycle management
Background
Requirements for S3 objects lifecycle management:
- Allow specifying multiple transition rules for a bucket based on file prefix and/or glob pattern
- All available S3 archive storage classes shall be supported (S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive), as well as deletion of objects
- User shall receive notifications on lifecycle events:
  - User shall receive a notification when data is within N days of transition. The notification shall be sent with a configurable delay
  - User shall have an option to delay the data transition using a link from the notification email
  - User shall receive a notification once the data transition is initiated
  - A default template shall be available for such notifications; notification text, subject and delays shall also be configurable per bucket
- User shall be able to restore archived files and folders using the CLI and GUI
Approach
- A new object `LifecyclePolicy` shall be attached to an S3 bucket. It is similar to the AWS `BucketLifecycleConfiguration` object and contains a set of transition rules including filters (prefix, tags), storage class and transition days. This object shall also contain notification settings. E.g. in this object we specify that files with the `StorageType=bcl` tag shall be deleted after 90 days, and files with the `StorageType=fastq` tag shall be moved to S3 Glacier Deep Archive after 180 days and deleted after 5 years.
- When files are uploaded to storage after pipeline execution or from the GUI/CLI, the files shall be automatically tagged (a sketch of such tagging is shown after this list). Tagging preferences shall be defined in a System Preference, e.g. `[{"*.bcl" : "StorageType=bcl"}]`, or as a JSON file for a pipeline. Note that no more than 10 tags are allowed per object. The upload API, including the generate URL API methods, shall support tagging from the System Preference as well.
- Pipe CLI shall provide a command for native object tagging using S3 batch requests
Object Lifecycle Monitor
A new daemon service shall monitor buckets with a `LifecyclePolicy` attached, using the following algorithm:
- Monitoring is done for a configured folder in the bucket. Each folder under this prefix is considered to be a dataset. For the file structure below, there are three datasets (`run1`, `run2`, `run3`) for the prefix `data/`:
```
bucket/
    data/
        run1/
        run2/
        run3/
```
- Each dataset is processed individually. When a new dataset is detected, a new `LifecyclePolicy` shall be created for this dataset from the default template attached to the storage. A dataset without an assigned `LifecyclePolicy` is considered to be new
- Each file in the dataset is checked against the lifecycle policy. A file is considered eligible when:
  - it matches the tags and prefix from the policy
  - it is not in the target storage class yet
  - its age is greater than or equal to the configured `days` value
- TBD: how to check a file? If we use the glob preference used for tagging, it is fast, but we may potentially consider untagged files eligible for transition. Checking the actual tags in S3 is very slow.
- If some of the files are eligible for transition in N days and this matches the notification settings (global or per bucket), the user shall receive a notification with a link to delay the transition. An API method shall be implemented to change the expiration days for a dataset at the user's request
- If some of the files are eligible for transition now, a new AWS `BucketLifecycleConfiguration` shall be created with a matching path/tags filter and expiration days set to 0, to be applied immediately (see the sketch after this list). Check that such a policy doesn't exist yet. TBD: all BCLs in a path will be transitioned at once, even if some files were uploaded earlier/later
- Any existing AWS `BucketLifecycleConfiguration` policy shall be dropped after a configurable number of days
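As a rough illustration of the monitor steps above (dataset discovery and the immediate-transition rule), here is a sketch assuming plain boto3; the rule ID scheme, the `data/` prefix and the tag filter are assumptions, and the deduplication, notification and delay logic is omitted.

```python
# Sketch of two monitor-loop pieces; names and filters are illustrative only.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def list_datasets(bucket, prefix="data/"):
    """Treat each top-level 'folder' under the monitored prefix as a dataset."""
    paginator = s3.get_paginator("list_objects_v2")
    datasets = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for common in page.get("CommonPrefixes", []):
            datasets.append(common["Prefix"])  # e.g. "data/run1/"
    return datasets


def transition_dataset_now(bucket, dataset_prefix, tag_key, tag_value,
                           storage_class="DEEP_ARCHIVE"):
    """Append a lifecycle rule that transitions matching objects with Days=0,
    skipping the update if a rule with the same ID already exists."""
    rule_id = "cp-transition-" + dataset_prefix.strip("/").replace("/", "-")
    try:
        rules = s3.get_bucket_lifecycle_configuration(Bucket=bucket).get("Rules", [])
    except ClientError:
        rules = []  # no lifecycle configuration attached to the bucket yet
    if any(rule.get("ID") == rule_id for rule in rules):
        return  # such a policy already exists
    rules.append({
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"And": {"Prefix": dataset_prefix,
                           "Tags": [{"Key": tag_key, "Value": tag_value}]}},
        "Transitions": [{"Days": 0, "StorageClass": storage_class}],
    })
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules})
```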
Restoring files
API and CLI methods shall be implemented to request object restoring from the archive storage classes and to monitor the restoring process (a hedged boto3 illustration follows this list):
- restore folder
- get restoring status
- delete from glacier
- notification on completion (TBD)
- permanent restoring (change the class back to `Standard`) (TBD)
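For reference, a per-object restore request could be issued as below; this is a hedged illustration with plain boto3 rather than the Cloud Pipeline API, and the retention days and tier are example values.

```python
# Illustrative only: request a temporary restore of one archived object.
import boto3

s3 = boto3.client("s3")


def request_restore(bucket, key, days=30, tier="Standard"):
    """Ask S3 to keep a restored copy of an archived object for `days` days.
    Tier may be 'Expedited', 'Standard' or 'Bulk' (Expedited is not available
    for objects stored in S3 Glacier Deep Archive)."""
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={"Days": days,
                        "GlacierJobParameters": {"Tier": tier}})
```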
Comments on the restoring files implementation:
Main aspects:
- The server only initiates the process of restoring files - a restore action is created with the status `INITIATED`
- `sls` actually starts this process by creating a batch operation job to restore the specified files (AWS cloud) - the status is changed to `RUNNING`
- After the job is created, in the next loops the `sls` service will try to check the restoring status by `head`-ing each involved file and seeing whether the `Restoring` header changed its value to `false` and contains information about the restored date; if so, it will update the restoring status to `SUCCEEDED` and set an appropriate `restoredTill` value (a minimal sketch of this check follows this list)
- If there is already a `RUNNING` restoring process, all related `INITIATED` actions will wait until it is done
- If there are several related `INITIATED` actions, only one (the latest) will be applied, all others will be `CANCELLED`:
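A minimal sketch of the status check mentioned above, assuming plain boto3: `head_object` returns the `x-amz-restore` header as the `Restore` field, which carries `ongoing-request` and, once the restore finishes, an `expiry-date` that could back the `restoredTill` value. The status names simply mirror the action model described above.

```python
# Sketch of the per-file restore status check; statuses mirror the action model
# above and are not an actual sls API.
import boto3

s3 = boto3.client("s3")


def check_restore_status(bucket, key):
    """Return (status, restored_till) derived from the x-amz-restore header."""
    head = s3.head_object(Bucket=bucket, Key=key)
    restore = head.get("Restore")  # e.g. 'ongoing-request="false", expiry-date="..."'
    if restore is None:
        return "INITIATED", None   # no restore visible for this object yet
    if 'ongoing-request="true"' in restore:
        return "RUNNING", None     # restore still in progress
    restored_till = None
    if 'expiry-date="' in restore:
        restored_till = restore.split('expiry-date="', 1)[1].rstrip('"')
    return "SUCCEEDED", restored_till
```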
Example 1 of cancellation process:
We have the following hierarchy of objects:
- /dataset/
  - file1
  - file2
  - file3
A user initiates a restore of the folder /dataset/. Another user then initiates a restore of the file /dataset/file1.
In this case both restores will be applied: first /dataset/ will be restored, and after that /dataset/file1 will be restored.
Example 2 of cancellation process:
The same hierarchy as for Example 1.
But now: a user initiates a restore of the file /dataset/file1, and another user then initiates a restore of the folder /dataset/.
In this case only the restore for /dataset/ will be applied, because /dataset/ includes /dataset/file1 and it is the latest restore action (see the sketch below).
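The two examples boil down to a "latest covering action wins" rule. Below is a small sketch of that rule under the assumption that a restore action is just a path plus a submission order; the `RestoreAction` shape is illustrative, not the actual data model.

```python
# Sketch of the cancellation rule from the examples above; the action model is
# an assumption for illustration only.
from dataclasses import dataclass


@dataclass
class RestoreAction:
    path: str          # e.g. "/dataset/" or "/dataset/file1"
    order: int         # submission order; higher means later
    status: str = "INITIATED"


def resolve_initiated(actions):
    """Cancel INITIATED actions that are covered by a later INITIATED action."""
    for earlier in actions:
        for later in actions:
            if (earlier is not later
                    and earlier.status == later.status == "INITIATED"
                    and later.order > earlier.order
                    and earlier.path.startswith(later.path)):
                earlier.status = "CANCELLED"
    return actions


# Example 2 above: the file restore came first, then the covering folder restore.
actions = [RestoreAction("/dataset/file1", 1), RestoreAction("/dataset/", 2)]
resolve_initiated(actions)
# -> /dataset/file1 becomes CANCELLED; only the /dataset/ restore stays INITIATED.
```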
Docs were added via #2547 and can be found here.