S3 objects lifecycle management
Background
Requirements for S3 objects lifecycle management:
- Allow specifying multiple transition rules for a bucket based on file prefix and/or glob pattern
- All available S3 archive storage classes shall be supported (S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive), as well as deletion of objects
- User shall receive notifications on lifecycle events:
  - User shall receive a notification when data is within N days of transition. The notification shall be sent with a configurable delay
  - User shall have an option to delay the data transition using a link from the notification email
  - User shall receive a notification once the data transition is initiated
  - A default template shall be available for such notifications; notification text, subject and delays shall also be configurable per bucket
- User shall be able to restore archived files and folders using the CLI and GUI
Approach
- A new object `LifecyclePolicy` shall be attached to an S3 bucket. It is similar to the AWS `BucketLifecycleConfiguration` object and contains a set of transition rules including filters (prefix, tags), storage class and transition days. This object shall also contain notification settings. E.g. in this object we specify that files with the `StorageType=bcl` tag shall be deleted after 90 days, and files with the `StorageType=fastq` tag shall be moved to S3 Glacier Deep Archive after 180 days and deleted after 5 years.
- When files are uploaded to storage after pipeline execution or from the GUI/CLI, the files shall be automatically tagged (a sketch of such tagging is shown after this list). Tagging preferences shall be defined in a System Preference, e.g. `[{"*.bcl" : "StorageType=bcl"}]`, or as a JSON file for a pipeline. Note that no more than 10 tags are allowed per object. The upload API, including the generate URL API methods, shall support tagging from the System Preference as well.
- Pipe CLI shall provide a command for native object tagging using S3 batch requests
Object Lifecycle Monitor
A new daemon service shall monitor buckets with a `LifecyclePolicy` attached, using the following algorithm:
- Monitoring is done for a configured folder in the bucket. Each folder under this prefix is considered to be a dataset. For the file structure below, there are three datasets (`run1`, `run2`, `run3`) for the prefix `data/`:
```
bucket/
    data/
        run1/
        run2/
        run3/
```
- Each dataset is processed individually. When a new dataset is detected, a new `LifecyclePolicy` shall be created for this dataset from the default template attached to the storage. A dataset without an assigned `LifecyclePolicy` is considered to be new
- Each file in the dataset is checked against the lifecycle policy. A file is considered eligible when:
  - it matches the tags and prefix from the policy
  - it is not in the target storage class yet
  - its age is greater than or equal to the configured `days` value
- TBD: how to check a file? If we use the glob preference used for tagging, it is fast, but we may potentially consider untagged files eligible for transition. Checking the actual tags in S3 is very slow.
- If some of the files are eligible for transition in N days and this matches the notification settings (global or per bucket), the user shall receive a notification with a link to delay the transition. An API method shall be implemented to change the expiration days for a dataset at the user's request
- If some of the files are eligible for transition now, a new AWS `BucketLifecycleConfiguration` shall be created with a matching path/tags filter and expiration days set to 0, to be applied immediately (see the sketch after this list). Check that such a policy doesn't exist yet. TBD: all BCLs in a path will be transitioned at once, even if some files were uploaded earlier/later
- Any existing AWS `BucketLifecycleConfiguration` policy shall be dropped after a configurable number of days
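As a rough illustration of the monitor steps above (dataset discovery and the immediate-transition rule), here is a sketch assuming plain boto3; the rule ID scheme, the `data/` prefix and the tag filter are assumptions, and the deduplication, notification and delay logic is omitted.

```python
# Sketch of two monitor-loop pieces; names and filters are illustrative only.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def list_datasets(bucket, prefix="data/"):
    """Treat each top-level 'folder' under the monitored prefix as a dataset."""
    paginator = s3.get_paginator("list_objects_v2")
    datasets = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for common in page.get("CommonPrefixes", []):
            datasets.append(common["Prefix"])  # e.g. "data/run1/"
    return datasets


def transition_dataset_now(bucket, dataset_prefix, tag_key, tag_value,
                           storage_class="DEEP_ARCHIVE"):
    """Append a lifecycle rule that transitions matching objects with Days=0,
    skipping the update if a rule with the same ID already exists."""
    rule_id = "cp-transition-" + dataset_prefix.strip("/").replace("/", "-")
    try:
        rules = s3.get_bucket_lifecycle_configuration(Bucket=bucket).get("Rules", [])
    except ClientError:
        rules = []  # no lifecycle configuration attached to the bucket yet
    if any(rule.get("ID") == rule_id for rule in rules):
        return  # such a policy already exists
    rules.append({
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"And": {"Prefix": dataset_prefix,
                           "Tags": [{"Key": tag_key, "Value": tag_value}]}},
        "Transitions": [{"Days": 0, "StorageClass": storage_class}],
    })
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules})
```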
Restoring files
API and CLI methods shall be implemented to request object restoring from the archive storage classes and to monitor the restoring process (a hedged boto3 illustration follows this list):
- restore folder
- get restoring status
- delete from glacier
- notification on completion (TBD)
- permanent restoring (change the class back to `Standard`) (TBD)
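For reference, a per-object restore request could be issued as below; this is a hedged illustration with plain boto3 rather than the Cloud Pipeline API, and the retention days and tier are example values.

```python
# Illustrative only: request a temporary restore of one archived object.
import boto3

s3 = boto3.client("s3")


def request_restore(bucket, key, days=30, tier="Standard"):
    """Ask S3 to keep a restored copy of an archived object for `days` days.
    Tier may be 'Expedited', 'Standard' or 'Bulk' (Expedited is not available
    for objects stored in S3 Glacier Deep Archive)."""
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={"Days": days,
                        "GlacierJobParameters": {"Tier": tier}})
```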
Comments on the restoring files implementation:
Main aspects:
- The server only initiates the process of restoring files - a restore action is created with the status `INITIATED`
- `sls` actually starts this process by creating a batch operation job to restore the specified files (AWS cloud) - the status is changed to `RUNNING`
- After the job is created, in the next loops the `sls` service will try to check the restoring status by `head`-ing each involved file and seeing whether the `Restoring` header changed its value to `false` and contains information about the restored date; if so, it will update the restoring status to `SUCCEEDED` and set an appropriate `restoredTill` value (a minimal sketch of this check follows this list)
- If there is already a `RUNNING` restoring process, all related `INITIATED` actions will wait until it is done
- If there are several related `INITIATED` actions, only one (the latest) will be applied, all others will be `CANCELLED`:
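A minimal sketch of the status check mentioned above, assuming plain boto3: `head_object` returns the `x-amz-restore` header as the `Restore` field, which carries `ongoing-request` and, once the restore finishes, an `expiry-date` that could back the `restoredTill` value. The status names simply mirror the action model described above.

```python
# Sketch of the per-file restore status check; statuses mirror the action model
# above and are not an actual sls API.
import boto3

s3 = boto3.client("s3")


def check_restore_status(bucket, key):
    """Return (status, restored_till) derived from the x-amz-restore header."""
    head = s3.head_object(Bucket=bucket, Key=key)
    restore = head.get("Restore")  # e.g. 'ongoing-request="false", expiry-date="..."'
    if restore is None:
        return "INITIATED", None   # no restore visible for this object yet
    if 'ongoing-request="true"' in restore:
        return "RUNNING", None     # restore still in progress
    restored_till = None
    if 'expiry-date="' in restore:
        restored_till = restore.split('expiry-date="', 1)[1].rstrip('"')
    return "SUCCEEDED", restored_till
```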
Example 1 of cancellation process:
We have the following hierarchy of objects:
- /dataset/
  - file1
  - file2
  - file3
A user initiates a restore of the folder /dataset/. Another user then initiates a restore of the file /dataset/file1.
In this case both restores will be applied: first /dataset/ will be restored, and after that /dataset/file1 will be restored.
Example 2 of cancellation process:
The same hierarchy as for Example 1.
But now: a user initiates a restore of the file /dataset/file1, and another user then initiates a restore of the folder /dataset/.
In this case only the restore for /dataset/ will be applied, because /dataset/ includes /dataset/file1 and it is the latest restore action (see the sketch below).
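The two examples boil down to a "latest covering action wins" rule. Below is a small sketch of that rule under the assumption that a restore action is just a path plus a submission order; the `RestoreAction` shape is illustrative, not the actual data model.

```python
# Sketch of the cancellation rule from the examples above; the action model is
# an assumption for illustration only.
from dataclasses import dataclass


@dataclass
class RestoreAction:
    path: str          # e.g. "/dataset/" or "/dataset/file1"
    order: int         # submission order; higher means later
    status: str = "INITIATED"


def resolve_initiated(actions):
    """Cancel INITIATED actions that are covered by a later INITIATED action."""
    for earlier in actions:
        for later in actions:
            if (earlier is not later
                    and earlier.status == later.status == "INITIATED"
                    and later.order > earlier.order
                    and earlier.path.startswith(later.path)):
                earlier.status = "CANCELLED"
    return actions


# Example 2 above: the file restore came first, then the covering folder restore.
actions = [RestoreAction("/dataset/file1", 1), RestoreAction("/dataset/", 2)]
resolve_initiated(actions)
# -> /dataset/file1 becomes CANCELLED; only the /dataset/ restore stays INITIATED.
```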
Docs were added via #2547 and can be found here.