Colossus archive script
Addresses: https://github.com/Joystream/joystream/issues/5188
Overview
The archive mode is an operating mode for storage nodes, which allows them to:
- sync/download all assigned data objects (just like regular storage nodes),
- compress them into 7zip archives,
- upload them to an S3 bucket of choice.
No external API is exposed in archive mode.
Essential Parameters
Filesystem
- `uploadQueueDir`: Directory for storing:
  - fully downloaded data objects ready to be compressed into 7zip archives,
  - 7zip archives at all stages (until they are removed after successful upload),
  - `objects_trackfile` - a file which tracks already downloaded data objects to avoid downloading them again,
  - `archives_trackfile.jsonl` - a file which keeps track of all uploaded archives and the files they contain. This is the only source of this information, so a copy of it is also periodically uploaded to S3 every `--archiveTrackfileBackupFreqMinutes` minutes.
- `tmpDownloadDir`: Temporary directory for in-progress downloads
CLI flags:
--uploadQueueDir=<PATH> # Directory for downloaded objects
--tmpDownloadDir=<PATH> # Directory for temporary downloads
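To give a concrete picture, during operation the upload queue directory ends up holding a mix of the items above. The listing below is purely illustrative - the file names are made up, only the trackfile names and archive suffixes come from this document:

```bash
# Illustrative only - actual object/archive file names may differ
ls /data/uploads
# 1234567                   <- downloaded data object awaiting compression
# some-archive.7z           <- compressed archive waiting to be uploaded to S3
# some-archive.tmp.7z       <- archive still being compressed
# objects_trackfile         <- tracks already-downloaded objects
# archives_trackfile.jsonl  <- tracks uploaded archives and their contents
```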
S3 bucket config
S3 support was based on https://github.com/Joystream/joystream/pull/5175
CLI flags:
--awsS3BucketRegion=<REGION> # AWS S3 bucket region
--awsS3BucketName=<NAME> # AWS S3 bucket name
ENV variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
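For example, the credentials can be exported in the environment before starting the node, while the bucket region and name are passed as flags (the values below are placeholders):

```bash
# Placeholder credentials and bucket - replace with your own values
export AWS_ACCESS_KEY_ID="AKIAXXXXXXXXXXXXXXXX"
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

storage-node archive \
  --worker=123 \
  --uploadQueueDir=/data/uploads \
  --tmpDownloadDir=/data/temp \
  --awsS3BucketRegion=us-east-1 \
  --awsS3BucketName=my-archive-bucket
```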
Upload triggers
There are 3 parameters that control when the compression and upload flow is triggered:
- `--localCountTriggerThreshold` - objects are compressed and uploaded to S3 if the number of objects in the local directory reaches this threshold.
- `--localSizeTriggerThresholdMB` - objects are compressed and uploaded to S3 if the total size of objects in the local directory reaches this threshold.
- `--localAgeTriggerThresholdMinutes` - objects are compressed and uploaded to S3 if the oldest local object was downloaded more than this many minutes ago.
CLI flags:
--localCountTriggerThreshold=<N> # Number of objects trigger
--localSizeTriggerThresholdMB=<MB> # Total size trigger (default: 10000)
--localAgeTriggerThresholdMinutes=<MIN> # Age trigger (default: 1440)
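The three triggers are independent: compression and upload start as soon as any one of the thresholds is reached (see the archive service loop below). As an illustration, a configuration like this one (values are arbitrary) would fire on whichever condition is met first:

```bash
--localCountTriggerThreshold=5000        # 5000 objects accumulated locally, or
--localSizeTriggerThresholdMB=15000      # ~15 GB of local objects, or
--localAgeTriggerThresholdMinutes=1440   # the oldest local object is 24 hours old
```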
Size limits
- `--archiveFileSizeLimitMB` - specifies the desired size limit of the 7z archives. The actual archives may be bigger depending on the size of some data objects, but generally the lower the limit, the more archive files will be created.
- `--uploadQueueDirSizeLimitMB` - specifies the desired limit of the upload directory size. To leave a safe margin of error (for compression etc.), it should be set to ~50% of available disk space.
CLI flags:
--uploadQueueDirSizeLimitMB=<MB> # Upload directory size limit (default: 20000)
--archiveFileSizeLimitMB=<MB> # Max archive size (default: 1000)
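For example, a rough way to apply the ~50% rule of thumb (the path below is a placeholder for the volume that will host your upload queue):

```bash
# Placeholder path - point this at the volume that will host --uploadQueueDir
UPLOAD_QUEUE_DIR=/data/uploads
FREE_MB=$(df --block-size=1M --output=avail "$UPLOAD_QUEUE_DIR" | tail -1)
echo "Suggested --uploadQueueDirSizeLimitMB: $(( FREE_MB / 2 ))"   # ~50% of free space
```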
Performance Tuning
CLI flags:
--uploadWorkersNumber=<N> # Concurrent upload workers (default: 4)
--syncWorkersNumber=<N> # Concurrent download workers (default: 8)
--syncInterval=<MIN> # Minutes between syncs (default: 20)
Usage Example
storage-node archive \
--worker=123 \
--uploadQueueDir=/data/uploads \
--tmpDownloadDir=/data/temp \
--awsS3BucketRegion=us-east-1 \
--awsS3BucketName=my-bucket \
--localSizeTriggerThresholdMB=15000 \
--uploadQueueDirSizeLimitMB=30000
Archive service loop
The loop executed by the archive service consists of the following steps (a conceptual sketch of the loop follows the list):

- Data Integrity Check - verifies `uploadQueueDir` contents and:
  - removes corrupted data (data objects in an undetermined / conflicting state),
  - removes `.tmp.7z` archives (they may be left over if compression failed at some point; in this case the objects will be re-downloaded and compression will be re-attempted later),
  - re-schedules `.7z` archives for upload if not already uploaded,
  - removes already uploaded `.7z` archives if they haven't been removed yet for some reason.
- Sync Stage
  - New data objects are downloaded based on the selected / assigned buckets.
  - The downloads are paused when approaching `--uploadQueueDirSizeLimitMB` to avoid overflowing the disk space (since downloads may be faster than uploads).
  - Compression and uploads can be triggered to happen in parallel to the downloads as soon as one of the 3 trigger thresholds is reached.
  - The stage finishes when all data objects which don't exist in `objects_trackfile` but do exist in the selected buckets are either downloaded or fail to be downloaded (during the first sync this means almost 3,000,000 data objects right now, which may take several days) and there are no other pending tasks (like uploads in progress).
- Final thresholds check stage - checks the upload thresholds one last time, mostly to verify whether `localAgeTriggerThresholdMinutes` has been reached, and if it has, triggers the compression & uploads.
- Idle Stage - waits for the configured `--syncInterval` before the next cycle.
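Conceptually, the cycle looks roughly like the sketch below. The function names and bodies are purely illustrative placeholders, not the actual Colossus implementation:

```bash
#!/usr/bin/env bash
# Conceptual sketch of the archive service loop - placeholders only, not real code.
SYNC_INTERVAL_MINUTES=20   # corresponds to --syncInterval (default: 20)

integrity_check()       { echo "verify uploadQueueDir, clean up corrupted data and stale archives"; }
sync_assigned_buckets() { echo "download missing objects; compress & upload whenever a trigger threshold is hit"; }
final_threshold_check() { echo "trigger compression & upload if the age threshold has been reached"; }

while true; do
  integrity_check
  sync_assigned_buckets      # downloads pause when approaching --uploadQueueDirSizeLimitMB
  final_threshold_check
  sleep $(( SYNC_INTERVAL_MINUTES * 60 ))   # idle stage until the next cycle
done
```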
Logging
It's recommended to use file logging (`debug` level by default) and to set the `COLOSSUS_DEFAULT_LOG_LEVEL=info` env variable.
Configuring with env
Most of the parameters can be provided purely through env variables; this is the config I used for my tests:
COLOSSUS_DEFAULT_LOG_LEVEL=info
DISABLE_BUCKET_AUTH=true
WORKER_ID=17
UPLOAD_QUEUE_DIR=/path/to/upload_dir
TMP_DOWNLOAD_DIR=/path/to/temp_dir
LOG_FILE_PATH=/path/to/logs
LOCAL_COUNT_TRIGGER_THRESHOLD=1000
LOCAL_SIZE_TRIGGER_THRESHOLD_MB=1000
LOCAL_AGE_TRIGGER_THRESHOLD_MINUTES=3600
ARCHIVE_FILE_SIZE_LIMIT_MB=500
UPLOAD_WORKERS_NUMBER=4
SYNC_WORKERS_NUMBER=4
SYNC_INTERVAL_MINUTES=5
ARCHIVE_TRACKFILE_BACKUP_FREQ_MINUTES=5
LOCALSTACK_ENABLED=true
AWS_REGION=us-east-1
AWS_BUCKET_NAME=localstack-archive
AWS_ACCESS_KEY_ID={MY_LOCALSTACK_KEY_ID}
AWS_SECRET_ACCESS_KEY={MY_LOCALSTACK_ACCESS_KEY}
Why compression?
- `PUT` requests to S3 Glacier Deep Archive have a price of $0.05 / 1000 requests (https://aws.amazon.com/s3/pricing/). Currently we have almost 3,000,000 data objects on Joystream mainnet, meaning we'd have to pay $150 just for the requests. By packing ~100 objects per archive, we can reduce this cost ~100x (to $1.5) - see the quick calculation after this list.
- Compression allows us to reduce the size of stored data by a few percent. We have 100 TB of data on Joystream right now. Each saved TB is another $1 / month.
- As the number and total size of objects in the storage system keep growing, the benefits of using compression will become even more pronounced.
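As a quick sanity check on the request-cost figures above (a back-of-the-envelope calculation, not output of the tool):

```bash
# PUT requests cost $0.05 per 1000; ~3,000,000 data objects on mainnet
echo "scale=2; 3000000 / 1000 * 0.05" | bc        # ~150.00 USD when uploading objects individually
echo "scale=2; 3000000 / 100 / 1000 * 0.05" | bc  # ~1.50 USD when packing ~100 objects per archive
```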
There are also a few drawbacks to compression: it's more complicated, and it could raise transfer costs if data objects are fetched often (since we'd need to fetch entire archives). It can also be computationally demanding. I've taken all of those factors into consideration, but I still think the benefits outweigh the costs even now, and this will only become more evident in the future.