Documenting more S3 transfer utilities
What would you like to see added?
Caveat!
Our understanding is that s5cmd uses MD5 hashes to verify binary content integrity during uploads only, not downloads. For more thorough verification (e.g., of metadata, or using a different hash), another tool is required. A later post in this issue documents how to use `rclone check`.
Notes
- `--stat` shows totals of files transferred, failed, and successful at the end of the job
- `--numworkers=$SLURM_CPUS_ON_NODE` is perfect for a single-node job
- `--endpoint-url=https://s3.lts.rc.uab.edu/` is required for our S3 endpoint
- `mv` will remove the file from the source!
- `cp` is what we want until we've verified the files on the destination
Tests
Tests with 8 cpus and 8 GB memory on c0168:
- 39 files @ 1 GiB each: ~5.1 gbps
- 1000 files @ 10 MiB each: ~0.95 gbps
Tests with 100 cpus and 200 GB memory on c0202 (amd-hdr100):
- 1000 files @ 10 MiB each: ~8.0 gbps
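To reproduce tests like these, a hedged sketch of generating the input files; the scratch path, file counts, and sizes are placeholders matching the scenarios above, and generating 39 GiB from /dev/urandom will itself take a while.

```bash
# Hypothetical test-input generation; adjust the path, counts, and sizes as needed.
mkdir -p /scratch/$USER/s5cmd-test && cd /scratch/$USER/s5cmd-test

# 39 files @ 1 GiB each
for i in $(seq -w 1 39); do
    dd if=/dev/urandom of=large_${i}.bin bs=1M count=1024 status=none
done

# 1000 files @ 10 MiB each
for i in $(seq -w 1 1000); do
    dd if=/dev/urandom of=small_${i}.bin bs=1M count=10 status=none
done
```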
Example
Sample commands to time `s5cmd cp` (in a script):
#!/bin/bash
start_time="$(date -u +%s.%N)"
s5cmd --stat \
--numworkers=$SLURM_CPUS_ON_NODE \
--endpoint-url=https://s3.lts.rc.uab.edu/ \
cp \
SOURCE_PATH \
s3://DESTINATION_PATH/
end_time="$(date -u +%s.%N)"
elapsed="$(bc <<<"$end_time-$start_time")"
echo "Total of $elapsed seconds elapsed for process"
Other thoughts
We don't fully understand the `cp` flag `--concurrency`.
There are also open questions about the Rados Gateway frontend configuration.
- The file with the config settings is `ceph.conf`: https://docs.ceph.com/en/latest/radosgw/config-ref/#ceph-object-gateway-config-reference (a query sketch follows this list)
- Max copy concurrency: https://docs.ceph.com/en/latest/radosgw/config-ref/#confval-rgw_max_copy_obj_concurrent_io
- Max HTTP requests: https://docs.ceph.com/en/latest/radosgw/config-ref/#confval-rgw_max_concurrent_requests
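For whoever administers the gateway, a hedged sketch of checking those values, assuming admin access and that the cluster uses the centralized config database rather than a static ceph.conf; the `client.rgw` target is an assumption about how the gateway daemons are named.

```bash
# Hypothetical admin-side queries; require Ceph admin credentials.
# Returns the configured value, or the built-in default if unset.
ceph config get client.rgw rgw_max_copy_obj_concurrent_io
ceph config get client.rgw rgw_max_concurrent_requests
```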
Single large file parallelization test:
#!/bin/bash
start_time="$(date -u +%s.%N)"
s5cmd --stat \
--numworkers=$SLURM_CPUS_ON_NODE \
--endpoint-url=https://s3.lts.rc.uab.edu/ \
cp \
--concurrency $SLURM_CPUS_ON_NODE \
SOURCE_PATH \
s3://DESTINATION_PATH/
end_time="$(date -u +%s.%N)"
elapsed="$(bc <<<"$end_time-$start_time")"
echo "Total of $elapsed seconds elapsed for process"
Checksum verification with rclone.
1. Configure `rclone` to work with LTS with `rclone config`, creating an `lts` endpoint pointing to `s3.lts.rc.uab.edu`. See: https://docs.rc.uab.edu/data_management/transfer/rclone/#setting-up-an-s3-lts-remote. The name in the docs may be `Ceph` instead of `lts`. (A non-interactive sketch appears at the end of this section.)
2. `mkdir ~/rclone-check-test`
3. `rclone copy lts:site-test ~/rclone-check-test`
4. `rclone check ~/rclone-check-test lts:site-test`
You should see lines like the following after step 4.
2023/06/29 15:34:51 NOTICE: S3 bucket site-test: 0 differences found
2023/06/29 15:34:51 NOTICE: S3 bucket site-test: 2 matching files
Note that the bucket `site-test` is publicly available, containing an example static website. Go to https://s3.lts.rc.uab.edu/site-test/index.html to visit the page. It is possible to mimic this use case using #566
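For reference, a hedged, non-interactive alternative to the `rclone config` walkthrough in step 1; the remote name matches the steps above, but the key variables are placeholders, and note the bash history caveat discussed later in this thread before pasting secrets on a command line.

```bash
# Hypothetical non-interactive remote creation; replace the key variables with your own.
rclone config create lts s3 \
    provider Ceph \
    access_key_id "$AWS_ACCESS_KEY_ID" \
    secret_access_key "$AWS_SECRET_ACCESS_KEY" \
    endpoint https://s3.lts.rc.uab.edu
```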
We should show examples with sync too. That was really easy to use to move a whole tree.
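Something along these lines, as a hedged sketch; the local path and bucket names are placeholders and these exact invocations haven't been timed here.

```bash
# Hypothetical sync examples; paths and bucket names are placeholders.

# s5cmd: copy new or changed files from a local tree to a bucket prefix
s5cmd --endpoint-url=https://s3.lts.rc.uab.edu/ \
    sync ./project-data/ s3://DESTINATION_BUCKET/project-data/

# rclone: make the destination match the source (this deletes extra files on the destination!)
rclone sync ./project-data lts:DESTINATION_BUCKET/project-data
```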
How did you set up credentials? Env vars? ~/.aws/credentials ?
Great question!
For all S3-related activities, I put the credentials in env vars only usable within that session. It can be a bit of a pain but it's more secure than storing them in plaintext. The Secret Access Key should be treated with the same level of security you would give to any other password, because that is its functional purpose.
I also put an env var for the endpoint url for convenience. This means I don't have to use --endpoint-url on every command. Both methods are valid alternatives.
# module load awscli # not nearly as fast as s5cmd
#
# _OR_
#
# module load Anaconda3
# conda activate s5cmd # which you've already created separately
export AWS_ACCESS_KEY_ID=$your_access_key
export AWS_SECRET_ACCESS_KEY=$your_secret_access_key
export AWS_ENDPOINT_URL=https://s3.lts.rc.uab.edu/ # haven't tested this with s5cmd
# do what you need to do here
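As a quick smoke test once the variables are set (hedged; BUCKET_NAME is a placeholder, and the endpoint is passed explicitly since, as noted above, AWS_ENDPOINT_URL hasn't been verified with s5cmd):

```bash
# Hypothetical smoke test; BUCKET_NAME is a placeholder.
s5cmd --endpoint-url "$AWS_ENDPOINT_URL" ls s3://BUCKET_NAME/

# Or with the AWS CLI:
aws s3 ls s3://BUCKET_NAME/ --endpoint-url "$AWS_ENDPOINT_URL"
```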
Docs for s5cmd here: https://github.com/peak/s5cmd#specifying-credentials
Detailed info here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
And here: https://docs.aws.amazon.com/sdkref/latest/guide/feature-ss-endpoints.html
Just to add a comment here: adding your access keys to the shell script actually makes them somewhat less secure than adding them to a credentials file, because shell scripts are saved in the job script archive, and that archive is accessible to everyone in RC. So setting your keys as environment variables would only be more secure for interactive transfers, not batch jobs, and even then they could end up in your bash history. There is probably an answer for this somewhere, but I'm not sure saving them as plain text in a credentials file is much less secure than the other options here.
Great point. I'm not sure what the best option would be here.
Here is one potential option: https://docs.aws.amazon.com/secretsmanager/latest/userguide/security_cli-exposure-risks.html
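In the meantime, one hypothetical pattern (functionally similar to a `~/.aws/credentials` file, per the point above): keep the exports in a file only you can read and source it at runtime, so the keys never appear in the job script archive or in bash history. The filename below is made up.

```bash
# One-time setup: store the exports in a file readable only by you.
cat > ~/.lts-credentials <<'EOF'
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
EOF
chmod 600 ~/.lts-credentials

# In the job script, the keys themselves never appear:
source ~/.lts-credentials
```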
Related bash history configuration (short examples follow these links):
- Configure to ignore commands prefixed with literal space character: https://stackoverflow.com/questions/34753203/avoid-adding-bash-command-to-history
- Suspend history: https://unix.stackexchange.com/a/10923
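Concretely, hedged examples of both approaches:

```bash
# Ignore commands that start with a space (add to ~/.bashrc):
export HISTCONTROL=ignorespace   # or ignoreboth to also drop duplicates

# A leading space then keeps the export out of history:
 export AWS_SECRET_ACCESS_KEY=your_secret_access_key

# Or suspend history for the current session entirely:
set +o history
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
set -o history
```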
Noticed that the currently available `awscli` modules on Cheaha are outdated and do not recognize the environment variable `AWS_ENDPOINT_URL`:
export AWS_ENDPOINT_URL=https://s3.lts.rc.uab.edu/
Installing the latest `awscli` within a conda environment recognized the variable `AWS_ENDPOINT_URL`. This was tested with s5cmd and with Boto3, a Python library used to manage AWS services like S3 (see the Boto3 documentation).
awscli is installable on an individual basis. The module should be removed, and our docs should instead include instructions on how to install it for anyone who needs it.
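A hedged sketch of that per-user install, assuming both packages are available on conda-forge:

```bash
# Hypothetical conda setup; assumes conda-forge availability of both packages.
module load Anaconda3
conda create -n s5cmd -c conda-forge s5cmd awscli
conda activate s5cmd

# Confirm the versions picked up:
s5cmd version
aws --version
```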
I see the conda part now, sorry I should read the whole message before responding :)
The default value of `--numworkers` is 256, so this value must be set manually on all of our machines to avoid the usual issues with too many workers on one node. The recommended value in a Slurm job is `$SLURM_CPUS_ON_NODE` or similar.
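For example (hedged; BUCKET_NAME is a placeholder, and the `nproc` fallback for use outside a job is an assumption, not something tested here):

```bash
# Inside a Slurm job, match workers to the allocated CPUs:
s5cmd --numworkers="$SLURM_CPUS_ON_NODE" --endpoint-url=https://s3.lts.rc.uab.edu/ ls s3://BUCKET_NAME/

# Outside a job (e.g., an interactive session), cap by the CPUs visible to the shell:
s5cmd --numworkers="$(nproc)" --endpoint-url=https://s3.lts.rc.uab.edu/ ls s3://BUCKET_NAME/
```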