Documenting more S3 transfer utilities
What would you like to see added?
Caveat!
Our understanding is that s5cmd uses MD5 hashes to verify binary content integrity during uploads only, not downloads. For more thorough verification (e.g., of metadata, or using a different hash), another tool is required. A later post in this issue documents how to use `rclone check`.
Notes
- `--stat` shows totals of files transferred, failed, and successful at the end of the job
- `--numworkers=$SLURM_CPUS_ON_NODE` is perfect for a single-node job
- `--endpoint-url=https://s3.lts.rc.uab.edu/` is required for our S3 endpoint
- `mv` will remove the file from the source!
- `cp` is what we want until we've verified the files on the destination
Tests
Tests with 8 cpus and 8 GB memory on c0168:
- 39 files @ 1 GiB each: ~5.1 gbps
- 1000 files @ 10 MiB each: ~0.95 gbps
Tests with 100 cpus and 200 GB memory on c0202 (amd-hdr100):
- 1000 files @ 10 MiB each: ~8.0 gbps
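To reproduce tests like these, a hedged sketch of generating the input files; the scratch path, file counts, and sizes are placeholders matching the scenarios above, and generating 39 GiB from /dev/urandom will itself take a while.

```bash
# Hypothetical test-input generation; adjust the path, counts, and sizes as needed.
mkdir -p /scratch/$USER/s5cmd-test && cd /scratch/$USER/s5cmd-test

# 39 files @ 1 GiB each
for i in $(seq -w 1 39); do
    dd if=/dev/urandom of=large_${i}.bin bs=1M count=1024 status=none
done

# 1000 files @ 10 MiB each
for i in $(seq -w 1 1000); do
    dd if=/dev/urandom of=small_${i}.bin bs=1M count=10 status=none
done
```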
Example
Sample commands to time `s5cmd cp` (in a script):
#!/bin/bash
start_time="$(date -u +%s.%N)"
s5cmd --stat \
--numworkers=$SLURM_CPUS_ON_NODE \
--endpoint-url=https://s3.lts.rc.uab.edu/ \
cp \
SOURCE_PATH \
s3://DESTINATION_PATH/
end_time="$(date -u +%s.%N)"
elapsed="$(bc <<<"$end_time-$start_time")"
echo "Total of $elapsed seconds elapsed for process"
Other thoughts
We don't fully understand the `cp` flag `--concurrency`.
There are also open questions about the Rados Gateway frontend configuration.
- The file with the config settings is `ceph.conf`: https://docs.ceph.com/en/latest/radosgw/config-ref/#ceph-object-gateway-config-reference (a query sketch follows this list)
- Max copy concurrency: https://docs.ceph.com/en/latest/radosgw/config-ref/#confval-rgw_max_copy_obj_concurrent_io
- Max HTTP requests: https://docs.ceph.com/en/latest/radosgw/config-ref/#confval-rgw_max_concurrent_requests
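For whoever administers the gateway, a hedged sketch of checking those values, assuming admin access and that the cluster uses the centralized config database rather than a static ceph.conf; the `client.rgw` target is an assumption about how the gateway daemons are named.

```bash
# Hypothetical admin-side queries; require Ceph admin credentials.
# Returns the configured value, or the built-in default if unset.
ceph config get client.rgw rgw_max_copy_obj_concurrent_io
ceph config get client.rgw rgw_max_concurrent_requests
```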
Single large file parallelization test:
#!/bin/bash
start_time="$(date -u +%s.%N)"
s5cmd --stat \
--numworkers=$SLURM_CPUS_ON_NODE \
--endpoint-url=https://s3.lts.rc.uab.edu/ \
cp \
--concurrency $SLURM_CPUS_ON_NODE \
SOURCE_PATH \
s3://DESTINATION_PATH/
end_time="$(date -u +%s.%N)"
elapsed="$(bc <<<"$end_time-$start_time")"
echo "Total of $elapsed seconds elapsed for process"
Checksum verification with rclone.
1. Configure `rclone` to work with LTS with `rclone config`, creating an `lts` endpoint pointing to `s3.lts.rc.uab.edu`. See: https://docs.rc.uab.edu/data_management/transfer/rclone/#setting-up-an-s3-lts-remote. The name in the docs may be `Ceph` instead of `lts`. (A non-interactive sketch appears at the end of this section.)
2. `mkdir ~/rclone-check-test`
3. `rclone copy lts:site-test ~/rclone-check-test`
4. `rclone check ~/rclone-check-test lts:site-test`
You should see lines like the following after step 4.
2023/06/29 15:34:51 NOTICE: S3 bucket site-test: 0 differences found
2023/06/29 15:34:51 NOTICE: S3 bucket site-test: 2 matching files
Note that the bucket `site-test` is publicly available, containing an example static website. Go to https://s3.lts.rc.uab.edu/site-test/index.html to visit the page. It is possible to mimic this use case using #566
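For reference, a hedged, non-interactive alternative to the `rclone config` walkthrough in step 1; the remote name matches the steps above, but the key variables are placeholders, and note the bash history caveat discussed later in this thread before pasting secrets on a command line.

```bash
# Hypothetical non-interactive remote creation; replace the key variables with your own.
rclone config create lts s3 \
    provider Ceph \
    access_key_id "$AWS_ACCESS_KEY_ID" \
    secret_access_key "$AWS_SECRET_ACCESS_KEY" \
    endpoint https://s3.lts.rc.uab.edu
```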
We should show examples with sync too. That was really easy to use to move a whole tree.
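Something along these lines, as a hedged sketch; the local path and bucket names are placeholders and these exact invocations haven't been timed here.

```bash
# Hypothetical sync examples; paths and bucket names are placeholders.

# s5cmd: copy new or changed files from a local tree to a bucket prefix
s5cmd --endpoint-url=https://s3.lts.rc.uab.edu/ \
    sync ./project-data/ s3://DESTINATION_BUCKET/project-data/

# rclone: make the destination match the source (this deletes extra files on the destination!)
rclone sync ./project-data lts:DESTINATION_BUCKET/project-data
```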
How did you set up credentials? Env vars? ~/.aws/credentials ?
Great question!
For all S3-related activities, I put the credentials in env vars only usable within that session. It can be a bit of a pain but it's more secure than storing them in plaintext. The Secret Access Key should be treated with the same level of security you would give to any other password, because that is its functional purpose.
I also put an env var for the endpoint url for convenience. This means I don't have to use --endpoint-url on every command. Both methods are valid alternatives.
# module load awscli # not nearly as fast as s5cmd
#
# _OR_
#
# module load Anaconda3
# conda activate s5cmd # which you've already created separately
export AWS_ACCESS_KEY_ID=$your_access_key
export AWS_SECRET_ACCESS_KEY=$your_secret_access_key
export AWS_ENDPOINT_URL=https://s3.lts.rc.uab.edu/ # haven't tested this with s5cmd
# do what you need to do here
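As a quick smoke test once the variables are set (hedged; BUCKET_NAME is a placeholder, and the endpoint is passed explicitly since, as noted above, AWS_ENDPOINT_URL hasn't been verified with s5cmd):

```bash
# Hypothetical smoke test; BUCKET_NAME is a placeholder.
s5cmd --endpoint-url "$AWS_ENDPOINT_URL" ls s3://BUCKET_NAME/

# Or with the AWS CLI:
aws s3 ls s3://BUCKET_NAME/ --endpoint-url "$AWS_ENDPOINT_URL"
```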
Docs for s5cmd here: https://github.com/peak/s5cmd#specifying-credentials
Detailed info here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
And here: https://docs.aws.amazon.com/sdkref/latest/guide/feature-ss-endpoints.html
Just to add a comment here: adding your access keys to the shell script actually makes them somewhat less secure than adding them to a credentials file, because shell scripts are saved in the job script archive, and that archive is accessible to everyone in RC. So setting your keys as environment variables would only be more secure for interactive transfers, not batch jobs, and even then they could end up in your bash history. There is probably an answer for this somewhere, but I'm not sure saving them as plain text in a credentials file is much less secure than the other options here.
Great point. I'm not sure what the best option would be here.
Here is one potential option: https://docs.aws.amazon.com/secretsmanager/latest/userguide/security_cli-exposure-risks.html
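In the meantime, one hypothetical pattern (functionally similar to a `~/.aws/credentials` file, per the point above): keep the exports in a file only you can read and source it at runtime, so the keys never appear in the job script archive or in bash history. The filename below is made up.

```bash
# One-time setup: store the exports in a file readable only by you.
cat > ~/.lts-credentials <<'EOF'
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
EOF
chmod 600 ~/.lts-credentials

# In the job script, the keys themselves never appear:
source ~/.lts-credentials
```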
Related bash history configuration (short examples follow these links):
- Configure to ignore commands prefixed with literal space character: https://stackoverflow.com/questions/34753203/avoid-adding-bash-command-to-history
- Suspend history: https://unix.stackexchange.com/a/10923
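Concretely, hedged examples of both approaches:

```bash
# Ignore commands that start with a space (add to ~/.bashrc):
export HISTCONTROL=ignorespace   # or ignoreboth to also drop duplicates

# A leading space then keeps the export out of history:
 export AWS_SECRET_ACCESS_KEY=your_secret_access_key

# Or suspend history for the current session entirely:
set +o history
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
set -o history
```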
Noticed that the currently available `awscli` modules on Cheaha are outdated and do not recognize the environment variable `AWS_ENDPOINT_URL`:
export AWS_ENDPOINT_URL=https://s3.lts.rc.uab.edu/
Installing the latest `awscli` within a conda environment recognized the variable `AWS_ENDPOINT_URL`. This was tested with s5cmd and with Boto3, a Python library used to manage AWS services like S3 (see the Boto3 documentation).
awscli is installable on an individual basis. The module should be removed, and our docs should instead include instructions on how to install it for anyone who needs it.
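A hedged sketch of that per-user install, assuming both packages are available on conda-forge:

```bash
# Hypothetical conda setup; assumes conda-forge availability of both packages.
module load Anaconda3
conda create -n s5cmd -c conda-forge s5cmd awscli
conda activate s5cmd

# Confirm the versions picked up:
s5cmd version
aws --version
```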
I see the conda part now, sorry I should read the whole message before responding :)
The default value of `--numworkers` is 256, so this value must be set manually on all of our machines to avoid the usual issues with too many workers on one node. The recommended value in a Slurm job is `$SLURM_CPUS_ON_NODE` or similar.
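For example (hedged; BUCKET_NAME is a placeholder, and the `nproc` fallback for use outside a job is an assumption, not something tested here):

```bash
# Inside a Slurm job, match workers to the allocated CPUs:
s5cmd --numworkers="$SLURM_CPUS_ON_NODE" --endpoint-url=https://s3.lts.rc.uab.edu/ ls s3://BUCKET_NAME/

# Outside a job (e.g., an interactive session), cap by the CPUs visible to the shell:
s5cmd --numworkers="$(nproc)" --endpoint-url=https://s3.lts.rc.uab.edu/ ls s3://BUCKET_NAME/
```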