Add support for direct S3 access on SageMaker tasks
Because DistDGL, and by extension GraphStorm, assumes a shared filesystem to function properly, our SageMaker implementations need to "fake" the existence of one through various downloads and uploads, pulling data down to specific local locations on each instance and pushing results back to S3.
This introduces a maintenance burden, as we can't make the same environment assumptions for our SageMaker vs. EC2-with-EFS execution, and it adds a lot of glue code to keep the two systems compatible.
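To make the current pattern concrete, the glue code on each instance roughly amounts to the following. This is only a minimal sketch; the bucket, prefix, and local paths are illustrative and not the actual GraphStorm code:

```python
import os
import boto3


def download_inputs(bucket: str, prefix: str, local_dir: str) -> None:
    """Copy the S3 objects under `prefix` to a local directory, emulating a shared
    filesystem. This is the kind of per-instance download step we'd like to remove."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip "directory" placeholder objects
                continue
            dest = os.path.join(local_dir, os.path.relpath(key, prefix))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(bucket, key, dest)


# Every instance downloads the same graph data before the task starts, and
# uploads its outputs back to S3 when it finishes (illustrative paths):
# download_inputs("my-bucket", "graph-data/input/", "/opt/ml/processing/input/")
```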
Mountpoint for S3 is an AWS project that allows entire S3 buckets to be mounted onto EC2 instances and treated as a (mostly) regular filesystem. If we can use S3 buckets as virtual shared filesystems on SageMaker, we should be able to simplify and align the codebase. We note that the use cases suggested by the mountpoint-s3 project align with ours:
> Mountpoint for Amazon S3 is optimized for applications that need high read throughput to large objects, potentially from many clients at once, and to write new objects sequentially from a single client at a time. This means it's a great fit for applications that use a file interface to:
>
> * read large objects from S3, potentially from many instances concurrently, without downloading them to local storage first
> * access only some S3 objects out of a larger data set, but can't predict which objects in advance
> * upload their output to S3 directly, or upload files from local storage with tools like cp
>
> but probably not the right fit for applications that:
>
> * use file operations that S3 doesn't natively support, like directory renaming or symlinks
> * make edits to existing files (don't work on your Git repository or run vim in Mountpoint 😄)
We propose starting with a POC that modifies our SageMaker images and entry points to use mountpoint-s3, but does not affect the user-facing launch scripts, providing a backwards-compatible solution for our users.
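As a rough sketch of what the entry-point change could look like, assuming the SageMaker image ships the `mount-s3` binary and FUSE is available in the container (the bucket name, mount path, and flags shown are illustrative):

```python
import subprocess
from pathlib import Path
from typing import Optional


def mount_graph_bucket(bucket: str, mount_dir: str, prefix: Optional[str] = None) -> None:
    """Mount an S3 bucket (optionally restricted to a prefix) as a local directory
    using Mountpoint for S3, so downstream code can use plain file paths."""
    Path(mount_dir).mkdir(parents=True, exist_ok=True)
    cmd = ["mount-s3", bucket, mount_dir, "--allow-delete"]
    if prefix:
        cmd += ["--prefix", prefix]
    subprocess.run(cmd, check=True)


# The entry point would mount the bucket before handing off to the existing task
# code, which then reads and writes under `mount_dir` as if it were a shared
# filesystem; user-facing launch scripts stay unchanged:
# mount_graph_bucket("my-graphstorm-bucket", "/mnt/graph-data", prefix="partitioned-graph/")
```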
Our first target will be adding GraphBolt support to SageMaker DistPartition, which is currently not possible because the DistDGL-to-GraphBolt partition conversion assumes that the leader instance has access to the entire distributed graph on disk (see the sketch after the list below). Following that, we can migrate our other SageMaker tasks that normally require a shared filesystem to mountpoint-s3:
- [ ] DistPartition, remove download/upload of data from S3
- [ ] DistTraining
- [ ] DistInference
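For the DistPartition + GraphBolt case, the idea is that the leader instance runs the conversion directly against the mounted bucket instead of first downloading every partition. A minimal sketch, assuming the bucket is already mounted at `/mnt/graph-data` and that a DGL conversion helper such as `dgl.distributed.dgl_partition_to_graphbolt` is available (the exact helper and its signature depend on the DGL version):

```python
import json
import os

import dgl

MOUNT_DIR = "/mnt/graph-data"  # S3 bucket mounted via mount-s3
PART_CONFIG = os.path.join(MOUNT_DIR, "partitioned-graph", "my-graph.json")


def convert_partitions_to_graphbolt(part_config: str) -> None:
    """Run the DistDGL -> GraphBolt conversion on the leader instance.

    Because every partition is visible under the mount, the leader no longer
    needs the per-instance download/upload steps that currently block
    GraphBolt support on SageMaker DistPartition.
    """
    with open(part_config, "r") as f:
        meta = json.load(f)
    print(f"Converting {meta.get('num_parts', '?')} partitions to GraphBolt format")
    # Helper name/signature may differ across DGL versions.
    dgl.distributed.dgl_partition_to_graphbolt(part_config)


# if rank == 0:  # only the leader instance runs the conversion
#     convert_partitions_to_graphbolt(PART_CONFIG)
```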
After some investigation, it seems that using mountpoint-s3 might not be a viable solution, because it requires containers to be launched in a specific way (e.g. with access to the FUSE device) which SageMaker does not support. We will instead look into the other SageMaker file modes, although for GraphBolt we need access to files that are created by the job, not just pre-existing ones:
https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html
EDIT: The file modes available on SageMaker do not allow reading files that are created on S3 during the training/processing job, which makes them hard to use for our purposes. In addition, the streaming file modes create read-only filesystems on the SageMaker containers, which does not allow e.g. DGL to convert DistDGL files to GraphBolt in place.
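For reference, the input mode is chosen per channel when the job is launched. A minimal sketch with the SageMaker Python SDK (the image, role, and S3 paths are placeholders, not our actual launch configuration); FastFile and Pipe channels only expose objects that existed before the job started and are read-only inside the container, which is what rules them out here:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<graphstorm-training-image>",  # placeholder image URI
    role="<execution-role-arn>",              # placeholder IAM role
    instance_count=4,
    instance_type="ml.m5.4xlarge",
)

# FastFile mode streams the pre-existing objects under this prefix on demand,
# but the channel is read-only and does not pick up objects created after the
# job starts, so in-place GraphBolt conversion cannot work through it.
graph_channel = TrainingInput(
    s3_data="s3://my-bucket/partitioned-graph/",
    input_mode="FastFile",
)

# estimator.fit({"graph": graph_channel})
```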