SageMaker inference download locations are misconfigured, and models are downloaded twice
GSF tries to download the models into /opt/ml/gsgnn_model, as seen here https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/python/graphstorm/sagemaker/sagemaker_infer.py#L173
On a job with a large model (learnable embeddings included) we see this in the logs in terms of disk space:
```
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   31G   90G  26% /
tmpfs            64M     0   64M   0% /dev
tmpfs           374G     0  374G   0% /sys/fs/cgroup
/dev/nvme0n1p1   70G   47G   24G  67% /usr/sbin/docker-init
/dev/nvme2n1   1008G  178G  779G  19% /tmp
shm             372G     0  372G   0% /dev/shm
/dev/nvme1n1    120G   31G   90G  26% /etc/hosts
tmpfs           374G     0  374G   0% /proc/acpi
tmpfs           374G     0  374G   0% /sys/firmware
```
The partition mounted under /, which I believe includes /opt, only has 90GB available. To be able to download larger datasets/models we need to use the partition mounted under /tmp.
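One way to make the download location robust is to pick the target directory based on available disk space rather than hard-coding /opt/ml/gsgnn_model. A minimal sketch (the function name and candidate list are my own, not GSF's actual API):

```python
import os
import shutil

def choose_download_dir(required_bytes, candidates=("/tmp", "/opt/ml")):
    """Return the first candidate whose filesystem has enough free space.

    On the inference instance shown above, /tmp is backed by the ~1TB
    NVMe volume while / (including /opt) only has ~90GB free, so /tmp
    is tried first. Hypothetical helper for illustration.
    """
    for path in candidates:
        # Walk up to an existing ancestor so disk_usage doesn't fail
        # when the directory hasn't been created yet.
        probe = path
        while not os.path.exists(probe):
            probe = os.path.dirname(probe) or "/"
        if shutil.disk_usage(probe).free >= required_bytes:
            return path
    raise RuntimeError(f"No candidate directory has {required_bytes} bytes free")
```

With a check like this, a model larger than the root overlay filesystem would land on the /tmp NVMe volume instead of failing with a disk-full error.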
Also, in our inference launch script we define
https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/sagemaker/launch/launch_infer.py#L120
That will download the model data from the provided S3 path into /opt/ml/input/data/<channel_name>, which for models defaults to /opt/ml/input/data/model (see the Estimator docs).
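SageMaker also exposes each input channel's local directory through an SM_CHANNEL_<NAME> environment variable, so the already-materialized model directory can be looked up instead of re-downloaded. A small sketch (the function is hypothetical, not part of GSF):

```python
import os

def model_channel_path(channel_name="model"):
    """Return the local directory SageMaker populated for a channel.

    SageMaker sets SM_CHANNEL_<NAME> for each input channel; the data
    itself lives under /opt/ml/input/data/<channel_name>. Hypothetical
    helper for illustration.
    """
    env_var = f"SM_CHANNEL_{channel_name.upper()}"
    return os.environ.get(env_var, os.path.join("/opt/ml/input/data", channel_name))
```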
But then here, we try to download the model again, this time into /opt/ml/gsgnn_model, so the same model data ends up on disk twice.
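A possible fix for the double download: if SageMaker already materialized the model channel, use that directory directly; otherwise download once into the large /tmp volume. A sketch under those assumptions (`download_model` stands in for GSF's actual S3 download logic and is a hypothetical stub here):

```python
import os

def download_model(s3_uri, dest):
    """Placeholder for GSF's actual S3 download logic (hypothetical)."""
    os.makedirs(dest, exist_ok=True)

def resolve_model_path(s3_model_uri,
                       channel_dir="/opt/ml/input/data/model",
                       fallback_dir="/tmp/gsgnn_model"):
    """Reuse the SageMaker-populated model channel if it exists and is
    non-empty; only download from S3 when it doesn't, and into /tmp
    rather than /opt so large models fit on the NVMe volume."""
    if os.path.isdir(channel_dir) and os.listdir(channel_dir):
        return channel_dir
    download_model(s3_model_uri, fallback_dir)
    return fallback_dir
```

This would make the `model_uri` passed to the Estimator the single source of the model artifacts, instead of downloading them a second time into /opt/ml/gsgnn_model.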