SageMaker inference download locations are misconfigured, and models are downloaded twice
GSF tries to download the models into /opt/ml/gsgnn_model, as seen here https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/python/graphstorm/sagemaker/sagemaker_infer.py#L173
On a job with a large model (learnable embeddings included) we see this in the logs in terms of disk space:
```
Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   31G   90G  26% /
tmpfs            64M     0   64M   0% /dev
tmpfs           374G     0  374G   0% /sys/fs/cgroup
/dev/nvme0n1p1   70G   47G   24G  67% /usr/sbin/docker-init
/dev/nvme2n1   1008G  178G  779G  19% /tmp
shm             372G     0  372G   0% /dev/shm
/dev/nvme1n1    120G   31G   90G  26% /etc/hosts
tmpfs           374G     0  374G   0% /proc/acpi
tmpfs           374G     0  374G   0% /sys/firmware
```
The partition mounted under /, which I believe includes /opt, only has 90GB available. To be able to download larger datasets/models we need to use the partition mounted under /tmp.
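One way to make the download location robust is to pick the target directory based on available disk space rather than hard-coding /opt/ml/gsgnn_model. A minimal sketch (the function name and candidate list are my own, not GSF's actual API):

```python
import os
import shutil

def choose_download_dir(required_bytes, candidates=("/tmp", "/opt/ml")):
    """Return the first candidate whose filesystem has enough free space.

    On the inference instance shown above, /tmp is backed by the ~1TB
    NVMe volume while / (including /opt) only has ~90GB free, so /tmp
    is tried first. Hypothetical helper for illustration.
    """
    for path in candidates:
        # Walk up to an existing ancestor so disk_usage doesn't fail
        # when the directory hasn't been created yet.
        probe = path
        while not os.path.exists(probe):
            probe = os.path.dirname(probe) or "/"
        if shutil.disk_usage(probe).free >= required_bytes:
            return path
    raise RuntimeError(f"No candidate directory has {required_bytes} bytes free")
```

With a check like this, a model larger than the root overlay filesystem would land on the /tmp NVMe volume instead of failing with a disk-full error.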
Also, in our inference launch script we define
https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/sagemaker/launch/launch_infer.py#L120
That will download the model data from the provided S3 path into /opt/ml/input/data/<channel_name>, which for models defaults to /opt/ml/input/data/model (see the Estimator docs).
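SageMaker also exposes each input channel's local directory through an SM_CHANNEL_<NAME> environment variable, so the already-materialized model directory can be looked up instead of re-downloaded. A small sketch (the function is hypothetical, not part of GSF):

```python
import os

def model_channel_path(channel_name="model"):
    """Return the local directory SageMaker populated for a channel.

    SageMaker sets SM_CHANNEL_<NAME> for each input channel; the data
    itself lives under /opt/ml/input/data/<channel_name>. Hypothetical
    helper for illustration.
    """
    env_var = f"SM_CHANNEL_{channel_name.upper()}"
    return os.environ.get(env_var, os.path.join("/opt/ml/input/data", channel_name))
```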
But then here, we try to download the model again, this time into /opt/ml/gsgnn_model, so the same model data ends up on disk twice.
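A possible fix for the double download: if SageMaker already materialized the model channel, use that directory directly; otherwise download once into the large /tmp volume. A sketch under those assumptions (`download_model` stands in for GSF's actual S3 download logic and is a hypothetical stub here):

```python
import os

def download_model(s3_uri, dest):
    """Placeholder for GSF's actual S3 download logic (hypothetical)."""
    os.makedirs(dest, exist_ok=True)

def resolve_model_path(s3_model_uri,
                       channel_dir="/opt/ml/input/data/model",
                       fallback_dir="/tmp/gsgnn_model"):
    """Reuse the SageMaker-populated model channel if it exists and is
    non-empty; only download from S3 when it doesn't, and into /tmp
    rather than /opt so large models fit on the NVMe volume."""
    if os.path.isdir(channel_dir) and os.listdir(channel_dir):
        return channel_dir
    download_model(s3_model_uri, fallback_dir)
    return fallback_dir
```

This would make the `model_uri` passed to the Estimator the single source of the model artifacts, instead of downloading them a second time into /opt/ml/gsgnn_model.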